Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Directory Structure

The default directory structure is as follows. This includes output from both denovo and reference methods.

{outdir}
├── denovo
│   ├── alignments
│   │   └── bwamem2
│   │       ├── bam
│   │       ├── index
│   │       ├── intervals
│   │       │   ├── bedops_merge
│   │       │   ├── bedtools_bamtobed
│   │       │   ├── bedtools_coverage
│   │       │   ├── bedtools_intersect
│   │       │   ├── bedtools_makewindows
│   │       │   ├── bedtools_merge
│   │       │   ├── bedtools_sort
│   │       │   └── create_intervals
│   │       ├── samtools_index
│   │       ├── samtools_merge
│   │       ├── samtools_stats
│   │       └── umitools_dedup
│   │           └── stats
│   ├── reference
│   │   ├── cdhit
│   │   ├── cdhit_to_rbdiv
│   │   ├── fastp
│   │   ├── rainbow_div
│   │   ├── rainbow_merge
│   │   ├── samtools
│   │   │   └── index
│   │   ├── seqtk
│   │   ├── unique_sequences
│   │   └── write_fasta
│   ├── trim_fastp
│   └── variant_calling
│       └── intervals
├── fastp
├── fastqc
├── multiqc
│   └── multiqc_data
├── pipeline_info
└── reference
    ├── alignments
    │   └── bwamem2
    │       ├── bam
    │       ├── index
    │       ├── intervals
    │       │   ├── bedops_merge
    │       │   ├── bedtools_bamtobed
    │       │   ├── bedtools_coverage
    │       │   ├── bedtools_intersect
    │       │   ├── bedtools_makewindows
    │       │   ├── bedtools_merge
    │       │   ├── bedtools_sort
    │       │   └── create_intervals
    │       ├── samtools_index
    │       ├── samtools_merge
    │       ├── samtools_stats
    │       └── umitools_dedup
    │           └── stats
    ├── samtools
    │   └── index
    └── variant_calling
        └── intervals

Pre-Processing

Radseq pre-processes reads prior to the alignment step.

FastP

FastP is a tool designed to be an all-in-one preprocessor for FastQ files. You can enable the saving of trimmed fq files in output directory through --save_trimmed=true.

Output files
  • {outdir}/fastp/
    • *.fq.gz: trimmed fq files

Denovo Reference Construction

Radseq supports the construction of psuedoreference using a conglomerate of open source tools. By default the resulting fasta file is only outputed to enable the printing of intermediate files into {outdir}/denovo/reference/ use --denovo_intermediate_files=true.

Prepare forward reads

Prior to clustering forward and reverse reads are joined into one sequence and seperated with a NNNNNNNNNN. Reads are reduced based on number of presences within an individual and among all individuals and combined into one file.

Output files
  • {outdir}/denovo/reference/unique_sequences
    • *.uniq.seqs: all unique sequences in an individual
    • *_uniq.full.fasta: all sequences

Seqtk seq

Seqtk is a tool for processing FASTA or FASTQ file formats.

Output files
  • {outdir}/denovo/reference/seqtk
    • *.seqtk-seq: dummy fasta file

Denovo FastP

Denovo Fastp uses FastP as a tool to trim any unwanted sequences prior to clustering like adapter content.

Output files
  • {outdir}/denovo/reference/fastp
    • *.uniq.fasta: fasta format
    • *.totaluniqseq: all unique sequences remaining after data cutoffs and adapter trimming

Cdhit est

CD-HIT is used for clustering and comparing nucleotide sequences.

Output files
  • {outdir}/denovo/reference/cdhit
    • *_cdhit.logs: log output from cdhit-est
    • *.clstr: clstr output used to convert into rainbow div format

CD-HIT to Rainbow div

This module converts CD-HIT output into input for Rainbow div.

Output files
  • {outdir}/denovo/reference/cdhit_to_rbdiv
    • *.sort.contig.cluster.ids: intermediate file used for conversion of cd-hit to Rainbow input and facilitates reproducibility
    • *.contig.cluster.totaluniqseq: used during assembly
    • *.rcluster: input for Rainbow

Rainbow div

Rainbow div is a tool used to divide heterozygote calls into putative haplotypes based on minimum thresholds passed as arguments.

Output files
  • {outdir}/denovo/reference/rainbow_div
    • *_rbdiv.out: rainbow div output file
    • *_rbdiv.log: log file

Rainbow merge

Rainbow merge is a tool used to merge reads from Rainbow div and assemble into contigs based on minimum and maximum thresholds that are passed as arguments.

Output files
  • {outdir}/denovo/reference/rainbow_merge
    • *_rainbow.fasta: final fasta file used in subsequent processes

Write fasta

This module converts Rainbow merge into fasta format. These files are outputed by default to:

Output files
  • {outdir}/denovo/reference/write_fasta
    • *_rainbow.fasta: denovo fasta file containing contig sequences

Alignment

Indices

enable the saving of reference indices with --save_reference_indices true generate from samtools and bwa for variant calling and short-read alignment respectiviely.

samtools faidx

Output files
  • {outdir}/{method}/reference/{samtools}/index/
    • *.fai: samtools fai index

bwa index

Not Working

Output files
  • {outdir}/{method}/reference/{aligner}/index/
    • *: rainbow merge output file

Aligners

To enable the output of bam files use --save_bam_files=true. Output of bam files from different alignment methods follow the same output structure.

Output files
  • {outdir}/{method}/reference/{aligner}/bam/
    • *.bam: sorted bam files

BWA

BWA is a tool for mapping sequencies with low divergence against a reference genome. Aligned reads are then potentially filtered and coordinate-sorted using samtools.

BWA-mem2

BWA-mem2 is a tool next version of bwa-mem for mapping sequencies with low divergence against a reference genome with increased processing speed (~1.3-3.1x). Aligned reads are then potentially filtered and coordinate-sorted using samtools.

UMI-tools dedup

UMI-tools dedup is a tool for deduplicating reads and ensuring only a single representative read is retained in the bam file. Files are outputed using --save_bam_files=true as a result of many individual bed files being passed through.

Output files
  • {outdir}/{method}/alignments/{aligner}/umitools_dedup
    • *.bam: bam file
    • *.tsv: output statistics file

FreeBayes Interval Construction

radseq supports multithreading of FreeBayes through the bam_intervals_bedtools.nf subworkflow that uses a collection of BEDtools software and BEDOPS merge.

To enable the saving of files generated for interval construction in the output-folder set --save_bed_intervals=true.

For large datasets, it may be useful to randomly subsample the number of individuals going into bam_intervals_bedtools.nf subworkflow using --subset_intervals_channel=<integer>. Particularly at the BEDtools merge module the memory footprint can be large.

BEDtools bamtobed

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_bamtobed
    • *.bed: individual bed file

BEDOPS merge

Output files
  • {outdir}/{method}/alignments/{aligner}/bedops_merge
    • *.bed: merged single bed file

BEDtools sort

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_sort
    • *.bed: single sorted bed file

BEDtools coverage

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_coverage
    • *.cov: Individual bed files with fourth column describing number of reads in a particular region

BEDtools merge cov

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_merge
    • *.cov: single bed file with fourth column describing number of reads in a particular region

BEDtools makewindows

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_makewindows
    • *_cov.low.stats: regions less than max_read_coverage_to_split
    • *_cov.high.stats: regions equal to or greater than --max_read_coverage_to_split
    • *.tab: regions from *_cov.high.stats split to half their read length

BEDtools intersect

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_intersect
    • *.bed: Final single bed file to be subsequently split

Create intervals

Output files
  • {outdir}/{method}/alignments/{aligner}/bedtools_intersect
    • mapped.*.bed: Split bed files passed into FreeBayes for multithreading

Variant Calling

FreeBayes

FreeBayes is a bayesian genetic variant detector capable of genotyping SNPs, indels, MNPs, and complex events smaller than the length of short-read sequencing alignment.

By default radseq will output a single vcf joined from independent runs. To enable the outputting of VCF files based on regions passed to FreeBayes use --save_freebayes_intervals=true.

Output files
  • {outdir}/{method}/variant_calling/

    • *.vcf.gz: Final VCF
  • {outdir}/{method}/variant_calling/intervals

    • *_*.vcf.gz: unsorted VCF intervals
    • *_sort_*.vcf.gz: sorted VCF intervals
    • *_sort_*.vcf.gz.tbi: sorted VCF intervals tabix index

Quality Control and Visualization

bcftools stats

bcftools stats is a tool for collecting statistics from a VCF or BCF file that can be interpretted by MultiQC.

Output files
  • {outdir}/{method}/variant_calling/
    • *_stats.txt:

FastP

FastP is a tool designed to be an all-in-one preprocessor for FastQ files. Statistics are passed to MultiQC.

Output files
  • {outdir}/fastp/
    • *_fastp.html: Fastp report containing quality metrics
    • *_fastp.log: Log output containing statistics

samtools

radseq supports generation of statistics from alignment files with samtools stats, samtools flagstat, samtools idxstats.

Output files
  • {outdir}/{method}/alignments/{aligner}/samtools_stats
    • *.flagstat:
    • *.stats:
    • *.idxstats:

FastQC

Output files
  • fastqc/
    • *_fastqc.html: FastQC report containing quality metrics.
    • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

MultiQC - FastQC sequence counts plot

MultiQC - FastQC mean quality scores plot

MultiQC - FastQC adapter content plot

NB: The FastQC plots displayed in the MultiQC report shows untrimmed reads. They may contain adapter sequence and potentially regions with low quality.

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameter’s are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.