nf-core/viralrecon

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. Please click here to see an example MultiQC report generated using the parameters defined in this configuration file to run the pipeline on samples which were prepared from the ncov-2019 ARTIC Network V1 amplicon set and sequenced on the Illumina MiSeq platform in 301bp paired-end format.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Preprocessing

parallel-fastq-dump

Please see the usage docs for a list of supported public repository identifiers and how to provide them to the pipeline. The final sample information for all identifiers is obtained from the ENA which provides direct download links for FastQ files as well as their associated md5sums. If a download link exists, the files will be downloaded by FTP otherwise they will be downloaded using parallel-fastq-dump.

Output files
  • preprocess/sra/
    • sra_run_info.tsv: Run information file for all samples to be downloaded from the ENA/SRA.
    • *.fastq.gz: Paired-end/single-end reads downloaded and extracted from the ENA/SRA.
  • preprocess/sra/md5/
    • *.md5: Files containing md5 sum for FastQ files downloaded from ENA/SRA.
  • preprocess/sra/log/
    • *.fastq_dump.log: Log file generated from stdout whilst running parallel-fastq-dump.

NB: Downloaded FastQ files will only be saved in the results directory if the --save_sra_fastq parameter is supplied.

cat

If multiple libraries/runs have been provided for the same sample in the input samplesheet (e.g. to increase sequencing depth) then these will be merged at the very beginning of the pipeline in order to have consistent sample naming throughout the pipeline. Please refer to the usage docs to see how to specify these samples in the input samplesheet.

FastQC

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

Output files
  • preprocess/fastqc/
    • *_fastqc.html: FastQC report containing quality metrics.
  • preprocess/fastqc/zips/
    • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

NB: The FastQC plots in this directory are generated relative to the raw, input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after trimming please refer to the FastQC reports in the preprocess/fastp/fastqc/ directory.

MultiQC - FastQC per base sequence plot

fastp

fastp is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It has been developed in C++ with multithreading support to achieve higher performance. fastp is used in this pipeline for standard adapter trimming and quality filtering.

Output files
  • preprocess/fastp/
    • *.fastp.html: Trimming report in html format.
    • *.fastp.json: Trimming report in json format.
    • *.trim.fastq.gz: Paired-end/single-end trimmed reads.
    • *.trim.fail.gz: Unpaired trimmed reads (only for paired-end data).
  • preprocess/fastp/log/
    • *.fastp.log: Trimming log file.
  • preprocess/fastp/fastqc/:
    • *.trim_fastqc.html: FastQC report of the trimmed reads.
  • preprocess/fastp/fastqc/zips/
    • *.trim_fastqc.zip: Zip archive containing the FastQC report.

NB: Post-trimmed FastQ files will only be saved in the results directory if the --save_trimmed parameter is supplied.

MultiQC - fastp filtered reads plot

Variant calling

A file called summary_variants_metrics_mqc.tsv containing a selection of read and variant calling metrics will be saved in the variants/ results directory. The same metrics have also been added to the top of the MultiQC report.

Bowtie 2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Bowtie 2 supports gapped, local, and paired-end alignment modes.

Output files
  • variants/bam/
    • <SAMPLE>.bam: Original BAM file created by Bowtie 2. Only present if --save_align_intermeds parameter is supplied.
  • variants/bam/log/
    • <SAMPLE>.bowtie2.log: Bowtie 2 mapping log file.

MultiQC - Bowtie2 alignment score plot

SAMtools

Bowtie 2 BAM files are further processed with SAMtools to sort them by coordinate, for indexing, as well as to generate read mapping statistics.

Output files
  • variants/bam/
    • <SAMPLE>.sorted.bam: Coordinate sorted BAM file containing read alignment information.
    • <SAMPLE>.sorted.bam.bai: Index file for coordinate sorted BAM file.
  • variants/bam/samtools_stats/
    • SAMtools <SAMPLE>.sorted.bam.flagstat, <SAMPLE>.sorted.bam.idxstats and <SAMPLE>.sorted.bam.stats files generated from the alignment files.

NB: BAM files and their associated indices will only be saved in the results directory if the --save_align_intermeds parameter is supplied.

MultiQC - SAMtools alignment scores plot

iVar trim

If the --protocol amplicon parameter is provided then iVar is used to trim amplicon primer sequences from the aligned reads. iVar uses the primer positions supplied in --amplicon_bed to soft clip primer sequences from a coordinate sorted BAM file.

Output files
  • variants/bam/
    • <SAMPLE>.trim.sorted.bam: Coordinate sorted BAM file after primer trimming.
    • <SAMPLE>.trim.sorted.bam.bai: Index file for coordinate sorted BAM file after primer trimming.
  • variants/bam/samtools_stats/
    • SAMtools <SAMPLE>.trim.flagstat, <SAMPLE>.trim.idxstats and <SAMPLE>.trim.stats files generated from the primer trimmed alignment files.
  • variants/bam/log/
    • <SAMPLE>.trim.ivar.log: iVar trim log file obtained from stdout.

NB: Post-trimmed BAM files and their associated indices will only be saved in the results directory if the --save_align_intermeds parameter is supplied.

MultiQC - iVar trim primer heatmap

picard MarkDuplicates

Unless you are using UMIs it is not possible to establish whether the fragments you have sequenced from your sample were derived via true biological duplication (i.e. sequencing independent template fragments) or as a result of PCR biases introduced during the library preparation. By default, the pipeline uses picard MarkDuplicates to mark the duplicate reads identified amongst the alignments to allow you to guage the overall level of duplication in your samples. However, you can also choose to remove any reads identified as duplicates via the --filter_dups parameter.

Output files
  • variants/bam/
    • <SAMPLE>.<SUFFIX>.sorted.bam: Coordinate sorted BAM file after duplicate marking.
    • <SAMPLE>.<SUFFIX>.sorted.bam.bai: Index file for coordinate sorted BAM file after duplicate marking.
  • variants/bam/samtools_stats/
    • SAMtools <SAMPLE>.<SUFFIX>.flagstat, <SAMPLE>.<SUFFIX>.idxstats and <SAMPLE>.<SUFFIX>.stats files generated from the duplicate marked alignment files.
  • variants/bam/picard_metrics/
    • <SAMPLE>.<SUFFIX>.MarkDuplicates.metrics.txt: Metrics file from MarkDuplicates.

NB: The value of <SUFFIX> in the output file names above will depend on the preceeding steps that were run in the pipeline. If --protocol amplicon is specified then this process will be run on the iVar trimmed alignments and the value of <SUFFIX> will be trim.mkD. However, if --protocol metagenomic is specified then the process will be run on the alignments obtained directly from Bowtie 2 and the value of <SUFFIX> will be mkD; where mkD is an abbreviation for MarkDuplicates.

MultiQC - Picard MarkDuplicates metrics plot

picard CollectMultipleMetrics

picard-tools is a set of command-line tools for manipulating high-throughput sequencing data. We use picard-tools in this pipeline to obtain mapping and coverage metrics.

Output files
  • variants/bam/picard_metrics/
    • <SAMPLE>.<SUFFIX>.CollectMultipleMetrics.*: Alignment QC files from picard CollectMultipleMetrics in *_metrics textual format and plotted in *.pdf format.
    • <SAMPLE>.<SUFFIX>.CollectWgsMetrics.coverage_metrics: Coverage metrics file from CollectWgsMetrics.

NB: The value of <SUFFIX> in the output file names above will depend on the preceeding steps that were run in the pipeline. If --protocol amplicon is specified then this process will be run on the iVar trimmed alignments and the value of <SUFFIX> will be trim.mkD. However, if --protocol metagenomic is specified then the process will be run on the alignments obtained directly from Bowtie 2 and the value of <SUFFIX> will be mkD; where mkD is an abbreviation for MarkDuplicates.

MultiQC - Picard whole genome coverage plot

MultiQC - Picard insert size plot

mosdepth

mosdepth is a fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. mosdepth is used in this pipeline to obtain genome-wide coverage values in 200bp windows and for --protocol amplicon to obtain amplicon/region-specific coverage metrics. The results are then either rendered in MultiQC (genome-wide coverage) or are plotted using custom R scripts.

Output files
  • variants/bam/mosdepth/genome/
    • <SAMPLE>.<SUFFIX>.genome.mosdepth.global.dist.txt: A distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
    • <SAMPLE>.<SUFFIX>.genome.mosdepth.region.dist.txt: A distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
    • <SAMPLE>.<SUFFIX>.genome.mosdepth.summary.txt: Summary metrics including mean, min and max coverage values.
    • <SAMPLE>.<SUFFIX>.genome.per-base.bed.gz: Per-base depth output genome-wide.
    • <SAMPLE>.<SUFFIX>.genome.per-base.bed.gz.csi: CSI index that can be used for tabix queries from above file.
    • <SAMPLE>.<SUFFIX>.genome.regions.bed.gz: Mean regional depth for 200bp windows genome-wide.
    • <SAMPLE>.<SUFFIX>.genome.regions.bed.gz.csi: CSI index that can be used for tabix queries from above file.
  • variants/bam/mosdepth/genome/plots/
    • all_samples.<SUFFIX>.genome.regions.coverage.tsv: File aggregating genome-wide coverage values across all samples used for plotting.
    • <SAMPLE>.<SUFFIX>.genome.regions.coverage.pdf: Whole-genome coverage plot.
    • <SAMPLE>.<SUFFIX>.genome.regions.coverage.tsv: File containing coverage values for the above plot.
  • variants/bam/mosdepth/amplicon/
    • <SAMPLE>.<SUFFIX>.amplicon.mosdepth.global.dist.txt: A distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
    • <SAMPLE>.<SUFFIX>.amplicon.mosdepth.region.dist.txt: A distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
    • <SAMPLE>.<SUFFIX>.amplicon.mosdepth.summary.txt: Summary metrics including mean, min and max coverage values.
    • <SAMPLE>.<SUFFIX>.amplicon.per-base.bed.gz: Per-base depth output genome-wide.
    • <SAMPLE>.<SUFFIX>.amplicon.per-base.bed.gz.csi: CSI index that can be used for tabix queries from above file.
    • <SAMPLE>.<SUFFIX>.amplicon.regions.bed.gz: Mean regional depth for individual amplicons genome-wide.
    • <SAMPLE>.<SUFFIX>.amplicon.regions.bed.gz.csi: CSI index that can be used for tabix queries from above file.
    • <SAMPLE>.<SUFFIX>.amplicon.thresholds.bed.gz: Threshold output to indicate how many bases in each region are covered at given thresholds.
    • <SAMPLE>.<SUFFIX>.amplicon.thresholds.bed.gz.csi: CSI index that can be used for tabix queries from above file.
  • variants/bam/mosdepth/amplicon/plots/
    • all_samples.<SUFFIX>.amplicon.regions.coverage.tsv: File aggregating per-amplicon coverage values across all samples used for plotting.
    • all_samples.<SUFFIX>.amplicon.regions.heatmap.pdf: Heatmap showing per-amplicon coverage across all samples.
    • <SAMPLE>.<SUFFIX>.amplicon.regions.coverage.pdf: Bar plot showing per-amplicon coverage for an individual sample.
    • <SAMPLE>.<SUFFIX>.amplicon.regions.coverage.tsv: File containing per-amplicon coverage values for the above plot.

NB: The value of <SUFFIX> in the output file names above will depend on the preceeding steps that were run in the pipeline. If --protocol amplicon is specified then this process will be run on the iVar trimmed alignments and the value of <SUFFIX> will be trim.mkD. However, if --protocol metagenomic is specified then the process will be run on the alignments obtained directly from Bowtie 2 and the value of <SUFFIX> will be mkD; where mkD is an abbreviation for MarkDuplicates.

R - Sample genome-wide coverage plot

R - Sample per-amplicon coverage plot

VarScan 2, BCFTools, BEDTools

VarScan 2 is a platform-independent software tool to detect variants in NGS data. In this pipeline, VarScan 2 is used in conjunction with SAMtools in order to call both high and low frequency variants.

BCFtools is a set of utilities that manipulate variant calls in VCF and its binary counterpart BCF format. BCFTools is used in the variant calling and de novo assembly steps of this pipeline to obtain basic statistics from the VCF output. It is also used in the VarScan 2 variant calling branch of the pipeline to generate a consensus sequence by integrating high frequency variant calls into the reference genome.

BEDTools is a swiss-army knife of tools for a wide-range of genomics analysis tasks. In this pipeline we use bedtools genomecov to compute the per-base mapped read coverage in bedGraph format, and bedtools maskfasta to mask sequences in a Fasta file based on intervals defined in a feature file. This may be useful for creating your own masked genome file based on custom annotations or for masking all but your target regions when aligning sequence data from a targeted capture experiment.

Output files
  • variants/varscan2/
    • <SAMPLE>.vcf.gz: Low frequency variants VCF file.
    • <SAMPLE>.vcf.gz.tbi: Low frequency variants VCF index file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.vcf.gz: High frequency variants VCF file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.vcf.gz.tbi: High frequency variants VCF index file.
  • variants/varscan2/consensus/
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.consensus.fa: Consensus Fasta file generated by integrating the high frequency variants called by VarScan into the reference genome.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.consensus.masked.fa: Masked consensus Fasta file.
  • variants/varscan2/consensus/base_qc/
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.ACTG_density.pdf: Plot showing density of ACGT bases within the consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.base_counts.pdf: Plot showing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.base_counts.tsv: File containing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.N_density.pdf: Plot showing density of N bases within the consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.N_run.tsv: File containing start positions and width of N bases in consensus sequence.
  • variants/varscan2/bcftools_stats/
    • <SAMPLE>.bcftools_stats.txt: Statistics and counts obtained from low frequency variants VCF file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.bcftools_stats.txt: Statistics and counts obtained from high frequency variants VCF file.
  • variants/varscan2/log/
    • <SAMPLE>.varscan2.log: Log file generated from stderr by VarScan 2.
  • variants/bam/mpileup/
    • <SAMPLE>.<SUFFIX>.mpileup: mpileup files summarize all the data from aligned reads at a given genomic position. Each row of the mpileup file gives similar information to a single vertical column of reads as visualised in IGV.

NB: The value of <MAX_ALLELE_FREQ> in the output file names above is determined by the --max_allele_freq parameter (Default: 0.8).
NB: Output mpileup files will only be saved in the directory if the --save_mpileup parameter is supplied. The naming convention for these files will depend on the preceeding steps that were run in the pipeline as described in the paragraph explaining the value of <SUFFIX> in the section above.

MultiQC - VarScan 2 variants called plot

iVar variants and iVar consensus

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing. We use iVar in this pipeline to trim primer sequences for amplicon input data as well as to call variants and for consensus sequence generation.

Output files
  • variants/ivar/
    • <SAMPLE>.tsv: Low frequency variants in TSV format.
    • <SAMPLE>.vcf.gz: Low frequency variants VCF file.
    • <SAMPLE>.vcf.gz.tbi: Low frequency variants VCF index file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.vcf.gz: High frequency variants VCF file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.vcf.gz.tbi: High frequency variants VCF index file.
  • variants/ivar/consensus/
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.consensus.fa: Consensus Fasta file generated by iVar at the frequency threshold set by the --max_allele_freq parameter.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.consensus.qual.txt: File with the average quality of each base in the consensus sequence.
  • variants/ivar/consensus/base_qc/
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.ACTG_density.pdf: Plot showing density of ACGT bases within the consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.base_counts.pdf: Plot showing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.base_counts.tsv: File containing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.N_density.pdf: Plot showing density of N bases within the consensus sequence.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.N_run.tsv: File containing start positions and width of N bases in consensus sequence.
  • variants/ivar/log/
    • <SAMPLE>.variant.counts.log: Variant counts for low frequency variants.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.variant.counts.log: Variant counts for high frequency variants.
  • variants/ivar/bcftools_stats/
    • <SAMPLE>.bcftools_stats.txt: Statistics and counts obtained from low frequency variants VCF file.
    • <SAMPLE>.AF<MAX_ALLELE_FREQ>.bcftools_stats.txt: Statistics and counts obtained from high frequency variants VCF file.

MultiQC - iVar variants called plot

BCFTools and BEDTools

BCFtools can be used to call variants directly from BAM alignment files. The functionality to call variants with BCFTools in this pipeline was inspired by work carried out by Conor Walker. In contrast to VarScan 2 and iVar, the original variant calls obtained by BCFTools are not filtered further by a higher allele frequency. It seems that the default calls obtained by BCFTools appear to be comparable with the high frequency variants generated by VarScan 2 and iVar.

Output files
  • variants/bcftools/
    • <SAMPLE>.vcf.gz: Variants VCF file.
    • <SAMPLE>.vcf.gz.tbi: Variants VCF index file.
  • variants/bcftools/consensus/
    • <SAMPLE>.consensus.fa: Consensus Fasta file generated by integrating the variants called by BCFTools into the reference genome.
    • <SAMPLE>.consensus.masked.fa: Masked consensus Fasta file.
  • variants/bcftools/consensus/base_qc/
    • <SAMPLE>.ACTG_density.pdf: Plot showing density of ACGT bases within the consensus sequence.
    • <SAMPLE>.base_counts.pdf: Plot showing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.base_counts.tsv: File containing frequency and percentages of all bases in consensus sequence.
    • <SAMPLE>.N_density.pdf: Plot showing density of N bases within the consensus sequence.
    • <SAMPLE>.N_run.tsv: File containing start positions and width of N bases in consensus sequence.
  • variants/bcftools/bcftools_stats/
    • <SAMPLE>.bcftools_stats.txt: Statistics and counts obtained from VCF file.

MultiQC - BCFTools variant counts

SnpEff and SnpSift

SnpEff is a genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

SnpSift annotates genomic variants using databases, filters, and manipulates genomic annotated variants. After annotation with SnpEff, you can use SnpSift to help filter large genomic datasets in order to find the most significant variants.

Output files
  • variants/<CALLER>/snpeff/
    • *.snpEff.csv: Variant annotation csv file.
    • *.snpEff.genes.txt: Gene table for annotated variants.
    • *.snpEff.summary.html: Summary html file for variants.
    • *.snpEff.vcf.gz: VCF file with variant annotations.
    • *.snpEff.vcf.gz.tbi: Index for VCF file with variant annotations.
    • *.snpSift.table.txt: SnpSift summary table.

NB: The value of <CALLER> in the output directory name above is determined by the --callers parameter (Default: ‘varscan2,ivar,bcftools’). If applicable, you will have two sets of files where the file name prefix will be <SAMPLE> for low-frequency variants and <SAMPLE>.AF<MAX_ALLELE_FREQ> for high frequency variants.

MultiQC - SnpEff annotation counts

QUAST

QUAST is used to generate a single report with which to evaluate the quality of the consensus sequence across all of the samples provided to the pipeline. The HTML results can be opened within any browser (we recommend using Google Chrome). Please see the QUAST output docs for more detailed information regarding the output files.

Output files
  • variants/<CALLER>/quast/AF<MAX_ALLELE_FREQ>/
    • report.html: Results report in HTML format. Also available in various other file formats i.e. report.pdf, report.tex, report.tsv and report.txt.

NB: The value of <CALLER> in the output directory name above is determined by the --callers parameter (Default: ‘varscan2,ivar,bcftools’) and the value of <MAX_ALLELE_FREQ> is determined by the --max_allele_freq parameter (Default: 0.8).

BCFTools isec

BCFTools isec can be used to intersect the variant calls generated by the 3 different callers used in the pipeline. This permits a quick assessment of how consistently a particular variant is being called using different algorithms and to prioritise the investigation of the variants.

Output files
  • variants/intersect/<SAMPLE>/
    • *.vcf.gz: VCF file containing variants common to at least 2/3 callers. There will be one file for each caller - see README.txt for details.
    • *.vcf.gz.tbi: Index for VCF file.
    • README.txt: File containing command used and file name mappings.
    • sites.txt: List of variants common to at least 2/3 callers in textual format. The last column indicates presence (1) or absence (0) amongst the 3 different callers.

NB: This process will only be executed when all 3 variant callers are specified to run, as is by default i.e. --callers varscan2,ivar,bcftools.

De novo assembly

A file called summary_assembly_metrics_mqc.tsv containing a selection of read and de novo assembly related metrics will be saved in the assembly/ results directory. The same metrics have also been added to the top of the MultiQC report.

Cutadapt

In the variant calling branch of the pipeline we are using iVar trim to remove primer sequences from the aligned BAM files for amplicon data. Since in the de novo assembly branch we don’t align the reads, we use Cutadapt as an alternative option to remove and clean the primer sequences directly from FastQ files.

Output files
  • assembly/cutadapt/
    • *.ptrim.fastq.gz: FastQ files after primer sequence trimming.
  • assembly/cutadapt/log/
    • *.cutadapt.log: Cutadapt log file generated from stdout.
  • assembly/cutadapt/fastqc/
    • *.ptrim_fastqc.html: FastQC report of the trimmed reads.
  • assembly/cutadapt/fastqc/zips/
    • *.ptrim_fastqc.zip: Zip archive containing the FastQC report.

NB: Trimmed FastQ files will only be saved in the results directory if the --save_trimmed parameter is supplied.

MultiQC - Cutadapt filtered reads plot

Kraken 2

Kraken 2 is a sequence classifier that assigns taxonomic labels to DNA sequences. Kraken 2 examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

We used a Kraken 2 database in this workflow to filter out reads specific to the host genome. The remainder of the reads are then passed to numerous de novo assembly algorithms in order to reconstruct the viral genome.

Output files
  • assembly/kraken2/
    • *.host*.fastq.gz: Reads that were classified to the host database.
    • *.viral*.fastq.gz: Reads that were unclassified to the host database.
    • *.kraken2.report.txt: Kraken 2 taxonomic report. See here for a detailed description of the format.

NB: Output FastQ files will only be saved in the results directory if the --save_kraken2_fastq parameter is supplied.

MultiQC - Kraken 2 classification plot

SPAdes

SPAdes is an assembly toolkit containing various assembly pipelines. Generically speaking, SPAdes is one of the most popular de Bruijn graph-based assembly algorithms used for bacterial/viral genome reconstruction.

Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file, Bandage opens up new possibilities for analysing de novo assemblies.

Output files
  • assembly/spades/
    • *.scaffolds.fa: SPAdes scaffold assembly.
    • *.assembly.gfa: SPAdes assembly graph in GFA format.
  • assembly/spades/bandage/
    • *.png: Bandage visualisation for SPAdes assembly graph in PNG format.
    • *.svg: Bandage visualisation for SPAdes assembly graph in SVG format.

metaSPAdes

metaSPAdes is a de Bruijn graph-based assembler that is distributed with SPAdes and executed via the --meta option. It can be used for the simultaneous reconstruction of multiple genomes as observed in metagenomics data.

Output files
  • assembly/metaspades/
    • *.scaffolds.fa: metaSPAdes scaffold assembly.
    • *.assembly.gfa: metaSPAdes assembly graph in GFA format.
  • assembly/metaspades/bandage/
    • *.png: Bandage visualisation for metaSPAdes assembly graph in PNG format.
    • *.svg: Bandage visualisation for metaSPAdes assembly graph in SVG format.

Unicycler

Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser.

Output files
  • assembly/unicycler/
    • *.scaffolds.fa: Unicycler scaffold assembly.
    • *.assembly.gfa: Unicycler assembly graph in GFA format.
  • assembly/unicycler/bandage/
    • *.png: Bandage visualisation for Unicycler assembly graph in PNG format.
    • *.svg: Bandage visualisation for Unicycler assembly graph in SVG format.

minia

Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers.

Output files
  • assembly/minia/<MINIA_KMER>/
    • *.scaffolds.fa: Minia scaffold assembly.

NB: The value of <MINIA_KMER> in the output directory name above is determined by the --minia_kmer parameter (Default: 31).

BLAST

blastn is used to align the assembled contigs against the virus reference genome.

Output files
  • assembly/<ASSEMBLER>/blast/
    • *.blast.txt: BLAST results against the target virus.
    • *.blast.filt.header.txt: Filtered BLAST results.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

ABACAS

ABACAS was developed to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun assembled contigs based on a reference sequence.

Output files
  • assembly/<ASSEMBLER>/abacas/
    • *.abacas.bin: Bin file that contains contigs that are not used in ordering.
    • *.abacas.crunch: Comparison file.
    • *.abacas.fasta: Ordered and orientated sequence file.
    • *.abacas.gaps: Gap information.
    • *.abacas.gaps.tab: Gap information in tab-delimited format.
    • *.abacas.MULTIFASTA.fa: A list of ordered and orientated contigs in a multi-fasta format.
    • *.abacas.tab: Feature file
    • *.unused_contigs.out: Information on contigs that have a mapping information but could not be used in the ordering.
  • assembly/<ASSEMBLER>/abacas/nucmer/: Folder containing the files generated by the NUCmer algorithm used by ABACAS.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

PlasmidID

PlasmidID was used to graphically represent the alignment of the reference genome relative to a given assembly. This helps to visualize the coverage of the reference genome in the assembly. To find more information about the output files refer to the documentation.

Output files
  • assembly/<ASSEMBLER>/plasmidid/<SAMPLE>/
    • images/<SAMPLE>_<REF_NAME>.png: PNG file with the visualization of the alignment between the viral assembly and the reference viral genome.
    • data/: Files used for drawing the circos images.
    • database/: Annotation files used for drawing the circos images.
    • fasta_files: Folder with fasta files that correspond to the selection of contigs/scaffolds required to reconstruct the reference genome generated in the images/ folder.
    • log/: Log files.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

Assembly QUAST

QUAST is used to generate a single report with which to evaluate the quality of the de novo assemblies across all of the samples provided to the pipeline. The HTML results can be opened within any browser (we recommend using Google Chrome). Please see the QUAST output docs for more detailed information regarding the output files.

Output files
  • assembly/<ASSEMBLER>/quast/
    • report.html: Results report in HTML format. Also available in various other file formats i.e. report.pdf, report.tex, report.tsv and report.txt.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

MultiQC - QUAST contig counts

Minimap2, seqwish, vg

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Minimap2 was used to generate all-versus-all alignments between scaffold assembly contigs and the reference genome.

seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments. seqwish was used to induce a genome variation graph from the all-versus-all alignment generated by Minimap2.

vg is a collection of tools for working with genome variation graphs. vg was used to call variants from the genome variation graph generated by seqwish.

Bandage, a Bioinformatics Application for Navigating De novo Assembly Graphs Easily, is a GUI program that allows users to interact with the assembly graphs made by de novo assemblers and other graphs in GFA format. Bandage was used to render induced genome variation graphs as static PNG and SVG images.

Output files
  • assembly/<ASSEMBLER>/variants/
    • *.gfa: Induced genome variation graph.
    • *.vcf.gz: VCF file with variant annotations.
    • *.vcf.gz.tbi: Index for VCF file with variant annotations.
  • assembly/<ASSEMBLER>/variants/bcftools_stats/
    • *.bcftools_stats.txt: Statistics and counts for variants in VCF files.
  • assembly/<ASSEMBLER>/bandage/
    • *.png: Bandage visualisation for induced genome variation graph in PNG format.
    • *.svg: Bandage visualisation for induced genome variation graph in SVG format.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

Assembly SnpEff and SnpSift

SnpEff is a genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

SnpSift annotates genomic variants using databases, filters, and manipulates genomic annotated variants. After annotation with SnpEff, you can use SnpSift to help filter large genomic datasets in order to find the most significant variants.

Output files
  • assembly/<ASSEMBLER>/variants/snpeff/
    • *.snpEff.csv: Variant annotation csv file.
    • *.snpEff.genes.txt: Gene table for annotated variants.
    • *.snpEff.summary.html: Summary html file for variants.
    • *.snpEff.vcf.gz: VCF file with variant annotations.
    • *.snpEff.vcf.gz.tbi: Index for VCF file with variant annotations.
    • *.snpSift.table.txt: SnpSift summary table.

NB: The value of <ASSEMBLER> in the output directory name above is determined by the --assemblers parameter (Default: ‘spades,metaspades,unicycler,minia’).

Workflow reporting and genomes

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from FastQC, fastp, Cutadapt, Bowtie 2, Kraken 2, VarScan 2, iVar, samtools flagstat, samtools idxstats, samtools stats, picard CollectMultipleMetrics and CollectWgsMetrics, BCFTools, SnpEff and QUAST.

The default multiqc config file has been written in a way in which to structure these QC metrics to make them more interpretable in the final report.

The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Please click here to see an example MultiQC report generated using the parameters defined in this configuration file to run the pipeline on samples which were prepared from the ncov-2019 ARTIC Network V1 amplicon set and sequenced on the Illumina MiSeq platform in 301bp paired-end format.

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

Reference genome files

A number of genome-specific files are generated by the pipeline because they are required for the downstream processing of the results. If the --save_reference parameter is provided then the Bowtie 2 alignment indices, BLAST and Kraken 2 databases downloaded/generated by the pipeline will be saved in the genome/ directory. It is recommended to use the --save_reference parameter if you are using the pipeline to build a Kraken 2 database for the host genome. This can be quite a time-consuming process and it permits their reuse for future runs of the pipeline or for other purposes.

Output files
  • genome/
    • BlastDB/: BLAST database for viral genome.
    • Bowtie2Index/: Bowtie 2 index for viral genome.
    • kraken2_<KRAKEN2_DB_NAME>/: Kraken 2 database for host genome.
    • SnpEffDB/: SnpEff database for viral genome.
    • snpeff.config: SnpEff config file for viral genome.
    • Unzipped genome fasta file for viral genome
    • Unzipped genome annotation GFF file for viral genome

Pipeline information

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.csv.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Documentation for interpretation of results in HTML format: results_description.html.