Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

FastQC

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

Output files
  • fastqc/
    • *_fastqc.html: FastQC report containing quality metrics for your untrimmed raw fastq files.

Cutadapt

Cutadapt trims primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect the sample sequences. This is especially true for degenerate PCR primers. If primer trimming were omitted, the denoising tool might compute artifactual amplicon sequence variants, or sequences might be lost because they are labelled as PCR chimeras.

Output files
  • cutadapt/: directory containing log files with retained reads, trimming percentage, etc. for each sample.
    • cutadapt_summary.tsv: Summary of read numbers that pass cutadapt.
    • assignTaxonomy.cutadapt.log: Contains how many expected amplified sequences were extracted from the DADA2 reference taxonomy database. Optional.

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.

DADA2

DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.

DADA2 computes an error model on the sequencing reads (forward and reverse independently), so quality filtering and paired read merging must not be performed beforehand. Each sequencing run has its own error profile, and it is recommended that DADA2 runs separately on the data from each run. Use the ampliseq option --multiple_sequencing_runs to analyse such data.

DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.

Additionally, DADA2 taxonomically classifies the ASVs using a choice of supplied databases (specified with --dada_ref_taxonomy).

Output files
  • dada2/
    • ASV_seqs.fasta: Fasta file with ASV sequences.
    • ASV_table.tsv: Counts for each ASV sequence.
    • ASV_tax.tsv: Taxonomic classification for each ASV sequence.
    • ASV_tax_species.tsv: Species classification for each ASV sequence.
    • ref_taxonomy.txt: Information about the used reference taxonomy, such as title, version, citation.
    • DADA2_stats.tsv: Tracking read numbers through DADA2 processing steps, for each sample.
    • DADA2_table.rds: DADA2 ASV table as R object.
    • DADA2_tables.tsv: DADA2 ASV table.
  • dada2/args/: Directory containing files with all parameters for DADA2 steps.
  • dada2/log/: Directory containing log files for DADA2 steps.
  • dada2/QC/
    • *.err.convergence.txt: Convergence values for DADA2’s dada command; these should decrease over several orders of magnitude and approach 0.
    • *.err.pdf: Estimated error rates for each possible transition. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. The estimated error rates (black line) should be a good fit to the observed rates (points), and the error rates should drop with increased quality.
    • *_qual_stats.pdf: Overall read quality profiles: heat map of the frequency of each quality score at each base position. The mean quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position.
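
The expectation that the convergence values fall by several orders of magnitude can be checked with a few lines of Python. This is only a minimal sketch: the values are invented for illustration, the function name is hypothetical, and reading the numbers from a *.err.convergence.txt file is left out because the file layout is not described here.

```python
import math

def magnitude_drop(values):
    """Return how many orders of magnitude the series fell from first to last value."""
    first, last = values[0], values[-1]
    if first <= 0:
        raise ValueError("the first value must be positive")
    if last <= 0:
        # A final value of exactly 0 means the series converged completely.
        return math.inf
    return math.log10(first / last)

# Invented convergence values, one per iteration (assumption, not pipeline output).
convergence = [12.5, 0.9, 0.04, 0.0006, 0.000012]

drop = magnitude_drop(convergence)
strictly_decreasing = all(a > b for a, b in zip(convergence, convergence[1:]))
print(f"dropped {drop:.1f} orders of magnitude, strictly decreasing: {strictly_decreasing}")
```

A series that stalls at a high value or rises again would give a small drop or a False flag, which is a hint to inspect the matching *.err.pdf plot.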

Barrnap

Barrnap predicts the location of ribosomal RNA genes in genomes. Here it can be used to discriminate rRNA sequences from potential contamination. It supports bacteria (5S,23S,16S), archaea (5S,5.8S,23S,16S), metazoan mitochondria (12S,16S) and eukaryotes (5S,5.8S,28S,18S).

Optionally, ASV sequences can be filtered for rRNA sequences identified by Barrnap with --filter_ssu that can take a list of abbreviations of the above supported categories (kingdoms), e.g. bac,arc,mito,euk. This filtering takes place after DADA2’s ASV computation (i.e. after chimera removal) but before taxonomic classification (also applies to above mentioned taxonomic classification with DADA2, i.e. files ASV_tax.tsv & ASV_tax_species.tsv).

Output files
  • barrnap/
    • ASV_seqs.ssu.fasta: Fasta file with filtered ASV sequences.
    • ASV_table.ssu.tsv: Counts for each filtered ASV sequence.
    • rrna.<kingdom>.gff: GFF3 output for rRNA matches per kingdom, where kingdom is one of bac,arc,mito,euk.
    • stats.ssu.tsv: Tracking read numbers through filtering, for each sample.
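
To see which ASVs received an rRNA match in one of the rrna.<kingdom>.gff files, it is enough to read the first column (the sequence ID) of each GFF3 feature line. The sketch below assumes standard nine-column, tab-separated GFF3; the example lines and ASV IDs are invented, and the function name is hypothetical. It does not reproduce how the pipeline itself decides which ASVs to keep.

```python
def seqids_from_gff3(text):
    """Collect sequence IDs (column 1) from GFF3 feature lines, skipping comments."""
    ids = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(fields) == 9:
            ids.add(fields[0])
    return ids

# Invented example content (assumption); real files are written to barrnap/.
example_gff = "\n".join([
    "##gff-version 3",
    "ASV_1\tbarrnap\trRNA\t1\t250\t1e-50\t+\t.\tName=16S_rRNA",
    "ASV_3\tbarrnap\trRNA\t1\t248\t2e-40\t+\t.\tName=16S_rRNA",
])

matched = seqids_from_gff3(example_gff)
print(sorted(matched))
```

Running the function once per kingdom file and comparing the resulting sets shows which ASVs matched in more than one category.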

ITSx

Optionally, the ITS region can be extracted from each ASV sequence using ITSx, and taxonomic classification is performed based on the ITS sequence.

Output files
  • itsx/
    • ASV_ITS_seqs.full.fasta: Fasta file with full ITS region from each ASV sequence.
    • ASV_ITS_seqs.ITS1.fasta or ASV_ITS_seqs.ITS2.fasta: If using --cut_its "its1" or --cut_its "its2"; fasta file with ITS1 or ITS2 region from each ASV sequence.
    • ASV_ITS_seqs.full_and_partial.fasta: If using --its_partial; fasta file with full and partial ITS regions from each ASV sequence.
    • ASV_ITS_seqs.ITS1.full_and_partial.fasta or ASV_ITS_seqs.ITS2.full_and_partial.fasta: If using --cut_its "its1" or --cut_its "its2" and --its_partial; fasta file with complete and partial ITS1 or ITS2 regions from each ASV sequence.
    • ASV_ITS_seqs.summary.txt: Summary information from ITSx.
    • ITSx.args.txt: File with parameters passed to ITSx.
  • dada2/
    • ASV_ITS_tax.tsv: Taxonomic classification with ITS region of each ASV sequence.
    • ASV_ITS_tax_species.tsv: Species classification with ITS region of each ASV sequence.
    • ASV_tax.tsv: Taxonomic classification of each ASV sequence, based on the ITS region.
    • ASV_tax_species.tsv: Species classification of each ASV sequence, based on the ITS region.

QIIME2

Quantitative Insights Into Microbial Ecology 2 (QIIME2) is a next-generation microbiome bioinformatics platform and the successor of the widely used QIIME1.

ASV sequences, counts, and taxonomic classification as produced before with DADA2 are imported into QIIME2 and further analysed. Optionally, ASVs can also be taxonomically classified with QIIME2 against a database chosen with --qiime_ref_taxonomy (but DADA2 taxonomic classification takes precedence). Next, ASVs are filtered (--exclude_taxa, --min_frequency, --min_samples), and abundance tables are exported. Subsequently, diversity indices are calculated and differentially abundant features between sample groups are tested for.

Taxonomic classification

Taxonomic classification with QIIME2 is typically similar to DADA2 classifications. However, both options are available. When taxonomic classification with DADA2 and QIIME2 is performed, DADA2 classification takes precedence over QIIME2 classifications for all downstream analysis.

Output files
  • qiime2/taxonomy/
    • taxonomy.tsv: Tab-separated table with taxonomic classification for each ASV
    • *-classifier.qza: QIIME2 artefact of the trained classifier. Can be supplied to other pipeline runs with --classifier
    • ref_taxonomy.txt: Information about the used reference taxonomy, such as title, version, citation.

Exclude taxa

Removes unwanted taxa in DADA2 output sequences and abundance tables by taxonomic classification. Unwanted taxa are often off-targets generated in PCR with primers that are not perfectly specific for the target DNA. For example, PCR with commonly used primers also amplifies mitochondrial or chloroplast rRNA genes and therefore leads to non-bacterial products. These mitochondrial or chloroplast amplicons are removed in this step by default (--exclude_taxa). The tables are based on the computed taxonomic classification (DADA2 classification takes precedence over QIIME2 classifications).

All following analysis is based on these filtered tables.

Output files
  • qiime2/representative_sequences/
    • rep-seq.fasta: Fasta file with ASV sequences.
    • descriptive_stats.tsv: Length, mean, etc. of ASV sequences.
    • seven_number_summary.tsv: Length of ASV sequences in different quantiles.
  • qiime2/abundance_tables/
    • abs-abund-table-*.tsv: Tab-separated absolute abundance table at taxa level *, where * ranges by default from 2 to 6 or 7, depending on the used reference taxonomy database.
    • count_table_filter_stats.tsv: Tab-separated table with information on how many counts were filtered for each sample.
    • feature-table.biom: Abundance table in biom format for importing into downstream analysis tools.
    • feature-table.tsv: Tab-separated abundance table for each ASV and each sample.
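
The combined effect of the three filters (--exclude_taxa, --min_frequency, --min_samples) can be pictured with a small sketch. It is not the QIIME2 implementation: the data layout, the function name, and the matching rule (case-insensitive substring match on the taxonomy string) are assumptions made for illustration, and the real semantics may differ in detail.

```python
def filter_table(counts, taxonomy, exclude_taxa, min_frequency=1, min_samples=1):
    """Keep ASVs that pass all three filters.

    counts:   {asv_id: {sample: count}}
    taxonomy: {asv_id: taxonomy string}
    """
    kept = {}
    for asv, per_sample in counts.items():
        taxon = taxonomy.get(asv, "").lower()
        # Assumed rule: drop the ASV if any excluded term occurs in its taxonomy.
        if any(term.lower() in taxon for term in exclude_taxa):
            continue
        # Total count over all samples must reach min_frequency.
        if sum(per_sample.values()) < min_frequency:
            continue
        # The ASV must be present in at least min_samples samples.
        if sum(1 for c in per_sample.values() if c > 0) < min_samples:
            continue
        kept[asv] = per_sample
    return kept

# Invented example data (assumption).
counts = {
    "ASV_1": {"s1": 10, "s2": 5},
    "ASV_2": {"s1": 7, "s2": 0},
    "ASV_3": {"s1": 1, "s2": 0},
    "ASV_4": {"s1": 3, "s2": 0},
}
taxonomy = {
    "ASV_1": "Bacteria;Firmicutes",
    "ASV_2": "Bacteria;Cyanobacteria;Chloroplast",
    "ASV_3": "Bacteria;Proteobacteria",
    "ASV_4": "Bacteria;Bacteroidota",
}

kept = filter_table(counts, taxonomy, ["mitochondria", "chloroplast"],
                    min_frequency=2, min_samples=2)
print(sorted(kept))
```

In this example ASV_2 is dropped by taxonomy, ASV_3 by total count, and ASV_4 because it occurs in only one sample.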

Relative abundance tables

Absolute abundance tables produced by the previous steps contain count data, but the compositional nature of 16S rRNA amplicon sequencing requires sequencing depth normalisation. This step computes relative abundance tables for various taxonomic levels and detailed tables for all ASVs with taxonomic classification, sequence and relative abundance for each sample. These tables are typically used for in-depth investigation of taxa abundances. If not specified, the tables are based on the computed taxonomic classification (DADA2 classification takes precedence over QIIME2 classifications).

Output files
  • qiime2/rel_abundance_tables/
    • rel-table-*.tsv: Tab-separated relative abundance table at taxa level *, where * ranges by default from 2 to 6 or 7, depending on the used reference taxonomy database.
    • rel-table-ASV.tsv: Tab-separated relative abundance table for all ASVs.
    • rel-table-ASV_with-DADA2-tax.tsv: Tab-separated table for all ASVs with DADA2 taxonomic classification, sequence and relative abundance.
    • rel-table-ASV_with-QIIME2-tax.tsv: Tab-separated table for all ASVs with QIIME2 taxonomic classification, sequence and relative abundance.
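
Sequencing depth normalisation here means dividing every count by the total count of its sample, so that each sample sums to 1. A minimal sketch of that calculation, assuming a nested dictionary as a stand-in for the count table (the layout, names and numbers are invented):

```python
def relative_abundance(counts):
    """Convert {asv_id: {sample: count}} into per-sample fractions."""
    totals = {}
    for per_sample in counts.values():
        for sample, count in per_sample.items():
            totals[sample] = totals.get(sample, 0) + count
    return {
        asv: {
            # Samples without any counts get 0.0 instead of a division by zero.
            sample: (count / totals[sample] if totals[sample] else 0.0)
            for sample, count in per_sample.items()
        }
        for asv, per_sample in counts.items()
    }

# Invented example counts (assumption).
counts = {
    "ASV_1": {"s1": 30, "s2": 5},
    "ASV_2": {"s1": 10, "s2": 15},
}

rel = relative_abundance(counts)
print(rel)
```

After this step the two samples are comparable even though s1 was sequenced twice as deeply as s2.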

Barplot

Produces an interactive abundance plot from the count tables that aids exploratory browsing of the discovered taxa and their abundance in samples, and allows sorting by associated metadata. DADA2 classification takes precedence over QIIME2 classifications.

Output files
  • qiime2/barplot/
    • index.html: Interactive barplot for taxa abundance per sample that can be viewed in your web browser.

Alpha diversity rarefaction curves

Produces rarefaction plots for several alpha diversity indices, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the slope of the curves does not level out and the lines do not become horizontal, this might be because the sequencing depth was too low to observe all diversity or because sequencing error artificially increases sequence diversity and causes false discoveries.

Output files
  • qiime2/alpha-rarefaction/
    • index.html: Interactive alpha rarefaction curve for taxa abundance per sample that can be viewed in your web browser.

Diversity analysis

Diversity measures summarize important sample features (alpha diversity) or differences between samples (beta diversity). To do so, sample data is first rarefied to the minimum number of counts per sample. Also, a phylogenetic tree of all ASVs is computed to provide phylogenetic information.

Output files
  • qiime2/diversity/
    • Use the sampling depth of * for rarefaction.txt: File that reports the rarefaction depth in the file name and file content.
  • qiime2/phylogenetic_tree/
    • tree.nwk: Phylogenetic tree in newick format.
    • rooted-tree.qza: Phylogenetic tree in QIIME2 format.
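
Rarefying to the minimum number of counts per sample can be illustrated as random subsampling without replacement. The sketch below is not the QIIME2 implementation; the data, the function name and the fixed random seed are assumptions chosen so that the example is reproducible.

```python
import random

def rarefy(sample_counts, depth, rng):
    """Subsample {asv_id: count} without replacement down to `depth` reads."""
    # Expand the counts into one entry per read, then draw `depth` of them.
    pool = [asv for asv, count in sample_counts.items() for _ in range(count)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rarefied = {}
    for asv in rng.sample(pool, depth):
        rarefied[asv] = rarefied.get(asv, 0) + 1
    return rarefied

# Invented example data (assumption).
samples = {
    "s1": {"ASV_1": 50, "ASV_2": 30, "ASV_3": 20},
    "s2": {"ASV_1": 10, "ASV_2": 40},
}

# The rarefaction depth is the smallest per-sample total.
depth = min(sum(counts.values()) for counts in samples.values())
rng = random.Random(42)
rarefied = {name: rarefy(counts, depth, rng) for name, counts in samples.items()}
print(depth, rarefied)
```

The shallowest sample keeps all of its reads, while deeper samples lose reads and possibly rare ASVs, which is why the chosen depth is reported in the output file name.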

Alpha diversity indices

Alpha diversity measures the species diversity within samples. Diversity calculations are based on sub-sampled data rarefied to the minimum read count of all samples. This step calculates alpha diversity using various methods and performs pairwise comparisons of groups of samples. It is based on a phylogenetic tree of all ASV sequences.

Output files
  • qiime2/diversity/alpha_diversity/
    • evenness_vector/index.html: Pielou’s Evenness.
    • faith_pd_vector/index.html: Faith’s Phylogenetic Diversity (qualitative, phylogenetic).
    • observed_otus_vector/index.html: Observed OTUs (qualitative).
    • shannon_vector/index.html: Shannon’s diversity index (quantitative).
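
The non-phylogenetic indices listed above follow from simple formulas on the counts of one sample. The sketch below uses the textbook definitions; the logarithm base for Shannon (2 here) and the handling of single-feature samples are assumptions and may differ from what QIIME2 reports. Faith’s Phylogenetic Diversity needs the tree and is not covered.

```python
import math

def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)

def shannon(counts, base=2):
    """Shannon index H = -sum(p * log(p)) over non-zero proportions."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)

def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S), with H computed using the natural log."""
    richness = observed_features(counts)
    if richness < 2:
        # Undefined for a single feature (ln(1) = 0); returned as NaN here.
        return float("nan")
    return shannon(counts, base=math.e) / math.log(richness)

# Invented example samples (assumption).
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

print(observed_features(even_sample), shannon(even_sample), pielou_evenness(even_sample))
print(observed_features(skewed_sample), shannon(skewed_sample), pielou_evenness(skewed_sample))
```

Both samples contain four features, but only the evenly distributed one reaches the maximum evenness of 1.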

Beta diversity indices

Beta diversity measures the species community differences between samples. Diversity calculations are based on sub-sampled data rarefied to the minimum read count of all samples. This step calculates beta diversity distances using various methods and performs pairwise comparisons of groups of samples. Additionally, principal coordinates analysis (PCoA) plots are produced that can be visualized with Emperor in your default browser without the need for installation. These calculations are based on a phylogenetic tree of all ASV sequences. Furthermore, the ADONIS permutation-based statistical test in vegan-R determines whether groups of samples are significantly different from one another. By default, all metadata columns that are suitable for pairwise comparisons will be transformed into a formula, e.g. “treatment1+treatment2”. A custom formula can be supplied with --qiime_adonis_formula.

The following methods are used to calculate community dissimilarities:

  • Jaccard distance (qualitative)
  • Bray-Curtis distance (quantitative)
  • unweighted UniFrac distance (qualitative, phylogenetic)
  • weighted UniFrac distance (quantitative, phylogenetic)

Output files
  • qiime2/diversity/beta_diversity/
    • <method>_distance_matrix-<treatment>/index.html: Box plots and significance analysis (PERMANOVA).
    • <method>_pcoa_results-PCoA/index.html: Interactive PCoA plot.
  • qiime2/diversity/beta_diversity/adonis
    • <method>_distance_matrix/index.html: Interactive (and .tsv) table of metadata feature importance and significance.
      • method: bray_curtis, jaccard, unweighted_unifrac, weighted_unifrac
      • treatment: depends on your metadata sheet or what metadata categories you have specified
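
The two non-phylogenetic distances can be written down directly from their definitions. The following sketch is for illustration only: the sample data and function names are invented, and the UniFrac variants are omitted because they require the phylogenetic tree.

```python
def jaccard_distance(a, b):
    """1 - |shared features| / |all features|, using presence/absence only."""
    present_a = {k for k, v in a.items() if v > 0}
    present_b = {k for k, v in b.items() if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)

def bray_curtis(a, b):
    """sum(|a_i - b_i|) / sum(a_i + b_i), using the counts themselves."""
    keys = set(a) | set(b)
    numerator = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    denominator = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return numerator / denominator if denominator else 0.0

# Invented example samples with equal totals, as after rarefaction (assumption).
s1 = {"ASV_1": 6, "ASV_2": 4, "ASV_3": 0}
s2 = {"ASV_1": 2, "ASV_2": 4, "ASV_3": 4}

print(jaccard_distance(s1, s2))
print(bray_curtis(s1, s2))
```

Jaccard only registers that ASV_3 is missing from s1, whereas Bray-Curtis also reflects the shift in the ASV_1 counts, which is the difference between the qualitative and quantitative labels used above.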

ANCOM

Analysis of Composition of Microbiomes (ANCOM) is applied to identify features that are differentially abundant across sample groups. A key assumption made by ANCOM is that few taxa (less than about 25%) will be differentially abundant between groups; otherwise the method will be inaccurate.

ANCOM is applied to each suitable or specified metadata column for 5 taxonomic levels (2-6).

Output files
  • qiime2/ancom/
    • Category-<treatment>-<taxonomic level>/index.html: Statistical results and interactive Volcano plot.
      • treatment: depends on your metadata sheet or what metadata categories you have specified
      • taxonomic level: level-2 (phylum), level-3 (class), level-4 (order), level-5 (family), level-6 (genus), ASV

PICRUSt2

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is software for predicting functional abundances based only on marker gene sequences. On demand (--picrust), Enzyme Classification numbers (EC), KEGG orthologs (KO) and MetaCyc ontology predictions will be made for each sample.

PICRUSt2 is preferentially applied to data filtered by QIIME2 but will use DADA2 output if QIIME2 is not run.

Output files
  • PICRUSt2/
    • EC_pred_metagenome_unstrat_descrip.tsv: Predicted quantifications for Enzyme Classification numbers (EC).
    • KO_pred_metagenome_unstrat_descrip.tsv: Predicted quantifications for KEGG orthologs (KO).
    • METACYC_path_abun_unstrat_descrip.tsv: Predicted quantifications for MetaCyc ontology.
    • picrust.args.txt: File containing arguments from the config file
  • PICRUSt2/all_output

NB: Quantifications are not normalized yet; they can be normalized, e.g. by the total sum per sample.
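
One way to apply the total-sum normalization to a tab-separated prediction table is sketched below. The column layout (two leading annotation columns followed by one column per sample), the identifiers and the values are assumptions for illustration; check the header of the actual file and adjust n_annotation_cols before using anything like this.

```python
import csv
import io

def normalize_by_sample_total(tsv_text, n_annotation_cols=2):
    """Divide each sample column by its column sum; annotation columns are kept."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    sample_columns = range(n_annotation_cols, len(header))
    totals = {i: sum(float(row[i]) for row in body) for i in sample_columns}
    normalized = [header]
    for row in body:
        scaled = [float(row[i]) / totals[i] if totals[i] else 0.0 for i in sample_columns]
        normalized.append(row[:n_annotation_cols] + scaled)
    return normalized

# Invented example table (assumption about the layout).
example_tsv = (
    "function\tdescription\ts1\ts2\n"
    "K00001\texample A\t30\t10\n"
    "K00002\texample B\t10\t30\n"
)

table = normalize_by_sample_total(example_tsv)
for row in table:
    print(row)
```

After normalization every sample column sums to 1, so predicted functions can be compared between samples of different depth.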

SBDI export

You can use the --sbdiexport flag (or sbdiexport: true in a Nextflow parameter file using -params-file in yml format) to generate tab-separated files in preparation for submission to the Swedish Biodiversity Infrastructure (SBDI).

Tables are generated from the DADA2 denoising and taxonomy assignment steps. Each table, except annotation.tsv, corresponds to one tab in the submission template. See docs/usage.md for further information. Most of the fields in the template will not be populated by the export process, but if you run nf-core/ampliseq with a sample metadata table (--metadata) any fields corresponding to a field in the template will be used.

Output files
  • SBDI/
    • annotation.tsv: SBDI specific output for taxonomic reannotation, not used in submission to SBDI.
    • asv-table.tsv: asv-table tab of template.
    • emof.tsv: emof tab of template.
    • event.tsv: event tab of template.
    • mixs.tsv: mixs tab of template.

Read count report

This report includes information on how many reads per sample passed each pipeline step in which a loss can occur. Specifically, how many read pairs entered cutadapt, were reverse complemented, passed trimming; how many read pairs entered DADA2, were denoised, merged and non-chimeric; and how many counts were lost when excluding unwanted taxa and removing low abundance/prevalence sequences in QIIME2.

Output files
  • overall_summary.tsv: Tab-separated file with count summary.
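
A quick way to read such a summary is to express every step as a percentage of the input and to find the step with the largest loss. The sketch below works on an ordered list of (step, reads) pairs for one sample; the step names and numbers are invented and do not reflect the actual column names of overall_summary.tsv.

```python
def percent_retained(step_counts):
    """Percentage of the first step's reads that remain after each step."""
    start = step_counts[0][1]
    if start == 0:
        return [(name, 0.0) for name, _ in step_counts]
    return [(name, round(100 * reads / start, 1)) for name, reads in step_counts]

def largest_loss(step_counts):
    """Return (step name, reads lost) for the step that removed the most reads."""
    losses = [
        (step_counts[i][0], step_counts[i - 1][1] - step_counts[i][1])
        for i in range(1, len(step_counts))
    ]
    return max(losses, key=lambda item: item[1])

# Invented step names and counts for one sample (assumption).
steps = [
    ("cutadapt_input", 10000),
    ("cutadapt_passed", 9500),
    ("denoised", 9000),
    ("merged", 8000),
    ("nonchimeric", 7500),
]

print(percent_retained(steps))
print(largest_loss(steps))
```

A sample that loses most of its reads at a single step points to a specific problem, for example primer mismatches at trimming or insufficient read overlap at merging.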

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.