nf-core/ampliseq
Amplicon sequencing analysis workflow using DADA2 and QIIME2
Version 2.1.0. The latest stable release is 2.11.0.
Either a tab-separated sample sheet, a fasta file, or a folder containing zipped FastQ files
string
Points to the main pipeline input, one of the following:
- folder containing compressed fastq files
- sample sheet ending with .tsv that points towards compressed fastq files
- fasta file ending with .fasta, .fna or .fa that will be taxonomically classified
Related parameters are:
- --pacbio and --iontorrent if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)
- --single_end if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)
- --multiple_sequencing_runs (folder input only) if the sequencing data originates from multiple sequencing runs
- --extension (folder input only) if the sequencing file names do not follow the default ("/*_R{1,2}_001.fastq.gz")
- --dada_ref_taxonomy and --qiime_ref_taxonomy to choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS) (default: 16S rRNA sequence database)
Folder containing zipped FastQ files
For example:
--input 'path/to/data'
Example for input data organization from one sequencing run with two samples, paired-end data:
data
|-sample1_1_L001_R1_001.fastq.gz
|-sample1_1_L001_R2_001.fastq.gz
|-sample2_1_L001_R1_001.fastq.gz
|-sample2_1_L001_R2_001.fastq.gz
Please note the following requirements:
- The path must be enclosed in quotes
- The folder must contain gzip compressed demultiplexed fastq files. If the file names do not follow the default ("/*_R{1,2}_001.fastq.gz"), please check --extension.
- Sample identifiers are extracted from file names, i.e. the string before the first underscore (_); these must be unique
- If your data is scattered, produce a sample sheet
- All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs additionally requires the parameter --multiple_sequencing_runs and a specific folder structure.
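As a rough illustration of the naming rule above, the following Python sketch derives sample identifiers from file names and flags identifiers that do not own exactly one forward and one reverse file. It is not part of the pipeline; the function name `group_by_sample_id` is hypothetical and the check assumes paired-end data.

```python
from collections import defaultdict

def group_by_sample_id(file_names):
    """Map each sample ID (text before the first underscore) to its files."""
    groups = defaultdict(list)
    for name in file_names:
        sample_id = name.split("_", 1)[0]
        groups[sample_id].append(name)
    return dict(groups)

files = [
    "sample1_1_L001_R1_001.fastq.gz",
    "sample1_1_L001_R2_001.fastq.gz",
    "sample2_1_L001_R1_001.fastq.gz",
    "sample2_1_L001_R2_001.fastq.gz",
]
groups = group_by_sample_id(files)

# Assumption: paired-end data, so every ID should own exactly two files.
clashes = {sid: names for sid, names in groups.items() if len(names) != 2}
print(sorted(groups), clashes)
```

If `clashes` is not empty, two samples probably share the text before the first underscore, and a sample sheet is the safer input.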
Sample sheet
The sample sheet file is an alternative way to provide input reads. It must be a tab-separated file ending with .tsv that has two to four columns with the following headers:
- sampleID (required): Unique sample identifiers, any unique string (may not contain dots)
- forwardReads (required): Paths to (forward) reads zipped FastQ files
- reverseReads (optional): Paths to reverse reads zipped FastQ files, required if the data is paired-end
- run (optional): If the data was produced by multiple sequencing runs, any string
For example:
--input 'path/to/samplesheet.tsv'
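A sample sheet with the headers listed above can be produced with any tool that writes tab-separated text. The sketch below uses Python's standard `csv` module; the file paths, the run label and the helper name `write_sample_sheet` are made up for illustration.

```python
import csv
import io

HEADERS = ["sampleID", "forwardReads", "reverseReads", "run"]

def write_sample_sheet(rows, handle):
    """Write rows (dicts keyed by HEADERS) as a tab-separated sample sheet."""
    ids = [row["sampleID"] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("sampleID values must be unique")
    if any("." in sample_id for sample_id in ids):
        raise ValueError("sampleID values may not contain dots")
    writer = csv.DictWriter(
        handle, fieldnames=HEADERS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical paths, for illustration only.
rows = [
    {"sampleID": "sample1", "forwardReads": "data/s1_R1.fastq.gz",
     "reverseReads": "data/s1_R2.fastq.gz", "run": "A"},
    {"sampleID": "sample2", "forwardReads": "data/s2_R1.fastq.gz",
     "reverseReads": "data/s2_R2.fastq.gz", "run": "A"},
]
buffer = io.StringIO()
write_sample_sheet(rows, buffer)
sheet = buffer.getvalue()
print(sheet)
```

In practice the same function would be called with a file handle opened on a path ending with .tsv, which is then passed to --input.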
Fasta file
When pointing at a file ending with .fasta, .fna or .fa, the sequences it contains will be taxonomically classified. All other pipeline steps will be skipped.
This can be used to taxonomically classify previously produced ASV/OTU sequences.
For example:
--input 'path/to/amplicon_sequences.fasta'
Forward primer sequence
string
In amplicon sequencing methods, PCR with specific primers produces the amplicon of interest. These primer sequences need to be trimmed from the reads before further processing and are also required for producing an appropriate classifier. Do not supply technical sequences such as adapters here; give only the primer sequence that matches the biological amplicon.
For example:
--FW_primer "GTGYCAGCMGCCGCGGTAA" --RV_primer "GGACTACNVGGGTWTCTAAT"
Reverse primer sequence
string
In amplicon sequencing methods, PCR with specific primers produces the amplicon of interest. These primer sequences need to be trimmed from the reads before further processing and are also required for producing an appropriate classifier. Do not supply technical sequences such as adapters here; give only the primer sequence that matches the biological amplicon.
For example:
--FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer GGACTACNVGGGTWTCTAAT
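The example primers contain degenerate IUPAC nucleotide codes (such as Y, M, N, V and W). The Python sketch below shows how such a primer can be expanded into a regular expression to check by hand whether a read starts with it. It only illustrates the notation; the pipeline itself trims primers with Cutadapt, and the helper name `primer_to_regex` is hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_to_regex(primer):
    """Compile a primer with IUPAC codes into a regular expression."""
    try:
        parts = [f"[{IUPAC[base]}]" for base in primer.upper()]
    except KeyError as err:
        raise ValueError(f"not an IUPAC nucleotide code: {err}") from None
    return re.compile("".join(parts))

forward = primer_to_regex("GTGYCAGCMGCCGCGGTAA")
read = "GTGCCAGCAGCCGCGGTAATACGGAGGG"  # made-up read for illustration
print(forward.match(read) is not None)
```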
Path to metadata sheet; when missing, most downstream analyses are skipped (barplots, PCoA plots, ...).
string
This is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.
Related parameter:
- --metadata_category (optional) to choose columns that are used for testing significance
For example:
--metadata "path/to/metadata.tsv"
Please note the following requirements:
- The path must be enclosed in quotes
- The metadata file has to follow the QIIME2 specifications (https://docs.qiime2.org/2021.2/tutorials/metadata/)
The first column in the tab-separated metadata file is the sample identifier column (required header: ID) and defines the sample or feature IDs associated with your study. Metadata files are not required to have additional metadata columns, so a file containing only an ID column is a valid QIIME 2 metadata file. Additional columns defining metadata associated with each sample or feature ID are optional.
NB: without additional columns there might be no groupings for the downstream analyses.
Sample identifiers should be 36 characters long or less, and also contain only ASCII alphanumeric characters (i.e. in the range of [a-z], [A-Z], or [0-9]), or the dash (-) character. For downstream analysis, by default all numeric columns, blanks or NA are removed, and only columns with multiple different values but not all unique are selected.
The columns which are to be assessed can be specified by --metadata_category. If --metadata_category isn't specified, then all columns that fit the specification are automatically chosen.
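The selection rules described above can be sketched in Python as follows. This is a reading of the prose, not the pipeline's own code: the helper names, the treatment of "NA" and empty strings as missing values, and the in-memory table layout are all assumptions.

```python
import re

ID_PATTERN = re.compile(r"[A-Za-z0-9-]{1,36}")

def valid_sample_id(sample_id):
    """At most 36 characters, ASCII alphanumerics or dashes only."""
    return ID_PATTERN.fullmatch(sample_id) is not None

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def suitable_columns(table):
    """table maps a column name to its list of string values, one per sample."""
    chosen = []
    for name, values in table.items():
        kept = [v for v in values if v.strip() not in ("", "NA")]
        if not kept or all(is_number(v) for v in kept):
            continue  # empty or numeric columns are not used for grouping
        distinct = set(kept)
        # Needs several different values, but not one value per sample.
        if 1 < len(distinct) < len(kept):
            chosen.append(name)
    return chosen

metadata = {
    "treatment": ["a", "a", "b", "b"],
    "ph": ["6.5", "7.0", "6.8", "7.1"],
    "site": ["x", "x", "x", "x"],
    "label": ["s1", "s2", "s3", "s4"],
}
selected = suitable_columns(metadata)
print(selected)
```

Here only `treatment` qualifies: `ph` is numeric, `site` has a single value, and `label` is unique per sample.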
Define where the pipeline should find input data and save output data.
If data is single-ended PacBio reads instead of Illumina
boolean
If data is single-ended IonTorrent reads instead of Illumina
boolean
If data is single-ended Illumina reads instead of paired-end
boolean
If data is long read ITS sequences, that need to be cut to ITS region only for taxonomy assignment
boolean
If samples were sequenced in multiple sequencing runs
boolean
Expects one sub-folder per sequencing run in the folder specified by --input containing sequencing data of the specific run.
Sample identifiers are taken from sequencing files, specifically the string before the first underscore will be the sample ID. Sample IDs across all sequencing runs (all sequencing files) have to be unique. If this is not the case, please use a sample sheet as input instead.
Example for input data organization:
data
|-run1
|  |-sample1_1_L001_R1_001.fastq.gz
|  |-sample1_1_L001_R2_001.fastq.gz
|  |-sample2_1_L001_R1_001.fastq.gz
|  |-sample2_1_L001_R2_001.fastq.gz
|
|-run2
   |-sample3_1_L001_R1_001.fastq.gz
   |-sample3_1_L001_R2_001.fastq.gz
   |-sample4_1_L001_R1_001.fastq.gz
   |-sample4_1_L001_R2_001.fastq.gz
Example command to analyze this data in one pipeline run:
nextflow run nf-core/ampliseq \
-profile singularity \
--input "data" \
--FW_primer "GTGYCAGCMGCCGCGGTAA" \
--RV_primer "GGACTACNVGGGTWTCTAAT" \
--metadata "data/Metadata.tsv" \
--multiple_sequencing_runs
If analysing ITS amplicons or any other region with large length variability with Illumina paired end reads
boolean
This will cause the pipeline to
- not truncate input reads unless --trunclenf and --trunclenr override the defaults
- remove reverse complement primers from the end of reads in case the read length exceeds the amplicon length
Not recommended: When paired end reads are not sufficiently overlapping for merging.
boolean
This parameter specifies that paired-end reads are not merged after denoising but concatenated (separated by 10 N's). This is useful when the sequenced amplicon is too long for merging (i.e. bad experimental design). It is an alternative to analyzing only the forward or only the reverse read in case of non-overlapping paired-end sequencing data.
This parameter is not recommended! Use it only if all other options fail.
Mode of sample inference: "independent", "pooled" or "pseudo"
string
Whether samples are treated independently (lowest sensitivity and lowest resources), pooled (highest sensitivity and resources) or pseudo-pooled (balance between required resources and sensitivity).
Comma separated list of metadata column headers for statistics.
string
Here columns in the metadata sheet can be chosen with groupings that are used for diversity indices and differential abundance analysis. By default, all suitable columns in the metadata sheet will be used if this option is not specified. Suitable are columns which are categorical (not numerical) and have multiple different values which are not all unique. For example:
--metadata_category "treatment1,treatment2"
Please note the following requirements:
- Comma separated list enclosed in quotes
- May not contain whitespace characters
- Each comma separated term has to match exactly one column name in the metadata sheet
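The three requirements above can be checked before launching a run. The following Python sketch is one way to do so; `parse_metadata_category` is a hypothetical helper and the column names are invented.

```python
def parse_metadata_category(value, columns):
    """Split a --metadata_category value and verify it against column names."""
    if any(ch.isspace() for ch in value):
        raise ValueError("the list may not contain whitespace characters")
    terms = value.split(",")
    unknown = [term for term in terms if term not in columns]
    if unknown:
        raise ValueError(f"not a metadata column: {unknown}")
    return terms

columns = ["ID", "treatment1", "treatment2", "ph"]
terms = parse_metadata_category("treatment1,treatment2", columns)
print(terms)
```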
Naming of sequencing files
string
/*_R{1,2}_001.fastq.gz
Indicates the naming of sequencing files (default: "/*_R{1,2}_001.fastq.gz").
Please note:
- The prepended slash (/) is required
- The star (*) is the required wildcard for sample names
- The curly brackets ({}) enclose the orientation for paired end reads, separated by a comma (,)
- The pattern must be enclosed in quotes
For example for one sample (name: 1) with forward (file: 1_a.fastq.gz) and reverse (file: 1_b.fastq.gz) reads in folder data:
--input "data" --extension "/*_{a,b}.fastq.gz"
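To see which file names a given pattern would accept, the pattern can be translated into a regular expression. The Python sketch below does this for patterns with a single star and a single pair of curly brackets; it is an illustration only, the pipeline resolves the pattern through Nextflow, and `extension_to_regex` is a hypothetical helper.

```python
import re

def extension_to_regex(pattern):
    """Translate a pattern such as "/*_R{1,2}_001.fastq.gz" into a regex."""
    body = pattern.lstrip("/")
    parts = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "*":
            parts.append("(?P<prefix>.+)")  # wildcard part of the file name
        elif ch == "{":
            end = body.index("}", i)
            options = body[i + 1:end].split(",")
            alternatives = "|".join(re.escape(option) for option in options)
            parts.append(f"(?P<read>{alternatives})")  # read orientation
            i = end
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))

regex = extension_to_regex("/*_{a,b}.fastq.gz")
match = regex.fullmatch("1_a.fastq.gz")
print(match.group("prefix"), match.group("read"))
```

For the example above, `1_a.fastq.gz` and `1_b.fastq.gz` match with orientation `a` and `b`, while a file such as `1_c.fastq.gz` does not.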
If the functional potential of the bacterial community is predicted.
boolean
If data should be exported in SBDI (Swedish biodiversity infrastructure) Excel format.
boolean
Path to the output directory where the results will be saved.
string
./results
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
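The pattern listed for this parameter is an ordinary regular expression, so an address can be tested against it before starting a run. A minimal Python sketch, with made-up addresses:

```python
import re

# Pattern copied from the parameter description above.
EMAIL = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

def is_accepted(address):
    return EMAIL.match(address) is not None

for address in ("user@example.org", "user@example", "user@example.museum"):
    print(address, is_accepted(address))
```

Note that the pattern limits the final domain label to two to five letters, so an otherwise valid address with a longer top-level domain is rejected.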
Cutadapt will retain untrimmed reads; choose this only if input reads are not expected to contain primer sequences.
boolean
When read sequences are trimmed, untrimmed read pairs are discarded routinely. Use this option to retain untrimmed read pairs. This is usually not recommended and is only of advantage for specific protocols that prevent sequencing PCR primers.
Cutadapt will be run twice to ensure removal of potential double primers
boolean
Cutadapt will be run twice, first to remove reads without primers (default), then a second time to remove reads that erroneously contain a second set of primers. Not to be used with --retain_untrimmed.
DADA2 read truncation value for forward strand, set this to 0 for no truncation
integer
Read denoising by DADA2 creates an error profile specific to a sequencing run and uses this to correct sequencing errors. This method works best when all reads have the same length and as high a quality as possible while maintaining at least 20 bp overlap for merging. One cutoff for the forward read (--trunclenf) and one for the reverse read (--trunclenr) truncate all longer reads at that position and drop all shorter reads.
If not set, these cutoffs will be determined automatically for the position before the mean quality score drops below --trunc_qmin.
For example:
--trunclenf 180 --trunclenr 120
Please note:
- Overly aggressive truncation might lead to insufficient overlap for read merging
- Too little truncation might reduce denoised reads
- The code choosing these values automatically cannot take the points above into account, therefore checking read numbers is essential
DADA2 read truncation value for reverse strand, set this to 0 for no truncation
integer
Read denoising by DADA2 creates an error profile specific to a sequencing run and uses this to correct sequencing errors. This method works best when all reads have the same length and as high a quality as possible while maintaining at least 20 bp overlap for merging. One cutoff for the forward read (--trunclenf) and one for the reverse read (--trunclenr) truncate all longer reads at that position and drop all shorter reads.
If not set, these cutoffs will be determined automatically for the position before the mean quality score drops below --trunc_qmin.
For example:
--trunclenf 180 --trunclenr 120
Please note:
- Overly aggressive truncation might lead to insufficient overlap for read merging
- Too little truncation might reduce denoised reads
- The code choosing these values automatically cannot take the points above into account, therefore checking read numbers is essential
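The effect of the two cutoffs, and the overlap they leave for merging, can be estimated with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the amplicon length of 250 bp (after primer removal) is an assumed value that depends on the primers and organisms in a real data set.

```python
def truncate_reads(reads, trunc_len):
    """Cut longer reads at trunc_len and drop shorter ones; 0 disables truncation."""
    if trunc_len == 0:
        return list(reads)
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]

def merge_overlap(trunclenf, trunclenr, amplicon_length):
    """Bases shared by the truncated forward and reverse read."""
    return trunclenf + trunclenr - amplicon_length

kept = truncate_reads(["ACGTACGT", "ACG", "ACGTA"], 5)
print(kept)

# Assumed amplicon length of 250 bp, for illustration only.
overlap = merge_overlap(180, 120, 250)
print(overlap, overlap >= 20)
```

With these assumed numbers the example values --trunclenf 180 --trunclenr 120 leave 50 bp of overlap, which is above the 20 bp mentioned above; a longer amplicon would need less aggressive truncation.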
If --trunclenf and --trunclenr are not set, these values will be automatically determined using this median quality score
integer
25
Automatically determine --trunclenf and --trunclenr before the median quality score drops below --trunc_qmin. The fraction of reads retained is defined by --trunc_rmin, which might override the quality cutoff.
For example:
--trunc_qmin 35
Please note:
- The code choosing --trunclenf and --trunclenr using --trunc_qmin automatically cannot take amplicon length or overlap requirements for merging into account, therefore use with caution.
- A minimum value of 25 is recommended. However, high quality data with a large paired sequence overlap might justify a higher value (e.g. 35). Also, very low quality data might require a lower value.
- If the quality cutoff is too low to include a certain fraction of reads that is specified by --trunc_rmin (e.g. 0.75 means at least 75% of reads are retained), a lower cutoff according to --trunc_rmin supersedes the quality cutoff.
Assures that values chosen with --trunc_qmin will retain a fraction of reads.
number
0.75
Value can range from 0 to 1. 0 means no reads need to be retained and 1 means all reads need to be retained. The shorter of the two lengths derived from --trunc_qmin and --trunc_rmin is chosen as the DADA2 cutoff.
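How the two settings interact can be sketched as follows. This Python example is an interpretation of the description above, not the pipeline's implementation: the function names, the per-position median qualities and the read lengths are all invented for illustration.

```python
import math

def quality_cutoff(median_quality, qmin):
    """Number of leading positions before the median quality drops below qmin."""
    for position, quality in enumerate(median_quality):
        if quality < qmin:
            return position
    return len(median_quality)

def retention_cutoff(read_lengths, rmin):
    """Longest cutoff that still keeps at least a fraction rmin of the reads."""
    ordered = sorted(read_lengths, reverse=True)
    needed = max(1, math.ceil(rmin * len(ordered)))
    return ordered[needed - 1]

# Invented data: quality stays at 36 for 245 positions, then drops to 20.
median_quality = [36] * 245 + [20] * 5
read_lengths = [250, 250, 240, 200]

by_quality = quality_cutoff(median_quality, qmin=25)
by_retention = retention_cutoff(read_lengths, rmin=0.75)
chosen = min(by_quality, by_retention)
print(by_quality, by_retention, chosen)
```

In this made-up case the quality-based cutoff (245) would drop half of the reads, so the retention-based cutoff (240), which keeps three of four reads, is used instead.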
DADA2 read filtering option
integer
2
After truncation, reads with more than max_ee "expected errors" will be discarded. In case of very long reads, you might want to increase this value. We recommend (to start with) a value corresponding to approximately 1 expected error per 100-200 bp (default: 2).
DADA2 read filtering option
integer
Remove reads with length greater than max_len after trimming and truncation. Must be a positive integer.
DADA2 read filtering option
integer
50
Remove reads with length less than min_len after trimming and truncation.
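The three filtering options can be pictured with a small Python sketch. The number of expected errors of a read is the sum of its per-base error probabilities, where a Phred quality score Q corresponds to an error probability of 10^(-Q/10). The function names and the way the three checks are combined are assumptions made for illustration; DADA2 performs the actual filtering.

```python
def expected_errors(qualities):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in qualities)

def keep_read(qualities, max_ee=2, min_len=50, max_len=None):
    """Apply length limits and the expected-error limit to one read."""
    length = len(qualities)
    if length < min_len:
        return False
    if max_len is not None and length > max_len:
        return False
    return expected_errors(qualities) <= max_ee

good = [30] * 200   # 200 bases at Q30: about 0.2 expected errors
noisy = [10] * 60   # 60 bases at Q10: about 6 expected errors
print(keep_read(good), keep_read(noisy), keep_read([30] * 40))
```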
Name of supported database, and optionally also version number
string
Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. silva=138). This will download the desired database, format it to produce a file that is compatible with DADA2's assignTaxonomy and another file that is compatible with DADA2's addSpecies.
The following databases are supported:
- GTDB - Genome Taxonomy Database - 16S rRNA
- PR2 - Protist Reference Ribosomal Database - 18S rRNA
- RDP - Ribosomal Database Project - 16S rRNA
- SILVA ribosomal RNA gene database project - 16S rRNA
- UNITE - eukaryotic nuclear ribosomal ITS region - ITS
Generally, using gtdb, pr2, rdp, sbdi-gtdb, silva, unite-fungi, or unite-alleuk will select the most recent supported version. For details on what values are valid, please either use an invalid value such as x (causing the pipeline to send an error message with a list of all valid values) or see conf/ref_databases.config.
Please note that commercial/non-academic entities require licensing for SILVA v132 database (non-default) but not from v138 on (default).
If the expected amplified sequences are extracted from the DADA2 reference taxonomy database
boolean
Expected amplified sequences are extracted from the DADA2 reference taxonomy using the primer sequences, which might improve classification. This is not applied to species classification (assignSpecies) but only for lower taxonomic levels (assignTaxonomy).
Name of supported database, and optionally also version number
string
Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. silva=138). This will download the desired database and initiate taxonomic classification with QIIME2 and the chosen database.
If both --dada_ref_taxonomy and --qiime_ref_taxonomy are used, DADA2 classification will be used for downstream analysis.
The following databases are supported:
- SILVA ribosomal RNA gene database project - 16S rRNA
- UNITE - eukaryotic nuclear ribosomal ITS region - ITS
- Greengenes (only testing!)
Generally, using silva, unite-fungi, or unite-alleuk will select the most recent supported version. For testing purposes, the tiny database greengenes85 (dereplicated at 85% sequence similarity) is available. For details on what values are valid, please either use an invalid value such as x (causing the pipeline to send an error message with all valid values) or see conf/ref_databases.config.
Path to QIIME2 trained classifier file (typically *-classifier.qza)
string
Use this if you have previously trained a compatible classifier from sources such as SILVA (https://www.arb-silva.de/), Greengenes (http://greengenes.secondgenome.com/downloads) or RDP (https://rdp.cme.msu.edu/).
For example:
--classifier "FW_primer-RV_primer-classifier.qza"
Please note the following requirements:
- The path must be enclosed in quotes
- The classifier is a Naive Bayes classifier produced by qiime feature-classifier fit-classifier-naive-bayes (e.g. by this pipeline)
- The primer pair for the amplicon PCR and the computing of the classifier are exactly the same (or full-length, potentially lower performance)
- The classifier has to be trained by the same version of scikit-learn as this version of the pipeline uses
Comma separated list of unwanted taxa, to skip taxa filtering use "none"
string
mitochondria,chloroplast
Depending on the primers used, PCR might amplify unwanted or off-target DNA. By default sequences originating from mitochondria or chloroplasts are removed. The taxa specified are excluded from further analysis.
For example to exclude any taxa that contain mitochondria, chloroplast, or archaea:
--exclude_taxa "mitochondria,chloroplast,archaea"
If you prefer not filtering the data, specify:
--exclude_taxa "none"
Please note the following requirements:
- Comma separated list enclosed in quotes
- May not contain whitespace characters
- Features that contain one or several of these terms in their taxonomical classification are excluded from further analysis
- The taxonomy level is not taken into consideration
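The behaviour described by these requirements amounts to a substring search over each feature's taxonomy string. The Python sketch below illustrates that idea; the helper name, the feature identifiers and taxonomy strings are invented, and matching without regard to upper or lower case is an assumption.

```python
def filter_taxa(taxonomy, exclude_taxa="mitochondria,chloroplast"):
    """Drop features whose taxonomy string contains any excluded term."""
    if exclude_taxa == "none":
        return dict(taxonomy)
    terms = [term.lower() for term in exclude_taxa.split(",")]
    return {
        feature: lineage
        for feature, lineage in taxonomy.items()
        # Assumption: terms are matched case-insensitively, at any level.
        if not any(term in lineage.lower() for term in terms)
    }

taxonomy = {
    "asv1": "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Mitochondria",
    "asv2": "Bacteria;Cyanobacteria;Cyanobacteriia;Chloroplast",
    "asv3": "Bacteria;Firmicutes;Bacilli;Lactobacillales",
    "asv4": "Archaea;Euryarchaeota;Methanobacteria",
}
print(sorted(filter_taxa(taxonomy)))
print(sorted(filter_taxa(taxonomy, "mitochondria,chloroplast,archaea")))
```

Because the taxonomy level is ignored, a term such as archaea removes every feature whose lineage mentions it anywhere.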
Abundance filtering
integer
1
Remove entries from the feature table below an absolute abundance threshold (default: 1, meaning filter is disabled). Singletons are often regarded as artifacts; choosing a value of 2 removes sequences with fewer than 2 total counts from the feature table.
For example to remove singletons choose:
--min_frequency 2
Prevalence filtering
integer
1
Filtering low prevalent features from the feature table, e.g. keeping only features that are present in at least two samples can be achieved by choosing a value of 2 (default: 1, meaning filter is disabled). Typically only used when having replicates for all samples.
For example to retain features that are present in at least two samples:
--min_samples 2
Please note this is independent of abundance.
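Abundance and prevalence filtering are two independent thresholds on the feature table. The following Python sketch shows both on a tiny invented table; the function name and the data layout (feature ID mapped to per-sample counts) are assumptions for illustration, since the pipeline performs this filtering with QIIME2.

```python
def filter_features(table, min_frequency=1, min_samples=1):
    """Keep features with enough total counts and enough samples containing them."""
    kept = {}
    for feature, counts in table.items():
        total = sum(counts)
        prevalence = sum(1 for count in counts if count > 0)
        if total >= min_frequency and prevalence >= min_samples:
            kept[feature] = counts
    return kept

# Invented counts for three samples.
table = {
    "asv1": [10, 0, 5],
    "asv2": [1, 0, 0],   # singleton
    "asv3": [0, 40, 0],  # abundant, but found in one sample only
}
print(sorted(filter_features(table, min_frequency=2)))
print(sorted(filter_features(table, min_samples=2)))
```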
Skip FastQC
boolean
Skip all steps that are executed by QIIME2, including QIIME2 software download, taxonomy assignment by QIIME2, barplots, relative abundance tables, diversity analysis, differential abundance testing.
boolean
Skip taxonomic classification
boolean
Skip producing barplot
boolean
Skip producing any relative abundance tables
boolean
Skip alpha rarefaction
boolean
Skip alpha and beta diversity analysis
boolean
Skip differential abundance testing
boolean
Skip MultiQC reporting
boolean
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional configs hostname.
string
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help. Specifying this option will tell the pipeline to show all parameters.
MultiQC report title. Printed as page header, used for filename if not otherwise specified.
string
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Do not use coloured log outputs.
boolean
Custom config file to supply to MultiQC.
string
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Boolean whether to validate parameters against the schema at runtime
boolean
true
Show all params when using --help
boolean
Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter.
boolean
Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
boolean
This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues.
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
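The patterns listed for --max_memory and --max_time are regular expressions, so candidate values can be checked against them before a run. A minimal Python sketch, with the patterns copied from above and a hypothetical helper `is_valid`:

```python
import re

MEMORY = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME = re.compile(r"^(\d+\.?\s*(s|m|h|day)\s*)+$")

def is_valid(pattern, value):
    return pattern.match(value) is not None

for value in ("128.GB", "8.GB", "8GiB"):
    print(value, is_valid(MEMORY, value))
for value in ("240.h", "2.h", "2 hours"):
    print(value, is_valid(TIME, value))
```

Values such as 8GiB or 2 hours do not fit the patterns; the documented forms are 8.GB and 2.h.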