Path to tab-separated sample sheet

type: string

Path to sample sheet, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml), that points to compressed fastq files.

The sample sheet must have two to four tab-separated columns/entries with the following headers:

  • sampleID (required): Unique sample IDs, must start with a letter, and can only contain letters, numbers or underscores
  • forwardReads (required): Paths to (forward) reads zipped FastQ files
  • reverseReads (optional): Paths to reverse reads zipped FastQ files, required if the data is paired-end
  • run (optional): If the data was produced by multiple sequencing runs, any string that identifies the run of each sample
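
The column rules above can be checked before starting the pipeline. The following is a minimal sketch, not the pipeline's own validation: the function name check_sample_sheet and the file paths are hypothetical, and only the tab-separated variant is handled.

```python
import csv
import io
import re

# Rule from the text: starts with a letter, then only letters, numbers or underscores.
SAMPLE_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_sample_sheet(tsv_text):
    """Return a list of problems found in a tab-separated sample sheet."""
    problems = []
    seen = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        sid = row.get("sampleID") or ""
        if not SAMPLE_ID.match(sid):
            problems.append(f"invalid sampleID: {sid!r}")
        if sid in seen:
            problems.append(f"duplicate sampleID: {sid!r}")
        seen.add(sid)
        if not row.get("forwardReads"):
            problems.append(f"missing forwardReads for {sid!r}")
    return problems

# Hypothetical sheet: the second ID starts with a digit and is therefore rejected.
sheet = (
    "sampleID\tforwardReads\treverseReads\trun\n"
    "sample1\tdata/s1_R1.fastq.gz\tdata/s1_R2.fastq.gz\tA\n"
    "2nd_sample\tdata/s2_R1.fastq.gz\tdata/s2_R2.fastq.gz\tA\n"
)
print(check_sample_sheet(sheet))
```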

Related parameters are:

  • --pacbio and --iontorrent if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)
  • --single_end if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)
  • Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)

Path to ASV/OTU fasta file

type: string

Path to fasta format file with sequences that will be taxonomically classified. The fasta file input option can be used to taxonomically classify previously produced ASV/OTU sequences.

The fasta sequence header line may contain a description, which will be kept as part of the sequence name. However, tabs will be changed into spaces.

Related parameters are:

  • Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)

Path to folder containing zipped FastQ files

type: string

Path to folder containing compressed fastq files.

Example for input data organization from one sequencing run with two samples, paired-end data:

data  
  ├─sample1_1_L001_R1_001.fastq.gz  
  ├─sample1_1_L001_R2_001.fastq.gz  
  ├─sample2_1_L001_R1_001.fastq.gz  
  └─sample2_1_L001_R2_001.fastq.gz  

Please note the following requirements:

  1. The path must be enclosed in quotes
  2. The folder must contain gzip compressed demultiplexed fastq files. If the file names do not follow the default ("/*_R{1,2}_001.fastq.gz"), please check --extension.
  3. Sample identifiers are extracted from file names, i.e. the string before the first underscore _, these must be unique
  4. If your data is scattered, produce a sample sheet
  5. All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs requires additionally the parameter --multiple_sequencing_runs and a specific folder structure.
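
Requirement 3 can be previewed with a few lines of Python. This is only a sketch of the rule as stated (the string before the first underscore); the helper names sample_id and duplicated_ids are hypothetical and not part of the pipeline.

```python
from collections import Counter
from pathlib import PurePosixPath

def sample_id(path):
    """Sample ID = part of the file name before the first underscore."""
    return PurePosixPath(path).name.split("_", 1)[0]

def duplicated_ids(forward_files):
    """Return sample IDs that occur for more than one forward-read file."""
    counts = Counter(sample_id(f) for f in forward_files)
    return sorted(sid for sid, n in counts.items() if n > 1)

forward = [
    "data/sample1_1_L001_R1_001.fastq.gz",
    "data/sample2_1_L001_R1_001.fastq.gz",
]
print([sample_id(f) for f in forward])  # ['sample1', 'sample2']
print(duplicated_ids(forward))          # []
```

A non-empty result from duplicated_ids suggests that the file names need to be changed or a sample sheet used instead.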

Related parameters are:

  • --pacbio and --iontorrent if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)
  • --single_end if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)
  • --multiple_sequencing_runs if the sequencing data originates from multiple sequencing runs
  • --extension if the sequencing file names do not follow the default ("/*_R{1,2}_001.fastq.gz")
  • Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)

Forward primer sequence

type: string

In amplicon sequencing methods, PCR with specific primers produces the amplicon of interest. These primer sequences need to be trimmed from the reads before further processing and are also required for producing an appropriate classifier. Do not use here any technical sequence such as adapter sequences but only the primer sequence that matches the biological amplicon.

For example:

--FW_primer "GTGYCAGCMGCCGCGGTAA" --RV_primer "GGACTACNVGGGTWTCTAAT"  

Reverse primer sequence

type: string

In amplicon sequencing methods, PCR with specific primers produces the amplicon of interest. These primer sequences need to be trimmed from the reads before further processing and are also required for producing an appropriate classifier. Do not use here any technical sequence such as adapter sequences but only the primer sequence that matches the biological amplicon.

For example:

--FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer GGACTACNVGGGTWTCTAAT  

Path to metadata sheet; when missing, most downstream analyses are skipped (barplots, PCoA plots, ...).

type: string

This is optional, but a metadata file is essential for downstream analyses such as barplots, diversity indices or differential abundance testing.

Related parameter:

  • --metadata_category (optional) to choose columns that are used for testing significance

For example:

--metadata "path/to/metadata.tsv"  

Please note the following requirements:

  1. The path must be enclosed in quotes
  2. The metadata file has to follow the QIIME2 specifications (https://docs.qiime2.org/2021.2/tutorials/metadata/)

The first column in the tab-separated metadata file is the sample identifier column (required header: ID) and defines the sample or feature IDs associated with your study. In addition to the sample identifier column, the metadata file is required to have at least one column with multiple different non-numeric values but not all unique.
NB: without additional columns there might be no groupings for the downstream analyses.

Sample identifiers should be 36 characters long or less, and also contain only ASCII alphanumeric characters (i.e. in the range of [a-z], [A-Z], or [0-9]), or the dash (-) character. For downstream analysis, by default all numeric columns, blanks or NA are removed, and only columns with multiple different values but not all unique are selected.

The columns to be assessed can be specified with --metadata_category. If --metadata_category isn't specified, then all columns that fit the specification are chosen automatically.
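
The automatic column selection described above can be approximated as follows. This is a sketch under stated assumptions, not the pipeline's code: the function suitable_columns and the example table are hypothetical, and blank or NA fields are assumed to be ignored within a column rather than disqualifying it.

```python
def is_number(value):
    """True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def suitable_columns(table):
    """table maps column name -> list of string values, one per sample."""
    selected = []
    for name, values in table.items():
        present = [v for v in values if v not in ("", "NA")]
        if not present or all(is_number(v) for v in present):
            continue  # empty or numeric column
        distinct = set(present)
        # multiple different values, but not all unique
        if 1 < len(distinct) < len(present):
            selected.append(name)
    return selected

metadata = {
    "treatment": ["a", "a", "b", "b"],      # suitable
    "replicate": ["r1", "r2", "r3", "r4"],  # all unique
    "site": ["x", "x", "x", "x"],           # only one value
    "ph": ["6.5", "7.0", "6.5", "7.1"],     # numeric
}
print(suitable_columns(metadata))  # ['treatment']
```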

Path to multi-region definition sheet, for multi-region analysis with Sidle

type: string

Path to file with information about sequenced regions, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml). This initiates scaffolding multiple regions along a reference.

The file must have four headers:

  • region: Unique region identifier
  • region_length: Minimal length of region
  • FW_primer: Forward primer sequence
  • RV_primer: Reverse primer sequence

For more details check the usage documentation.

Related parameters are:

  • --sidle_ref_taxonomy to select the reference taxonomic database
  • --sidle_ref_tax_custom for custom reference taxonomic database files
  • --sidle_ref_tree_custom for custom phylogenetic tree of reference taxonomic database

The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.

required
type: string

Save intermediate results such as QIIME2's qza and qzv files

type: boolean

Email address for completion summary.

type: string
pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
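
To see which addresses the pattern above accepts, it can be tried with Python's re module. The addresses below are made-up examples; note that the pattern limits the last domain label to 2-5 letters.

```python
import re

# Pattern copied from the parameter definition above.
EMAIL = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

def is_accepted(address):
    """True if the address matches the pattern of the --email parameter."""
    return EMAIL.match(address) is not None

for address in ["user@example.org", "user@localhost", "first.last@lab.example.museum"]:
    print(address, is_accepted(address))
# user@example.org True
# user@localhost False
# first.last@lab.example.museum False
```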

If data has binned quality scores such as Illumina NovaSeq

type: boolean

If data is single-ended PacBio reads instead of Illumina

type: boolean

If data is single-ended IonTorrent reads instead of Illumina

type: boolean

If data is single-ended Illumina reads instead of paired-end

type: boolean

When using a sample sheet with --input containing forward and reverse reads, specifying --single_end will only extract forward reads and treat the data as single ended instead of extracting forward and reverse reads.

If analysing ITS amplicons or any other region with large length variability with Illumina paired end reads

type: boolean

This will cause the pipeline to

  • not truncate input reads unless --trunclenf and --trunclenr are set to override the defaults
  • remove reverse complement primers from the end of reads in case the read length exceeds the amplicon length
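
When a read is longer than the amplicon, it runs into the reverse complement of the opposite primer, which is what the second point removes. A minimal sketch of computing that reverse complement, including IUPAC ambiguity codes, is shown here with the example primers used elsewhere in this document; the helper name reverse_complement is hypothetical and the actual trimming is done by the pipeline, not by this snippet.

```python
# Complement table for nucleotides and IUPAC ambiguity codes.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVN",
    "TGCAYRMKSWVHDBN",
)

def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(IUPAC_COMPLEMENT)[::-1]

print(reverse_complement("GTGYCAGCMGCCGCGGTAA"))   # TTACCGCGGCKGCTGRCAC
print(reverse_complement("GGACTACNVGGGTWTCTAAT"))  # ATTAGAWACCCBNGTAGTCC
```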

If using --input_folder: samples were sequenced in multiple sequencing runs

type: boolean

Expects one sub-folder per sequencing run in the folder specified by --input_folder containing sequencing data of the specific run.
Sample identifiers are taken from sequencing files, specifically the string before the first underscore will be the sample ID. Sample IDs across all sequencing runs (all sequencing files) have to be unique. If this is not the case, please use a sample sheet as input instead.

Example for input data organization:

data  
  |-run1  
  |  |-sample1_1_L001_R1_001.fastq.gz  
  |  |-sample1_1_L001_R2_001.fastq.gz  
  |  |-sample2_1_L001_R1_001.fastq.gz  
  |  |-sample2_1_L001_R2_001.fastq.gz  
  |  
  |-run2  
     |-sample3_1_L001_R1_001.fastq.gz  
     |-sample3_1_L001_R2_001.fastq.gz  
     |-sample4_1_L001_R1_001.fastq.gz  
     |-sample4_1_L001_R2_001.fastq.gz  

Example command to analyze this data in one pipeline run:

nextflow run nf-core/ampliseq \  
    -profile singularity \  
    --input_folder "data" \  
    --FW_primer "GTGYCAGCMGCCGCGGTAA" \  
    --RV_primer "GGACTACNVGGGTWTCTAAT" \  
    --metadata "data/Metadata.tsv" \  
    --multiple_sequencing_runs  

If using --input_folder: naming of sequencing files

type: string
default: /*_R{1,2}_001.fastq.gz

Indicates the naming of sequencing files (default: "/*_R{1,2}_001.fastq.gz").

Please note:

  1. The prepended slash (/) is required
  2. The star (*) is the required wildcard for sample names
  3. The curly brackets ({}) enclose the orientation for paired end reads, separated by a comma (,).
  4. The pattern must be enclosed in quotes

For example for one sample (name: 1) with forward (file: 1_a.fastq.gz) and reverse (file: 1_b.fastq.gz) reads in folder data:

--input_folder "data" --extension "/*_{a,b}.fastq.gz"  
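
To check a custom pattern against a list of file names before running the pipeline, the matching can be imitated in Python. This is only an illustration of how the slash, star and curly brackets combine, assuming one star and one bracket group; the helpers expand_braces, to_regex and pair_reads are hypothetical, and the pipeline's own file matching may differ in details.

```python
import re

def expand_braces(pattern):
    """Expand a single {x,y} group into one pattern per alternative."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]

def to_regex(glob_pattern):
    """Turn a pattern with exactly one star into a regular expression."""
    prefix, suffix = glob_pattern.split("*", 1)
    return re.compile(re.escape(prefix) + "(.+)" + re.escape(suffix) + "$")

def pair_reads(folder, extension, file_names):
    """Group file names by the text matched by the star."""
    pairs = {}
    for glob_pattern in expand_braces(folder + extension):
        regex = to_regex(glob_pattern)
        for name in sorted(file_names):
            match = regex.match(name)
            if match:
                pairs.setdefault(match.group(1), []).append(name)
    return pairs

files = ["data/1_a.fastq.gz", "data/1_b.fastq.gz", "data/notes.txt"]
print(pair_reads("data", "/*_{a,b}.fastq.gz", files))
# {'1': ['data/1_a.fastq.gz', 'data/1_b.fastq.gz']}
```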

Set read count threshold for failed samples.

type: integer
default: 1

Samples with fewer reads than this threshold at input or after trimming stop the pipeline. Using --ignore_empty_input_files or --ignore_failed_trimming ignores samples with read numbers below the threshold and lets the pipeline continue with fewer samples.

Ignore input files with too few reads.

type: boolean

Ignore input files with fewer reads than specified by --min_read_counts and continue the pipeline without those samples.

Spurious sequences sometimes lack primer sequences and primers introduce errors that can be removed in that step

Cutadapt will retain untrimmed reads, choose only if input reads are not expected to contain primer sequences.

type: boolean

When read sequences are trimmed, untrimmed read pairs are discarded routinely. Use this option to retain untrimmed read pairs. This is usually not recommended and is only of advantage for specific protocols that prevent sequencing PCR primers.

Sets the minimum overlap for valid matches of primer sequences with reads for cutadapt (-O).

type: integer
default: 3

Sets the maximum error rate for valid matches of primer sequences with reads for cutadapt (-e).

type: number
default: 0.1

Cutadapt will be run twice to ensure removal of potential double primers

type: boolean

Cutadapt will be run twice, first to remove reads without primers (default), then a second time to remove reads that erroneously contain a second set of primers. Not to be used with --retain_untrimmed.

Ignore files with too few reads after trimming.

type: boolean

Ignore files with fewer reads than specified by --min_read_counts after trimming and continue the pipeline without those samples.

Read trimming and quality filtering is supposed to reduce spurious results and aid error correction

DADA2 read truncation value for forward strand, set this to 0 for no truncation

type: integer

Read denoising by DADA2 creates an error profile specific to a sequencing run and uses this to correct sequencing errors. This method works best when all reads have the same length and as high a quality as possible while maintaining at least 20 bp overlap for merging. One cutoff for the forward read --trunclenf and one for the reverse read --trunclenr truncate all longer reads at that position and drop all shorter reads.
If not set, these cutoffs will be determined automatically for the position before the median quality score drops below --trunc_qmin.

For example:

--trunclenf 180 --trunclenr 120  

Please note:

  1. Overly aggressive truncation might lead to insufficient overlap for read merging
  2. Too little truncation might reduce denoised reads
  3. The code choosing these values automatically cannot take the points above into account, therefore checking read numbers is essential
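
The effect of a truncation cutoff and the resulting overlap can be estimated by hand. The sketch below only illustrates the rule stated above (longer reads are cut, shorter reads are dropped); the function names and the amplicon length of 253 bp are assumptions for the example, and the amplicon length refers to the sequence between the primers.

```python
def truncate_reads(reads, trunc_len):
    """Cut reads at trunc_len and drop shorter reads; 0 means no truncation."""
    if trunc_len == 0:
        return list(reads)
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]

def merge_overlap(trunclenf, trunclenr, amplicon_length):
    """Bases covered by both truncated reads of a pair spanning the amplicon."""
    return trunclenf + trunclenr - amplicon_length

reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGT"]
print(truncate_reads(reads, 8))      # ['ACGTACGT', 'ACGTACGT']
print(merge_overlap(180, 120, 253))  # 47
```

With these assumed numbers the overlap of 47 bp is above the 20 bp mentioned above; a longer amplicon or shorter cutoffs would reduce it.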

DADA2 read truncation value for reverse strand, set this to 0 for no truncation

type: integer

Read denoising by DADA2 creates an error profile specific to a sequencing run and uses this to correct sequencing errors. This method works best when all reads have the same length and as high a quality as possible while maintaining at least 20 bp overlap for merging. One cutoff for the forward read --trunclenf and one for the reverse read --trunclenr truncate all longer reads at that position and drop all shorter reads.
If not set, these cutoffs will be determined automatically for the position before the median quality score drops below --trunc_qmin.

For example:

--trunclenf 180 --trunclenr 120  

Please note:

  1. Overly aggressive truncation might lead to insufficient overlap for read merging
  2. Too little truncation might reduce denoised reads
  3. The code choosing these values automatically cannot take the points above into account, therefore checking read numbers is essential

If --trunclenf and --trunclenr are not set, these values will be automatically determined using this median quality score

type: integer
default: 25

Automatically determine --trunclenf and --trunclenr before the median quality score drops below --trunc_qmin. The fraction of reads retained is defined by --trunc_rmin, which might override the quality cutoff.

For example:

--trunc_qmin 35  

Please note:

  1. The code choosing --trunclenf and --trunclenr using --trunc_qmin automatically cannot take amplicon length or overlap requirements for merging into account, therefore use with caution.
  2. A minimum value of 25 is recommended. However, high quality data with a large paired sequence overlap might justify a higher value (e.g. 35). Also, very low quality data might require a lower value.
  3. If the quality cutoff is too low to include a certain fraction of reads that is specified by --trunc_rmin (e.g. 0.75 means at least 75% of reads are retained), a lower cutoff according to --trunc_rmin supersedes the quality cutoff.

Assures that values chosen with --trunc_qmin will retain a fraction of reads.

type: number
default: 0.75

Value can range from 0 to 1. 0 means no reads need to be retained and 1 means all reads need to be retained. The minimum lengths of --trunc_qmin and --trunc_rmin are chosen as DADA2 cutoffs.
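
How the two criteria might interact can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: the function names are hypothetical, the median at each position is taken over the reads that reach that position, and the final cutoff is the smaller of the two lengths.

```python
import math
from statistics import median

def length_by_quality(qualities, qmin):
    """Number of leading positions whose median quality is at least qmin."""
    longest = max(len(q) for q in qualities)
    for pos in range(longest):
        column = [q[pos] for q in qualities if len(q) > pos]
        if median(column) < qmin:
            return pos
    return longest

def length_by_retention(read_lengths, rmin):
    """Longest cutoff that still keeps at least a fraction rmin of reads."""
    ordered = sorted(read_lengths, reverse=True)
    needed = max(1, math.ceil(rmin * len(ordered)))
    return ordered[needed - 1]

def choose_cutoff(qualities, qmin=25, rmin=0.75):
    """Smaller of the quality-based and the retention-based length."""
    lengths = [len(q) for q in qualities]
    return min(length_by_quality(qualities, qmin), length_by_retention(lengths, rmin))

# Four toy reads given as per-base quality scores (hypothetical values).
quals = [
    [38, 38, 37, 30, 20],
    [38, 37, 36, 28, 18],
    [37, 36, 35, 24, 12],
    [38, 38, 20],
]
print(choose_cutoff(quals, qmin=25, rmin=0.75))  # 4
print(choose_cutoff(quals, qmin=25, rmin=1.0))   # 3
```

In the second call, retaining all reads forces the cutoff down to the length of the shortest read, which mirrors the statement that the retention requirement can override the quality cutoff.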

DADA2 read filtering option

type: integer
default: 2

After truncation, reads with more than max_ee expected errors will be discarded. In case of very long reads, you might want to increase this value. We recommend (to start with) a value corresponding to approximately 1 expected error per 100-200 bp (default: 2).
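
The number of expected errors of a read is the sum of the per-base error probabilities, where a Phred quality score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch with made-up quality scores; the function names are hypothetical:

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, 10^(-Q/10) for each base."""
    return sum(10 ** (-q / 10) for q in phred_scores)

def passes_max_ee(phred_scores, max_ee=2):
    """A read is kept if its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee

# 100 bases at Q30 (0.001 each) and 50 bases at Q20 (0.01 each).
read = [30] * 100 + [20] * 50
print(round(expected_errors(read), 3))  # 0.6
print(passes_max_ee(read))              # True
```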

DADA2 read filtering option

type: integer
default: 50

Remove reads with length less than min_len after trimming and truncation.

DADA2 read filtering option

type: integer

Remove reads with length greater than max_len after trimming and truncation. Must be a positive integer.

Ignore files with too few reads after quality filtering.

type: boolean

Ignore files with fewer reads than specified by --min_read_counts after quality filtering and continue the pipeline without those samples. Please review all quality trimming and filtering options before using this parameter. For example, one sample with shorter sequences than other samples might lose all sequences due to minimum length requirements by read truncation (see --trunclenf).

Mode of sample inference: "independent", "pooled" or "pseudo"

type: string

Whether samples are treated independently (lowest sensitivity and lowest resources), pooled (highest sensitivity and resources) or pseudo-pooled (balance between required resources and sensitivity).

Not recommended: When paired end reads are not sufficiently overlapping for merging.

type: boolean

This parameter specifies that paired-end reads are not merged after denoising but concatenated (separated by 10 N's). This is of advantage when an amplicon was sequenced that is too long for merging (i.e. bad experimental design). This is an alternative to only analyzing the forward or reverse read in case of non-overlapping paired-end sequencing data.

This parameter is not recommended! Use it only if all other options fail.
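
What a concatenated sequence looks like can be shown with a toy read pair. This sketch assumes that the reverse read is reverse-complemented before it is appended after the spacer of 10 N's; the function name and the reads are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def concatenate_pair(forward, reverse, spacer="N" * 10):
    """Join forward read, spacer and reverse-complemented reverse read."""
    return forward + spacer + reverse.translate(COMPLEMENT)[::-1]

print(concatenate_pair("ACGTT", "GGGCA"))  # ACGTTNNNNNNNNNNTGCCC
```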

ASV post-processing takes place after ASV computation but before taxonomic assignment, it will affect all downstream processes

Post-cluster ASVs with VSEARCH

type: boolean

ASVs will be clustered with VSEARCH using the id value found in --vsearch_cluster_id.

Pairwise Identity value used when post-clustering ASVs if --vsearch_cluster option is used (default: 0.97).

type: number
default: 0.97

Lowering or increasing this value can change the number of ASVs left over after clustering.

Raise stack size when filtering VSEARCH clusters

type: boolean
default: true

Setting to true adds 'ulimit -s unlimited' to the beginning of the filt_clusters.py command.

Enable SSU filtering. Comma separated list of kingdoms (domains) in Barrnap, a combination (or one) of "bac", "arc", "mito", and "euk". ASVs that have their lowest e-value in one of these kingdoms are kept.

type: string

Minimal ASV length

type: integer

Remove ASVs that are below the minimum length threshold (default: filter is disabled, otherwise 1). Increasing the threshold might reduce false positive ASVs (e.g. PCR off-targets).

Maximum ASV length

type: integer

Remove ASVs that are above the maximum length threshold (default: filter is disabled, otherwise 1000000). Lowering the threshold might reduce false positive ASVs (e.g. PCR off-targets).

Filter ASVs based on codon usage

type: boolean

ASVs will be kept only if their coding sequence contains no stop codon and its length is a multiple of 3.

Starting position of codon triplets

type: integer
default: 1

By default, when --filter_codons is set, the codons start from the first position of the ASV sequences. The start of the codons can be changed to any position.

Ending position of codon triplets

type: integer

By default, when --filter_codons is set, the codons are checked until the end of the ASV sequences. If you would like to change this setting, you can specify until which position of the ASV sequences the codon triplets are checked.

Please note that the length of the ASV from the beginning or from the --orf_start until this position must be a multiple of 3.

Define stop codons

type: string
default: TAA,TAG

By default, when --filter_codons is set, the codons TAA,TAG are set as stop codons. Here you can specify any comma-separated list of codons to be used as stop codons, e.g. --stop_codons "TAA,TAG,TGA"
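
The interplay of --filter_codons, --orf_start, --orf_end and --stop_codons can be sketched as below. This is not the pipeline's code: the function name is hypothetical and positions are assumed to be 1-based and inclusive, in line with the default start of 1.

```python
def passes_codon_filter(seq, orf_start=1, orf_end=None, stop_codons=("TAA", "TAG")):
    """True if the checked region is a multiple of 3 long and has no stop codon."""
    end = len(seq) if orf_end is None else orf_end
    region = seq[orf_start - 1:end].upper()
    if len(region) % 3 != 0:
        return False
    codons = (region[i:i + 3] for i in range(0, len(region), 3))
    return not any(codon in stop_codons for codon in codons)

print(passes_codon_filter("ATGGCCTGA"))                                     # True
print(passes_codon_filter("ATGGCCTGA", stop_codons=("TAA", "TAG", "TGA")))  # False
print(passes_codon_filter("ATGTAAGCC"))                                     # False
print(passes_codon_filter("AATGGCC", orf_start=2))                          # True
```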

Choose a method and database for taxonomic assignments to single-region amplicons

Name of supported database, and optionally also version number

type: string

Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. silva=138). This will download the desired database and format it to produce a file that is compatible with DADA2's assignTaxonomy and another file that is compatible with DADA2's addSpecies.

The following databases are supported:

  • GTDB - Genome Taxonomy Database - 16S rRNA
  • SBDI-GTDB, a Sativa-vetted version of the GTDB 16S rRNA
  • PR2 - Protist Reference Ribosomal Database - 18S rRNA
  • RDP - Ribosomal Database Project - 16S rRNA
  • SILVA ribosomal RNA gene database project - 16S rRNA
  • UNITE - eukaryotic nuclear ribosomal ITS region - ITS
  • COIDB - eukaryotic Cytochrome Oxidase I (COI) from The Barcode of Life Data System (BOLD) - COI

Generally, using gtdb, pr2, rdp, sbdi-gtdb, silva, coidb, unite-fungi, or unite-alleuk will select the most recent supported version.

Please note that commercial/non-academic entities require licensing for SILVA v132 database (non-default) but not from v138 on (default).

Path to a custom DADA2 reference taxonomy database

type: string

Overwrites --dada_ref_taxonomy. Either --skip_dada_addspecies (no species annotation) or --dada_ref_tax_custom_sp (species annotation) is additionally required. Consider also setting --dada_assign_taxlevels.

Must be compatible with DADA2's assignTaxonomy function: 'Can be compressed. This reference fasta file should be formatted so that the id lines correspond to the taxonomy (or classification) of the associated sequence, and each taxonomic level is separated by a semicolon.' See also https://rdrr.io/bioc/dada2/man/assignTaxonomy.html

Path to a custom DADA2 reference taxonomy database for species assignment

type: string

Requires --dada_ref_tax_custom. Must be compatible with DADA2's addSpecies function: 'Can be compressed. This reference fasta file should be formatted so that the id lines correspond to the genus-species binomial of the associated sequence.' See also https://rdrr.io/bioc/dada2/man/addSpecies.html

Comma separated list of taxonomic levels used in DADA2's assignTaxonomy function

type: string

Typically useful when providing a custom DADA2 reference taxonomy database with --dada_ref_tax_custom. If DADA2's addSpecies is used (default), the last element(s) of the comma separated string must be 'Genus' or 'Genus,Species'.

If the expected amplified sequences are extracted from the DADA2 reference taxonomy database

type: boolean

Expected amplified sequences are extracted from the DADA2 reference taxonomy using the primer sequences, which might improve classification. This is not applied to species classification (assignSpecies) but only to lower taxonomic levels (assignTaxonomy).

If multiple exact matches against different species are returned

type: boolean

Defines the behavior when multiple exact matches against different species are returned. By default only unambiguous identifications are returned. If TRUE, a concatenated string of all exactly matched species is returned.

If the reverse-complement of each sequence will also be tested for classification

type: boolean

The reverse-complement of each sequence will be used for classification if it is a better match to the reference sequences than the forward sequence.

Newick file with reference phylogenetic tree. Requires also --pplace_aln and --pplace_model.

type: string

File with reference sequences. Requires also --pplace_tree and --pplace_model.

type: string

Phylogenetic model to use in placement, e.g. 'LG+F' or 'GTR+I+F'. Requires also --pplace_tree and --pplace_aln.

type: string

Method used for alignment, "hmmer" or "mafft"

type: string

Tab-separated file with taxonomy assignments of reference sequences.

type: string

Headerless, tab-separated, first column with tree leaves, second column with taxonomy ranks separated by semicolon ;. The results take precedence over DADA2 and QIIME2 classifications.

A name for the run

hidden
type: string

Name of supported database, and optionally also version number

type: string

Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. silva=138). This will download the desired database and initiate taxonomic classification with QIIME2 and the chosen database.

If both --dada_ref_taxonomy and --qiime_ref_taxonomy are used, DADA2 classification will be used for downstream analysis.

The following databases are supported:

  • SILVA ribosomal RNA gene database project - 16S rRNA
  • UNITE - eukaryotic nuclear ribosomal ITS region - ITS
  • Greengenes (only testing!)

Generally, using silva, unite-fungi, or unite-alleuk will select the most recent supported version. For testing purposes, the tiny database greengenes85 (dereplicated at 85% sequence similarity) is available. For details on what values are valid, please either use an invalid value such as x (causing the pipeline to send an error message with all valid values) or see conf/ref_databases.config.

Path to files of a custom QIIME2 reference taxonomy database (tarball, or two comma-separated files)

type: string

Overwrites --qiime_ref_taxonomy. Either path to tarball (*.tar.gz or *.tgz) that contains sequence (*.fna) and taxonomy (*.tax) data, or alternatively a comma separated pair of filepaths to sequence (*.fna) and taxonomy (*.tax) data (possibly gzipped *.gz).

Path to QIIME2 trained classifier file (typically *-classifier.qza)

type: string

Use this if you have previously trained a compatible classifier from sources such as SILVA (https://www.arb-silva.de/), Greengenes (http://greengenes.secondgenome.com/downloads) or RDP (https://rdp.cme.msu.edu/).

For example:

--classifier "FW_primer-RV_primer-classifier.qza"  

Please note the following requirements:

  1. The path must be enclosed in quotes
  2. The classifier is a Naive Bayes classifier produced by qiime feature-classifier fit-classifier-naive-bayes (e.g. by this pipeline)
  3. The primer pair for the amplicon PCR and the computing of the classifier are exactly the same (or full-length, potentially lower performance)
  4. The classifier has to be trained by the same version of scikit-learn as this version of the pipeline uses

Name of supported database, and optionally also version number

type: string

Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. silva=138). This will download the desired database and initiate taxonomic classification with Kraken2 and the chosen database.

Consider using --kraken2_confidence to set a confidence score threshold.

The following databases are supported:

  • RDP - Ribosomal Database Project - 16S rRNA
  • SILVA ribosomal RNA gene database project - 16S rRNA
  • Greengenes - 16S rRNA
  • Standard Kraken2 database (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) - any amplicon

Generally, using rdp, silva, greengenes, standard will select the most recent supported version.

Please note that commercial/non-academic entities require licensing for SILVA v132 database (non-default) but not from v138 on.

Path to a custom Kraken2 reference taxonomy database (.tar.gz|.tgz archive or folder)

type: string

Overwrites --kraken2_ref_taxonomy. Consider also setting --kraken2_assign_taxlevels. Can be compressed tar archive (.tar.gz|.tgz) or folder containing the database. See also https://benlangmead.github.io/aws-indexes/k2.

Comma separated list of taxonomic levels used in Kraken2. Will overwrite default values.

type: string

Typically useful when providing a custom Kraken2 reference taxonomy database with --kraken2_ref_tax_custom. In case a database is given with --kraken2_ref_taxonomy, the default taxonomic levels will be overwritten with --kraken2_assign_taxlevels.

Confidence score threshold for taxonomic classification.

type: number

Increasing the threshold will require more k-mers to match at a taxonomic level and reduce the taxonomic levels shown until the threshold is met.

Name of supported database, and optionally also version number

type: string

Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (=, e.g. coidb=221216). This will download the desired database and initiate taxonomic classification with VSEARCH sintax and the chosen database, which if needed is formatted to produce a file that is compatible with VSEARCH sintax.

The following databases are supported:

  • COIDB - eukaryotic Cytochrome Oxidase I (COI) from The Barcode of Life Data System (BOLD) - COI
  • UNITE - eukaryotic nuclear ribosomal ITS region - ITS

Generally, using coidb, unite-fungi, or unite-alleuk will select the most recent supported version.

If ASVs should be assigned to UNITE species hypotheses (SHs). Only relevant for ITS data.

type: boolean

Part of ITS region to use for taxonomy assignment: "full", "its1", or "its2"

type: string

If data is long read ITS sequences, that need to be cut to ITS region (full ITS, only ITS1, or only ITS2) for taxonomy assignment.

Cutoff for partial ITS sequences. Only full sequences by default.

type: integer

If using cut_its, this option allows partial ITS sequences, longer than the specified cutoff.

Choose database for taxonomic assignments with multi-region amplicons using SIDLE

Name of supported database, and optionally also version number

type: string

Comma separated paths to three files: reference taxonomy sequences (.fasta), aligned reference taxonomy sequences (.fasta), reference taxonomy strings (.txt)

type: string

Consider also setting --sidle_ref_tree_custom. Example usage: --sidle_ref_tax_custom 'rep_set_99.fasta,rep_set_aligned_99.fasta,taxonomy_99_taxonomy.txt'

Path to SIDLE reference taxonomy tree (*.qza)

type: string
pattern: ^.*\.qza$

Overwrites tree chosen by --sidle_ref_taxonomy

Filtering by taxonomy or abundance will affect all downstream analysis

Comma separated list of unwanted taxa, to skip taxa filtering use "none"

type: string
default: mitochondria,chloroplast

Depending on the primers used, PCR might amplify unwanted or off-target DNA. By default sequences originating from mitochondria or chloroplasts are removed. The taxa specified are excluded from further analysis.
For example to exclude any taxa that contain mitochondria, chloroplast, or archaea:

--exclude_taxa "mitochondria,chloroplast,archaea"  

If you prefer not filtering the data, specify:

--exclude_taxa "none"  

Please note the following requirements:

  1. Comma separated list enclosed in quotes
  2. May not contain whitespace characters
  3. Features that contain one or several of these terms in their taxonomical classification are excluded from further analysis
  4. The taxonomy level is not taken into consideration
  5. Taxa names should be as in the taxonomic database (default: Silva138), for example: 'Bacteria', 'Armatimonadia', 'unidentified', 'p__'
  6. Taxon names are case-insensitive and partial match is possible.
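
Points 3, 4 and 6 amount to a case-insensitive substring search over the whole taxonomy string. A minimal sketch of that behaviour follows; the function name is_excluded and the taxonomy strings are hypothetical examples, not the pipeline's code.

```python
def is_excluded(taxonomy, exclude_taxa="mitochondria,chloroplast"):
    """True if any unwanted term occurs anywhere in the taxonomy string."""
    if exclude_taxa == "none":
        return False
    terms = [term.lower() for term in exclude_taxa.split(",")]
    return any(term in taxonomy.lower() for term in terms)

print(is_excluded("Bacteria;Cyanobacteria;Cyanobacteriia;Chloroplast"))  # True
print(is_excluded("Bacteria;Firmicutes;Bacilli"))                        # False
print(is_excluded("Archaea;Crenarchaeota", "mitochondria,chloroplast,archaea"))  # True
```

Because partial matches count, a short term can remove more taxa than intended, so the list is worth checking against the taxonomy of the chosen database.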

Abundance filtering

type: integer
default: 1

Remove entries from the feature table below an absolute abundance threshold (default: 1, meaning filter is disabled). Singletons are often regarded as artifacts; choosing a value of 2 removes sequences with fewer than 2 total counts from the feature table.

For example to remove singletons choose:

--min_frequency 2  

Prevalence filtering

type: integer
default: 1

Filter low-prevalence features from the feature table, e.g. keeping only features that are present in at least two samples can be achieved by choosing a value of 2 (default: 1, meaning filter is disabled). Typically only used when having replicates for all samples.

For example to retain features that are present in at least two samples:

--min_samples 2  

Please note this is independent of abundance.
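
The difference between abundance filtering (--min_frequency) and prevalence filtering (--min_samples) can be seen on a toy feature table. This is a sketch only; the function name filter_features and the counts are hypothetical.

```python
def filter_features(table, min_frequency=1, min_samples=1):
    """table maps feature ID -> list of counts, one per sample."""
    kept = {}
    for feature, counts in table.items():
        total = sum(counts)                                # abundance
        prevalence = sum(1 for c in counts if c > 0)       # number of samples
        if total >= min_frequency and prevalence >= min_samples:
            kept[feature] = counts
    return kept

table = {
    "ASV1": [10, 0, 5],  # abundant, in two samples
    "ASV2": [1, 0, 0],   # singleton
    "ASV3": [0, 40, 0],  # abundant, in one sample only
}
print(sorted(filter_features(table, min_frequency=2)))  # ['ASV1', 'ASV3']
print(sorted(filter_features(table, min_samples=2)))    # ['ASV1']
```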

Metadata is used here to visualize data either for quality control or publication ready figures

Comma separated list of metadata column headers for statistics.

type: string

Here columns in the metadata sheet can be chosen with groupings that are used for diversity indices and differential abundance analysis. By default, all suitable columns in the metadata sheet will be used if this option is not specified. Suitable are columns which are categorical (not numerical) and have multiple different values which are not all unique. For example:

--metadata_category "treatment1,treatment2"  

Please note the following requirements:

  1. Comma separated list enclosed in quotes
  2. May not contain whitespace characters
  3. Each comma separated term has to match exactly one column name in the metadata sheet
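
The definition of a "suitable" column given above (categorical, more than one distinct value, not all values unique) can be sketched in Python as follows. This is an interpretation for illustration only: the helper names, the example columns, and the rule that a column counts as numerical when every value parses as a number are assumptions, not the pipeline's actual logic.

```python
def is_numeric(value):
    """True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def suitable_columns(metadata):
    """Return column names that are categorical, vary, and repeat at least one value."""
    suitable = []
    for name, values in metadata.items():
        distinct = set(values)
        categorical = not all(is_numeric(v) for v in values)
        if categorical and 1 < len(distinct) < len(values):
            suitable.append(name)
    return suitable


# Hypothetical metadata sheet: column name -> values for four samples.
metadata = {
    "treatment1": ["a", "a", "b", "b"],      # categorical with repeated groups
    "replicate": ["r1", "r2", "r3", "r4"],   # all values unique
    "site": ["x", "x", "x", "x"],            # only one value
    "ph": ["6.5", "7.0", "6.5", "7.2"],      # numerical
}
print(suitable_columns(metadata))
```

Only "treatment1" qualifies in this toy sheet; the other three columns each fail one of the three criteria.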

Comma separated list of metadata column headers for plotting average relative abundance barplots.

type: string

Here columns in the metadata sheet can be chosen with groupings that are used for average relative abundance barplots. Samples that have empty fields for that column are discarded. For example:

--metadata_category_barplot "treatment1,treatment2"  

Please note the following requirements:

  1. Comma separated list enclosed in quotes
  2. May not contain whitespace characters
  3. Each comma separated term has to match exactly one column name in the metadata sheet

Formula for QIIME2 ADONIS metadata feature importance test for beta diversity distances

type: string

Comma separated list of model formula(s), e.g. "treatment1,treatment2". A model formula should contain only independent terms in the sample metadata. These can be continuous variables or factors, and they can have interactions as in a typical R formula. Essentially, any column in the metadata sheet can be chosen that has no empty values and whose values are neither all unique nor all identical.
For example, "treatment1+treatment2" tests whether the data partitions based on "treatment1" and "treatment2" sample metadata. "treatment1*treatment2" tests both of those effects as well as their interaction.
More examples can be found in the R documentation, https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models

If the functional potential of the bacterial community is predicted.

type: boolean

If data should be exported in SBDI (Swedish biodiversity infrastructure) Excel format.

type: boolean

Minimum rarefaction depth for diversity analysis. Any sample below that threshold will be removed.

type: integer
default: 500

Minimum taxonomy agglomeration level for taxonomic classifications

type: integer
default: 2

Depends on the reference taxonomy database used.

Maximum taxonomy agglomeration level for taxonomic classifications

type: integer
default: 6

Depends on the reference taxonomy database used. Most default databases have genus level at 6.

Differential abundance analysis relies on provided metadata

Minimum sample counts to retain a sample for ANCOM analysis. Any sample below that threshold will be removed.

type: integer
default: 1

Perform differential abundance analysis with ANCOM

type: boolean

Perform differential abundance analysis with ANCOMBC

type: boolean

ANCOMBC will be performed on all suitable columns in the metadata sheet. Empty values will be removed, so it is possible to perform tests on subsets. The reference level will default to the highest alphanumeric group (e.g. in alphabetical or numeric order, as applicable) within each metadata column. Formulas for specific tests can be supplied with --ancombc_formula.

Formula to perform differential abundance analysis with ANCOMBC

type: string

Comma separated list of model formula(s), e.g. "treatment1,treatment2". The reference level will default to the highest alphanumeric group (e.g. in alphabetical or numeric order, as applicable) within each formula term. The reference level can be overwritten by --ancombc_formula_reflvl. A model formula should contain only independent terms in the sample metadata. These can be continuous variables or factors, and they can have interactions as in a typical R formula. Essentially, any column in the metadata sheet can be chosen that has no empty values and whose values are neither all unique nor all identical.
For example, "treatment1+treatment2" tests whether the data partitions based on "treatment1" and "treatment2" sample metadata. "treatment1*treatment2" tests both of those effects as well as their interaction.
More examples can be found in the R documentation, https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models

Reference level for --ancombc_formula

type: string

This only affects ANCOM-BC runs started by --ancombc_formula, but it applies to all provided model formulas, so it might be best to restrict --ancombc_formula to one formula. The syntax is as follows: 'column_name::column_value' or for multiple 'column_name1::column_value1 column_name2::column_value2'
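
To make the 'column_name::column_value' syntax concrete, here is a small Python sketch that splits such a string into column/level pairs. The function name and the example column names and values are hypothetical; the sketch only mirrors the syntax shown above (pairs separated by spaces, name and value separated by "::") and is not the pipeline's parser.

```python
def parse_reference_levels(value):
    """Turn 'col1::lvl1 col2::lvl2' into {'col1': 'lvl1', 'col2': 'lvl2'}."""
    levels = {}
    for pair in value.split():
        column, sep, level = pair.partition("::")
        if not sep or not column or not level:
            raise ValueError(f"expected 'column_name::column_value', got {pair!r}")
        levels[column] = level
    return levels


# Hypothetical metadata columns and reference values.
single = parse_reference_levels("treatment1::control")
multiple = parse_reference_levels("treatment1::control site::north")
print(single, multiple)
```

Because pairs are separated by whitespace in this syntax, the sketch assumes that neither column names nor values contain spaces.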

Effect size threshold for differential abundance barplot for --ancombc and --ancombc_formula

type: number
default: 1

Significance threshold for differential abundance barplot for --ancombc and --ancombc_formula

type: number
default: 0.05

Customization of the pipeline report

Path to Markdown file (Rmd)

type: string
default: ${projectDir}/assets/report_template.Rmd

Path to style file (css)

type: string
default: ${projectDir}/assets/nf-core_style.css

Path to logo file (png)

type: string
default: ${projectDir}/assets/nf-core-ampliseq_logo_light_long.png

String used as report title

type: string
default: Summary of analysis results

Path to Markdown file (md) that replaces the 'Abstract' section

type: string

Skip FastQC

type: boolean

Skip primer trimming with cutadapt. This is not recommended! Use only if primer sequences were already removed and the data does not contain any primer sequences.

type: boolean

Skip quality check with DADA2. Can only be skipped when --trunclenf and --trunclenr are set.

type: boolean

Skip annotating SSU matches.

type: boolean

Skip all steps that are executed by QIIME2, including QIIME2 software download, taxonomy assignment by QIIME2, barplots, relative abundance tables, diversity analysis, differential abundance testing.

type: boolean

Skip steps that are executed by QIIME2 except for taxonomic classification. Skip steps including barplots, relative abundance tables, diversity analysis, differential abundance testing.

type: boolean

Skip taxonomic classification. Incompatible with --sbdiexport

type: boolean

Skip taxonomic classification with DADA2

type: boolean

Skip species level when using DADA2 for taxonomic classification. This reduces the required memory dramatically under certain conditions. Incompatible with --sbdiexport

type: boolean

Skip producing barplot

type: boolean

Skip producing any relative abundance tables

type: boolean

Skip alpha rarefaction

type: boolean

Skip alpha and beta diversity analysis

type: boolean

Skip MultiQC reporting

type: boolean

Skip Markdown summary report

type: boolean

Less common options for the pipeline, typically set in a config file.

Specifies the random seed.

type: integer
default: 100

Display version and exit.

hidden
type: boolean

Method used to save pipeline results to output directory.

hidden
type: string

The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.

Email address for completion summary, only when pipeline fails.

hidden
type: string
pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
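
The pattern listed above can be tried out before launching a run. The following Python sketch copies the pattern verbatim and checks a few hypothetical addresses; it assumes that Python's re module treats this particular expression the same way as the schema validator does, which holds for the simple constructs used here but is not guaranteed in general.

```python
import re

# Pattern copied from the parameter definition above.
EMAIL_PATTERN = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")


def is_valid_email(address):
    """True if the address matches the documented pattern."""
    return EMAIL_PATTERN.match(address) is not None


# Hypothetical addresses, for illustration only.
for candidate in ["user.name@example.org", "user@localhost", "user+tag@example.org"]:
    print(candidate, is_valid_email(candidate))
```

Note that the pattern is stricter than general email syntax: it requires a final domain label of two to five letters and does not allow characters such as "+" in the local part.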

Send plain-text email instead of HTML.

hidden
type: boolean

File size limit when attaching MultiQC reports to summary emails.

hidden
type: string
default: 25.MB
pattern: ^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
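
The size pattern above accepts values such as the default 25.MB. A short Python sketch, using the pattern verbatim, shows which spellings pass; the candidate values are made up for illustration, and equivalence between Python's re module and the schema validator is assumed for this simple expression.

```python
import re

# Pattern copied from the parameter definition above.
SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")


def is_valid_size(value):
    """True if the value matches the documented file size pattern."""
    return SIZE_PATTERN.match(value) is not None


# Hypothetical candidate values.
for candidate in ["25.MB", "25 MB", "1.5GB", "100B", "25MiB", "25 mb", "MB"]:
    print(repr(candidate), is_valid_size(candidate))
```

The unit is case-sensitive and limited to B, KB, MB, GB and TB, so binary-style units such as MiB and lowercase spellings are rejected, as is a unit without a number.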

Do not use coloured log outputs.

hidden
type: boolean

Incoming hook URL for messaging service

hidden
type: string

Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.

Custom config file to supply to MultiQC.

hidden
type: string

Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file

hidden
type: string

Custom MultiQC yaml file containing HTML including a methods description.

type: string

Boolean whether to validate parameters against the schema at runtime

hidden
type: boolean
default: true

Base URL or local path to location of pipeline test dataset files

hidden
type: string
default: https://raw.githubusercontent.com/nf-core/test-datasets/

Parameters used to describe centralised config profiles. These should not be edited.

Git commit id for Institutional configs.

hidden
type: string
default: master

Base directory for Institutional configs.

hidden
type: string
default: https://raw.githubusercontent.com/nf-core/configs/master

If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.

Institutional config name.

hidden
type: string

Institutional config description.

hidden
type: string

Institutional config contact information.

hidden
type: string

Institutional config URL link.

hidden
type: string

By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help. Specifying this option will tell the pipeline to show all parameters.

MultiQC report title. Printed as page header, used for filename if not otherwise specified.

hidden
type: string