funcscan: Parameters

Define where the pipeline should find input data and save output data.

Path to comma-separated file containing sample names and paths to corresponding FASTA files, and optional annotation files.

type: string

pattern: ^\S+\.csv$

Before running the pipeline, you will need to create a design file with information about the samples to be scanned by nf-core/funcscan, containing at a minimum sample names and paths to contigs. Use this parameter to specify its location. It has to be a two or four column comma-separated file with a header row (sample,fasta or sample,fasta,protein,gbk). See usage docs.
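
The header requirement above can be checked before launching a run. The following is a minimal sketch and not part of the pipeline: the helper name check_samplesheet and the example paths are hypothetical, and it only checks the file name pattern and the header row described above.

```python
import csv
import io
import re

# File name pattern and accepted headers, as stated in the description above.
CSV_NAME = re.compile(r"^\S+\.csv$")
VALID_HEADERS = (
    ["sample", "fasta"],
    ["sample", "fasta", "protein", "gbk"],
)


def check_samplesheet(filename, text):
    """Return the data rows as dictionaries, or raise ValueError.

    Hypothetical helper: checks only the file name and the header row.
    """
    if not CSV_NAME.match(filename):
        raise ValueError(f"{filename!r} does not match the .csv name pattern")
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames not in VALID_HEADERS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


example = "sample,fasta\nsample_1,/data/sample_1.fasta\n"
rows = check_samplesheet("samplesheet.csv", example)
print(rows[0]["sample"])
```

The check does not verify that the listed files exist; the pipeline's own validation remains authoritative.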

The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.

type: string

Email address for completion summary.

type: string

pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
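
To see which addresses the pattern above accepts, it can be tried out directly. This is an illustrative sketch only: the helper name is_accepted_email is hypothetical, and the pattern is copied from the description above.

```python
import re

# Pattern copied from the parameter description above.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_accepted_email(address):
    """Return True if the address matches the documented pattern."""
    return EMAIL_PATTERN.match(address) is not None


print(is_accepted_email("jane.doe@example.org"))
```

Note that the final group allows only two to five letters, so addresses with longer top-level domains do not match this pattern.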

MultiQC report title. Printed as page header, used for filename if not otherwise specified.

type: string

These parameters influence which workflow (ARG, AMP and/or BGC) to activate.

Activate antimicrobial peptide genes screening tools.

type: boolean

Activate antimicrobial resistance gene screening tools.

type: boolean

Activate biosynthetic gene cluster screening tools.

type: boolean

These options influence whether to activate the taxonomic classification of the input nucleotide sequences.

Activates the taxonomic classification of input nucleotide sequences.

type: boolean

This flag turns on the taxonomic classification of input nucleotide sequences. Taxonomic annotation should be turned on if the bacterial sources of the input metagenomes are unknown, as it can help identify the source of an AMP, BGC or ARG hit for laboratory experiments. This flag should be left off (the default) if the input nucleotide sequences represent a single known genome or if nf-core/mag was run beforehand. Turning on this flag slows down the pipeline and requires >8GB RAM. Due to the size of the resulting table, the final summary is in a zipped format.

Specifies the tool used for taxonomic classification.

type: string

This flag specifies which tool for taxonomic classification should be activated. At the moment only 'MMseqs2' is incorporated in the pipeline.

If MMseqs2 is chosen as taxonomic classification tool: Specifies if the output of all MMseqs2 subcommands shall be compressed.

type: boolean

To compress MMseqs2 output files, set this to true; otherwise leave it as false. Compressing output files can lead to errors when the output is empty. In that case, leave this parameter at its default value. More details can be found in the documentation (GitHub).

Modifies tool parameter(s):

mmseqs createdb --compressed <0|1>

mmseqs createtsv --compressed <0|1>

mmseqs databases --compressed <0|1>

mmseqs taxonomy --compressed <0|1>

These parameters influence the database to be used in classifying the taxonomy.

Specify a path to MMseqs2-formatted database.

type: string

Specify a path to a database that is prepared in MMseqs2 format as detailed in the documentation.

The contents of the directory should have files such as <dbname>.version and <dbname>.taxonomy in the top level.

Specify the label of the database to be used.

type: string

default: Kalamari

Specify which MMseqs2-formatted database to use to classify the input contigs. This can be a nucleotide or amino acid database that includes taxonomic classifications. For example, both GTDB (an amino acid database) and SILVA (a nucleotide database) are supported by MMseqs2. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs databases <name>

Specify whether the temporary files should be saved.

type: boolean

This flag saves the temporary files from downloading the database and formatting it in the MMseqs2 format into the output folder. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs databases: --remove-tmp-files

These parameters influence the taxonomic classification step.

Specify whether to save the temporary files.

type: boolean

This flag saves the temporary files from creating the taxonomy database and the final tsv file into the output folder. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --remove-tmp-files

Specify the alignment type between database and query.

type: integer

default: 2

Specify the type of alignment to be carried out between the query database and the reference MMseqs2 database. This can be set to '0' for automatic detection, '1' for amino acid alignment, '2' for translating the inputs and running the alignment on the translated sequences, '3' for nucleotide-based alignment and '4' for the translated nucleotide sequences alignment. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --search-type

Specify the taxonomic levels to display in the result table.

type: string

default: kingdom,phylum,class,order,family,genus,species

Specify the taxonomic ranks to include in the taxonomic lineage column in the final .tsv file. For example, 'kingdom,phylum,class,order,family,genus,species'. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --lca-ranks

Specify whether to include or remove the taxonomic lineage.

type: integer

default: 1

This flag specifies whether the taxonomic lineage should be included in the output .tsv file. The taxonomic lineage is obtained from the internal module of mmseqs/taxonomy that infers the last common ancestor to classify the taxonomy. A value of '0' writes no taxonomic lineage, a value of '1' adds a column with the full lineage names prefixed with an abbreviation of the lineage level, e.g. k_Prokaryotes;p_Bacteroidetes;c_....;o_....;f_....;g_....;s_...., while a value of '2' adds a column with the full NCBI taxids lineage, e.g. 1324;2345;4546;5345. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --tax-lineage
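
A lineage column written with the value '1' can be split back into its ranks. The sketch below assumes the 'prefix_name' entries separated by semicolons shown in the example above; the helper name parse_lineage is hypothetical.

```python
def parse_lineage(lineage):
    """Split a prefixed lineage string into a {rank_prefix: name} dict.

    Assumes the 'prefix_name;prefix_name' layout from the description above.
    """
    ranks = {}
    for entry in lineage.split(";"):
        prefix, _, name = entry.partition("_")
        # Skip empty or malformed entries instead of failing.
        if prefix and name:
            ranks[prefix] = name
    return ranks


print(parse_lineage("k_Prokaryotes;p_Bacteroidetes"))
```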

Specify the speed and sensitivity for taxonomy assignment.

type: number

default: 5

This flag specifies the speed and sensitivity of the taxonomic search. It stands for how many k-mers should be produced during the preliminary seeding stage. A very fast search requires a low value, e.g. '1.0', and a very sensitive search requires a high value, e.g. '7.0'. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: -s

Specify the ORF search sensitivity in the prefilter step.

type: number

default: 2

This flag specifies the sensitivity used for prefiltering the query ORF. Before the taxonomy-assigning step, MMseqs2 searches the predicted ORFs against the provided database. This value influences the speed with which the search is carried out. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --orf-filter-s

Specify the mode to assign the taxonomy.

type: integer

default: 3

This flag specifies the strategy used for assigning the last common ancestor (LCA). MMseqs2 assigns taxonomy based on an accelerated approximation of the 2bLCA protocol and uses the value of '3'. In this mode, the taxonomic assignment is based not only on usual alignment parameters but also considers the taxonomic classification of the LCA. When the value '4' is used the LCA is assigned based on all the equal scoring top hits. If the value '1' is used the LCA assignment is disregarded and the taxonomic assignment is based on usual alignment parameters like E-value and coverage. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --lca-mode

Specify the weights of the taxonomic assignment.

type: integer

default: 1

This flag assigns the mode value with which the weights are computed. The value of '0' stands for uniform weights of taxonomy assignments, the value of '1' uses the minus log E-value and '2' the actual score. More details can be found in the documentation.

Modifies tool parameter(s):

mmseqs taxonomy: --vote-mode
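
The defaults listed for the taxonomic classification step can be collected in one place. The following is a rough sketch of how the documented defaults map onto mmseqs taxonomy options; the helper name taxonomy_options is hypothetical, and it is not how the pipeline itself assembles the command.

```python
# Defaults as listed in the parameter descriptions above.
TAXONOMY_DEFAULTS = {
    "--search-type": 2,
    "--lca-ranks": "kingdom,phylum,class,order,family,genus,species",
    "--tax-lineage": 1,
    "-s": 5,
    "--orf-filter-s": 2,
    "--lca-mode": 3,
    "--vote-mode": 1,
}


def taxonomy_options(overrides=None):
    """Return the option list for `mmseqs taxonomy` as strings.

    Hypothetical helper; option names not listed above are rejected.
    """
    options = dict(TAXONOMY_DEFAULTS)
    for flag, value in (overrides or {}).items():
        if flag not in options:
            raise KeyError(f"unknown option: {flag}")
        options[flag] = value
    args = []
    for flag, value in options.items():
        args.extend([flag, str(value)])
    return args


# Example: a more sensitive (and slower) search.
print(" ".join(taxonomy_options({"-s": 7.0})))
```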

These options influence the generation of annotation files required for downstream steps in ARG, AMP, and BGC workflows.

Specify which annotation tool to use for some downstream tools.

type: string

Specify whether to save gene annotations in the results directory.

type: boolean

BAKTA is a tool developed to annotate bacterial genomes and plasmids from both isolates and MAGs. More info: https://github.com/oschwengers/bakta

Specify a path to a local copy of a BAKTA database.

type: string

If a local copy of a BAKTA database exists, specify the path to that database which is prepared in a BAKTA format. Otherwise this will be downloaded for you.

The contents of the directory should have files such as *.dmnd in the top level.

Download full or light version of the Bakta database if not supplying own database.

type: string

If you want the pipeline to download the Bakta database for you, you can choose between the full (33.1 GB) and light (1.3 GB) version. The full version is generally recommended for best annotation results, because it contains all of these:

UPS: unique protein sequences identified via length and MD5 hash digests (100% coverage & 100% sequence identity)
IPS: identical protein sequences comprising seeds of UniProt's UniRef100 protein sequence clusters
PSC: protein sequences clusters comprising seeds of UniProt's UniRef90 protein sequence clusters
PSCC: protein sequences clusters of clusters comprising annotations of UniProt's UniRef50 protein sequence clusters

If download bandwidth, storage, memory, or run duration requirements become an issue, go for the light version (which only contains PSCCs) by modifying the annotation_bakta_db_downloadtype flag.

More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA_DBDOWNLOAD: --type

Use the default genome-length optimised mode (rather than the metagenome mode).

type: boolean

By default, Bakta's --meta mode is used in the pipeline to improve the gene prediction of highly fragmented metagenomes.

By specifying this parameter Bakta will instead use its default mode that is optimised for singular 'complete' genome sequences.

More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --meta

Specify the minimum contig size.

type: integer

default: 1

Specify the minimum contig size that would be annotated by BAKTA. If run with '--annotation_bakta_compliant', the minimum contig length must be set to 200. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --min-contig-length

Specify the genetic code translation table.

type: integer

default: 11

Specify the genetic code translation table used for translation of nucleotides to amino acids. All possible genetic codes (1-25) used for gene annotation can be found here. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --translation-table

Specify the type of bacteria to be annotated to detect signaling peptides.

type: string

Specify the type of bacteria expected in the input dataset for correct annotation of the signal peptide predictions. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --gram

Specify that all contigs are complete replicons.

type: boolean

This flag expects contigs that make up complete chromosomes and/or plasmids. By calling it, the user ensures that the contigs are complete replicons. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --complete

Changes the original contig headers.

type: boolean

This flag specifies that the contig headers should be rewritten. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --keep-contig-headers

Clean the result annotations to standardise them to Genbank/ENA conventions.

type: boolean

The resulting annotations are cleaned up to standardise them to Genbank/ENA/DDBJ conventions. CDS without any attributed hits and those without gene symbols or product descriptions different from hypothetical will be marked as 'hypothetical'. When activated the --min-contig-length will be set to 200. More info can be found here.

Modifies tool parameter(s):

BAKTA: --compliant

Activate tRNA detection & annotation.

type: boolean

This flag activates tRNAscan-SE 2.0 that predicts tRNA genes. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-trna

Activate tmRNA detection & annotation.

type: boolean

This flag activates Aragorn that predicts tmRNA genes. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-tmrna

Activate rRNA detection & annotation.

type: boolean

This flag activates Infernal vs. Rfam rRNA covariance models that predict rRNA genes. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-rrna

Activate ncRNA detection & annotation.

type: boolean

This flag activates Infernal vs. Rfam ncRNA covariance models that predict ncRNA genes. BAKTA distinguishes between ncRNA genes and (cis-regulatory) regions to enable the distinction of feature overlap detection. This includes distinguishing between ncRNA gene types: sRNA, antisense, ribozyme and antitoxin. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-ncrna

Activate ncRNA region detection & annotation.

type: boolean

This flag activates Infernal vs. Rfam ncRNA covariance models that predict ncRNA cis-regulatory regions. BAKTA distinguishes between ncRNA genes and (cis-regulatory) regions to enable the distinction of feature overlap detection. This includes distinguishing between ncRNA (cis-regulatory) region types: riboswitch, thermoregulator, leader and frameshift element. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-ncrna-region

Activate CRISPR array detection & annotation.

type: boolean

This flag activates PILER-CR that predicts CRISPR arrays. More details can be found in the documentation.

Modifies tool parameter(s):

BAKTA: --skip-crispr

Skip CDS detection & annotation.

type: boolean

This flag skips the CDS prediction performed by PYRODIGAL, which predicts CDS separately for complete replicons and incomplete contigs. For more information on how BAKTA predicts CDS please refer to the BAKTA documentation.

Modifies tool parameter(s):

BAKTA: --skip-cds

Activate pseudogene detection & annotation.

type: boolean

This flag activates the search for reference protein sequence clusters (PSCs) using 'hypothetical' CDS as seed sequences, then aligns the translated PSCs against up-/downstream-elongated CDS regions. More details can be found in the BAKTA documentation.

Modifies tool parameter(s):

BAKTA: --skip-pseudo

Skip sORF detection & annotation.

type: boolean

Skip the prediction of sORFs, i.e. amino acid stretches shorter than 30 aa. For more info please refer to the BAKTA documentation. All sORFs without gene symbols or product descriptions different from hypothetical will be discarded, while only those identified hits exhibiting proper gene symbols or product descriptions different from hypothetical will still be included in the final annotation.

Modifies tool parameter(s):

BAKTA: --skip-sorf

Activate gap detection & annotation.

type: boolean

Activates any gene annotation found within contig assembly gaps. More details can be found in the BAKTA documentation.

Modifies tool parameter(s):

BAKTA: --skip-gap

Activate oriC/oriT detection & annotation.

type: boolean

Activates the BAKTA search for oriC/oriT genes by comparing results from Blast+ (generated by cov=0.8, id=0.8) and the MOB-suite of oriT & DoriC oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic. Region edges may be fuzzy. For more info please refer to the BAKTA documentation.

Modifies tool parameter(s):

BAKTA: --skip-ori

Activate generation of circular genome plots.

type: boolean

Activate this flag to generate genome plots (might be memory-intensive).

Modifies tool parameter(s):

BAKTA: --skip-plot

Supply a path to an HMM file of trusted hidden Markov models in HMMER format for CDS annotation.

type: string

Bakta accepts user-provided trusted HMMs via --hmms in HMMER's text format. If set, Bakta will adhere to the trusted cutoff specified in the HMM header. In addition, a max. evalue threshold of 1e-6 is applied. For more info please refer to the BAKTA documentation.

Modifies tool parameter(s):

BAKTA: --hmms

Prokka annotates genomic sequences belonging to bacterial, archaeal and viral genomes. More info: https://github.com/tseemann/prokka

Use the default genome-length optimised mode (rather than the metagenome mode).

type: boolean

By default, Prokka's --metagenome mode is used in the pipeline to improve the gene prediction of highly fragmented metagenomes.

By specifying this parameter Prokka will instead use its default mode that is optimised for singular 'complete' genome sequences.

For more information, please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --metagenome

Suppress the default clean-up of the gene annotations.

type: boolean

By default, annotation in Prokka is carried out by alignment to other proteins in its database, or the databases the user provides via the tool's --proteins flag. The resulting annotations are then cleaned up to standardise them to Genbank/ENA conventions. 'Vague names' are set to 'hypothetical proteins', 'possible/probable/predicted' are set to 'putative' and 'EC/CPG and locus tag ids' are removed.

By supplying this flag you stop such clean up leaving the original annotation names.

For more information please check the Prokka documentation.

This flag suppresses this default behavior of Prokka (which is to perform the cleaning).

Modifies tool parameter(s):

Prokka: --rawproduct

Specify the kingdom that the input represents.

type: string

Specifies the kingdom that the input sample is derived from and/or that you wish to screen for.

⚠️ Prokka cannot annotate Eukaryotes.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --kingdom

Specify the translation table used to annotate the sequences.

type: integer

default: 11

Specify the translation table used to annotate the sequences. All possible genetic codes (1-25) used for gene annotation can be found here. This flag is required if the flag --kingdom is assigned.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --gcode

Minimum contig size required for annotation (bp).

type: integer

default: 1

Specify the minimum contig lengths to carry out annotations on. The Prokka developers recommend that this should be ≥ 200 bp, if you plan to submit such annotations to NCBI.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --mincontiglen

E-value cut-off.

type: number

default: 0.000001

Specify the maximum E-value used for filtering the alignment hits.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --evalue

Set the assigned minimum coverage.

type: integer

default: 80

Specify the minimum coverage percent of the annotated genome. This must be set between 0-100.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --coverage

Allow transfer RNA (tRNA) to overlap coding sequences (CDS).

type: boolean

Allow transfer RNA (tRNA) to overlap coding sequences (CDS). Transfer RNAs are short stretches of nucleotide sequences that link mRNA and the amino acid sequence of proteins. Their presence helps in the annotation of the sequences, because each tRNA can only be attached to one type of amino acid.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --cdsrnaolap

Use RNAmmer for rRNA prediction.

type: boolean

Activates RNAmmer instead of the Prokka default Barrnap for rRNA prediction during the annotation process. RNAmmer classifies ribosomal RNA genes in genome sequences by using two levels of Hidden Markov Models. Barrnap uses the nhmmer tool that includes HMMER 3.1 for HMM searching in RNA:DNA style.

For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --rnammer

Force contig name to Genbank/ENA/DDBJ naming rules.

type: boolean

default: true

Force the contig headers to conform to the Genbank/ENA/DDBJ contig header standards. This is activated in combination with --centre [X] when contig headers supplied by the user are non-conforming and therefore need to be renamed before Prokka can start annotation. This flag activates --genes --mincontiglen 200. For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --compliant

Add the gene features for each CDS hit.

type: boolean

For every CDS annotated, this flag adds the gene that encodes for that CDS region. For more information please check the Prokka documentation.

Modifies tool parameter(s):

Prokka: --addgenes

Retains contig names.

type: boolean

This parameter allows Prokka to retain the original contig names by activating PROKKA's --force flag. If this parameter is set to false it activates PROKKA's flags --locus-tag PROKKA --centre CENTER so the locus tags (contig names) will be PROKKA_# and the center tag will be CENTER. By default PROKKA changes contig headers to avoid errors that might arise due to long contig headers, so this must be turned on if the user has short contig names that should be retained by PROKKA.

Modifies tool parameter(s):

Prokka: --locus-tag PROKKA --centre CENTER

Prokka: --force

Prodigal is a protein-coding gene prediction tool developed to run on bacterial and archaeal genomes. More info: https://github.com/hyattpd/prodigal/wiki

Specify whether to use Prodigal's single-genome mode for long sequences.

type: boolean

By default Prodigal runs in 'single genome' mode that requires sequence lengths to be equal to or longer than 20000 characters.

However, more fragmented reads from MAGs often result in contigs shorter than this. Therefore, nf-core/funcscan will run with the meta mode by default. Providing this parameter overrides this and runs Prodigal in single genome mode again.

For more information check the Prodigal documentation.

Modifies tool parameter(s):

PRODIGAL: -p
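
Whether single genome mode is worth considering depends on the contig lengths in the input. The sketch below applies the 20000-character threshold mentioned above to FASTA text; the helper names contig_lengths and suggest_mode are hypothetical, and the rule is a simple heuristic rather than something the pipeline does.

```python
def contig_lengths(fasta_text):
    """Return {contig_header: sequence_length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def suggest_mode(fasta_text, threshold=20000):
    """Suggest 'single' only if every contig reaches the threshold."""
    lengths = contig_lengths(fasta_text)
    if lengths and all(n >= threshold for n in lengths.values()):
        return "single"
    return "meta"


print(suggest_mode(">contig_1\nACGTACGT\n"))
```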

Does not allow partial genes on contig edges.

type: boolean

Suppresses partial genes from being on contig edge, resulting in closed ends. Should only be activated for genomes where it is sure the first and last bases of the sequence(s) do not fall inside a gene. Run together with -p normal (former -p single).

For more information check the Prodigal documentation.

Modifies tool parameter(s):

PRODIGAL: -c

Specifies the translation table used for gene annotation.

type: integer

default: 11

Specifies which translation table should be used for sequence annotation. All possible genetic code translation tables can be found here. The default is set at 11, which is used for standard Bacteria/Archaea.

For more information check the Prodigal documentation.

Modifies tool parameter(s):

PRODIGAL: -g

Forces Prodigal to scan for motifs.

type: boolean

Forces PRODIGAL to a full scan for motifs rather than activating the Shine-Dalgarno RBS finder, the default scanner for PRODIGAL to train for motifs.

For more information check the Prodigal documentation.

Modifies tool parameter(s):

PRODIGAL: -n

Pyrodigal is a resource-optimized wrapper around Prodigal, producing protein-coding gene predictions of bacterial and archaeal genomes. Read more at the Pyrodigal GitHub repository (https://github.com/althonos/pyrodigal) or its documentation (https://pyrodigal.readthedocs.io).

Specify whether to use Pyrodigal's single-genome mode for long sequences.

type: boolean

By default Pyrodigal runs in 'single genome' mode that requires sequence lengths to be equal to or longer than 20000 characters.

However, more fragmented reads from MAGs often result in contigs shorter than this. Therefore, nf-core/funcscan will run with the meta mode by default, but providing this parameter overrides this and runs Pyrodigal in single genome mode again.

For more information check the Pyrodigal documentation.

Modifies tool parameter(s):

PYRODIGAL: -p

Does not allow partial genes on contig edges.

type: boolean

Suppresses partial genes from being on contig edge, resulting in closed ends. Should only be activated for genomes where it is sure the first and last bases of the sequence(s) do not fall inside a gene. Run together with -p single.

For more information check the Pyrodigal documentation.

Modifies tool parameter(s):

PYRODIGAL: -c

Specifies the translation table used for gene annotation.

type: integer

default: 11

Specifies which translation table should be used for sequence annotation. All possible genetic code translation tables can be found here. The default is set at 11, which is used for standard Bacteria/Archaea.

For more information check the Pyrodigal documentation.

Modifies tool parameter(s):

PYRODIGAL: -g

Forces Pyrodigal to scan for motifs.

type: boolean

Forces Pyrodigal to a full scan for motifs rather than activating the Shine-Dalgarno RBS finder, the default scanner for Pyrodigal to train for motifs.

For more information check the Pyrodigal documentation.

Modifies tool parameter(s):

PYRODIGAL: -n

This forces Pyrodigal to append asterisks (*) as stop codon indicators. Do not use when running AMP workflow.

type: boolean

Some downstream tools like AMPlify cannot process sequences containing non-sequence characters like the stop codon indicator *. Thus, this flag is deactivated by default. Activate this flag to revert the behaviour and have Pyrodigal append * as stop codon indicator to annotated sequences.

For more information check the Pyrodigal documentation.

Modifies tool parameter(s):

PYRODIGAL: --no-stop-codon
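
If protein sequences that already carry the * indicator need to be passed to a tool that rejects it, the indicator can be removed afterwards. This is a minimal sketch with a hypothetical helper name, not a step of the pipeline.

```python
def strip_stop_indicator(protein):
    """Remove trailing '*' stop codon indicators from a protein sequence."""
    return protein.rstrip("*")


print(strip_stop_indicator("MKTLLVAGAL*"))
```

Only trailing asterisks are removed; an asterisk inside a sequence is left in place so that it stays visible.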

Functionally annotates all annotated coding regions.

Activates the functional annotation of annotated coding regions to provide more information about the coding regions classified.

type: boolean

Activates the annotation of annotated coding regions.

Specifies the tool used for further protein annotation.

type: string

This flag specifies which tool for protein annotation should be activated. At the moment only InterProScan is incorporated in the pipeline. This annotates the locus tags to protein and domain levels according to the InterPro databases.

More details can be found in the tool documentation.

Change the database version used for annotation.

type: string

default: https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/interproscan-5.72-103.0-64-bit.tar.gz

This allows the user to change the InterProScan database version that the pipeline will download for you automatically. To instead use a pre-downloaded database, please supply its path to --protein_annotation_interproscan_db. Changing this URL allows for the use of the latest database release. By default this is set to https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/interproscan-5.72-103.0-64-bit.tar.gz.

Path to pre-downloaded InterProScan database.

type: string

Use this to supply the path to a pre-downloaded InterProScan database. This can be any unzipped InterProScan version.

For more details on where to find different InterProScan databases see tool documentation.

Assigns the database(s) to be used to annotate the coding regions.

type: string

default: PANTHER,ProSiteProfiles,ProSitePatterns,Pfam

pattern: ^\w+(,\w+)*

A comma-separated string specifying the database(s) to be used to annotate the coding regions annotated during the contig annotation workflow of the pipeline. By default these include PANTHER,ProSiteProfiles,ProSitePatterns,Pfam.

PANTHER (Protein ANalysis THrough Evolutionary Relationships): genes classified by their functions, using published scientific experimental evidence and evolutionary relationships.
PROSITE: protein domains, families, functional sites and specific patterns and profiles to identify them.
PFAM: protein families, represented by multiple sequence alignments and hidden Markov models (HMMs).

These databases were chosen based on the AMP workflow; only with these databases do we guarantee the integration of the results into the AMPcombi final summary.

NOTE: Currently, no integration of the results is implemented for the BGC and ARG final summary tables.

For more information about all possible databases see the tool documentation.

Modifies tool parameter(s):

InterProScan: --applications
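
The comma-separated value can be checked and split before it is handed to the pipeline. The sketch below uses the pattern from the description; the helper name parse_applications is hypothetical, and matching the whole string is an assumption that is stricter than the pattern as written, which has no end anchor.

```python
import re

# Pattern copied from the parameter description above.
APPLICATIONS_PATTERN = re.compile(r"^\w+(,\w+)*")


def parse_applications(value):
    """Return the database names, or raise ValueError for a malformed list."""
    # fullmatch requires the whole string to fit (assumption, see lead-in).
    if APPLICATIONS_PATTERN.fullmatch(value) is None:
        raise ValueError(f"not a comma-separated list of names: {value!r}")
    return value.split(",")


print(parse_applications("PANTHER,ProSiteProfiles,ProSitePatterns,Pfam"))
```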

Pre-calculates residue mutual matches.

type: boolean

This increases the speed of functional annotation with InterProScan by pre-calculating matches found in the UniProtKB, thereby identifying unique matches in the query sequences for faster annotation. By default this is turned off.

For more information about this flag see the tool documentation.

Modifies tool parameter(s):

InterProScan: --disable-precalc

General options for database downloading

Specify whether to save pipeline-downloaded databases in your results directory.

type: boolean

While nf-core/funcscan can download databases for you, often these are very large and can significantly slow down pipeline runtime if the databases have to be downloaded every run.

Specifying --save_db will save the pipeline-downloaded databases in your results directory. This applies to: AMRFinderPlus, antiSMASH, Bakta, CARD (for RGI), DeepARG, DeepBGC, and DRAMP (for AMPcombi2).

You can then move the resulting directories/files to a central cache directory of your choice for re-use in the future.

If you do not specify this flag, the database files will remain in your work/ directory and will be deleted if cleanup = true is specified in your config, or if you run nextflow clean.

Antimicrobial Peptide detection using a deep learning model. More info: https://github.com/bcgsc/AMPlify

Skip AMPlify during AMP screening.

type: boolean

Antimicrobial Peptide detection using machine learning. ampir uses a supervised statistical machine learning approach to predict AMPs. It incorporates two support vector machine classification models, 'precursor' and 'mature' that have been trained on publicly available antimicrobial peptide data. More info: https://github.com/Legana/ampir

Skip ampir during AMP screening.

type: boolean

Specify which machine learning classification model to use.

type: string

Ampir uses a supervised statistical machine learning approach to predict AMPs. It incorporates two support vector machine classification models, "precursor" and "mature".

The precursor model is better for predicted proteins from a translated transcriptome or translated gene models. The alternative model (mature) is best suited for AMP sequences after post-translational processing, typically from direct proteomic sequencing.

More information can be found in the ampir documentation.

Modifies tool parameter(s):

AMPir: model =

Specify minimum protein length for prediction calculation.

type: integer

default: 10

Filters results by minimum protein length. Note that amino acid sequences that are shorter than 10 amino acids long and/or contain anything other than the standard 20 amino acids are not evaluated and will contain an NA as their "prob_AMP" value.

More information can be found in the ampir documentation.

Modifies tool parameter(s):

AMPir parameter: min_length in the calculate_features() function
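The eligibility rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not ampir's own code (ampir is an R package); the function name is_evaluable and the example peptides are hypothetical.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_evaluable(sequence, min_length=10):
    """Return True if a sequence would be scored rather than reported as NA.

    Assumption: a sequence is scored only if it has at least `min_length`
    residues and consists solely of the 20 standard amino acids.
    """
    seq = sequence.strip().upper()
    long_enough = len(seq) >= min_length
    standard_only = set(seq) <= STANDARD_AMINO_ACIDS
    return long_enough and standard_only


# Hypothetical example peptides.
candidates = {
    "pep1": "GIGKFLHSAKKFGKAFVGEIMNS",  # long enough, standard residues only
    "pep2": "GIGKFLH",                  # shorter than min_length
    "pep3": "GIGKFLHSAKXFGKAFVGEIMNS",  # contains the non-standard code 'X'
}
evaluable = {name: is_evaluable(seq) for name, seq in candidates.items()}
```

Under these assumptions only pep1 would receive a probability; the other two would carry NA in the prob_AMP column.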

Antimicrobial Peptide detection based on predefined HMM models. This tool implements methods using probabilistic models called profile hidden Markov models (profile HMMs) to search against a sequence database. More info: http://eddylab.org/software/hmmer/Userguide.pdf

Run hmmsearch during AMP screening.

type: boolean

hmmsearch is not run by default because HMM model files must be provided by the user with the flag amp_hmmsearch_models.

Specify path to the AMP hmm model file(s) to search against. Must have quotes if wildcard used.

type: string

hmmsearch performs biosequence analysis using profile hidden Markov models. The models are stored in .hmm files, which are specified with this parameter,

e.g.

--amp_hmmsearch_models '/<path>/<to>/<models>/*.hmm'

You must wrap the path in quotes if you use a wildcard, so that the wildcard is expanded by Nextflow rather than by bash. When using quotes, the absolute path to the HMM file(s) has to be given.

For more information check the HMMER documentation.

Saves a multiple alignment of all significant hits to a file.

type: boolean

Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to a file.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: -A

Save a simple tabular file summarising the per-target output.

type: boolean

Save a simple tabular (space-delimited) file summarizing the per-target output, with one data line per homologous target sequence found.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: --tblout

Save a simple tabular file summarising the per-domain output.

type: boolean

Save a simple tabular (space-delimited) file summarizing the per-domain output, with one data line per homologous domain detected in a query sequence for each homologous model.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: --domtblout

Antimicrobial peptide detection from metagenomes. More info: https://github.com/BigDataBiology/macrel

Skip Macrel during AMP screening.

type: boolean

Antimicrobial peptides parsing, filtering, and annotating submodule of AMPcombi2. More info: https://github.com/Darcy220606/AMPcombi

The name of the database used to classify the AMPs.

type: string

AMPcombi can use one of three AMP databases to classify the recovered AMPs:

DRAMP database: Only general AMPs are downloaded, and entries containing non-amino-acid residues in their sequence are filtered out.
APD: Only experimentally validated AMPs are present.
UniRef100: A more general protein dataset that includes curated and non-curated AMPs. Helpful for identifying the clusters to remove any potential false positives. Beware: if the AMPcombi thresholds are not strict enough, alignment against this database can take a long time.

By default this is set to 'DRAMP'. Other valid options include 'APD' or 'UniRef100'.

For more information check the AMPcombi documentation.

The path to the folder containing the reference database files.

type: string

The path to the folder containing the reference database files (*.fasta and *.tsv): a FASTA file and the corresponding table with structural, functional and, if reported, taxonomic classifications. AMPcombi will then generate the corresponding mmseqs2 directory, in which all binary files are prepared for the downstream alignment of the recovered AMPs with MMseqs2. These can also be provided by the user by setting up an MMseqs2-compatible database using mmseqs createdb *.fasta in a directory called mmseqs2.

Example file structure for the reference database supplied by the user:

amp_DRAMP_database/
├── general_amps_2024_11_13.fasta
├── general_amps_2024_11_13.txt
└── mmseqs2
    ├── ref_DB
    ├── ref_DB.dbtype
    ├── ref_DB_h
    ├── ref_DB_h.dbtype
    ├── ref_DB_h.index
    ├── ref_DB.index
    ├── ref_DB.lookup
    └── ref_DB.source

For more information check the AMPcombi documentation: https://ampcombi.readthedocs.io/en/main/usage.html#parse-tables

Specifies the prediction tools' cut-offs.

type: number

default: 0.6

This converts any prediction score below this cut-off to '0'. By doing so only values above this value will be used in the final AMPcombi2 summary table. This applies to all prediction tools except for hmmsearch, which uses e-value. To change the e-value cut-off use instead --amp_ampcombi_parsetables_hmmevalue.

Modifies tool parameter(s):

AMPCOMBI: --amp_cutoff
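As a rough illustration of the cut-off behaviour described above, the following Python sketch zeroes every prediction score below the cut-off. It is not AMPcombi's implementation; the function name, the per-tool dictionary layout and the scores are hypothetical, and treating a score exactly equal to the cut-off as retained is an assumption.

```python
def apply_cutoff(scores, cutoff=0.6):
    """Set every prediction score below `cutoff` to 0.0.

    Assumption: a score exactly equal to the cut-off is kept, since the
    text only says that scores *below* the cut-off are converted to 0.
    """
    return {tool: (score if score >= cutoff else 0.0) for tool, score in scores.items()}


# Hypothetical scores for one AMP candidate from three prediction tools.
hit_scores = {"ampir": 0.91, "amplify": 0.42, "macrel": 0.6}
filtered = apply_cutoff(hit_scores)
```

With the default of 0.6, only the amplify score in this made-up example is set to 0.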

Filter out all amino acid fragments shorter than this number.

type: integer

default: 120

Any AMP hit that does not satisfy this length cut-off will be removed from the final AMPcombi2 summary table.

Modifies tool parameter(s):

AMPCOMBI: --aminoacid_length

Remove all DRAMP annotations that have an e-value greater than this value.

type: number

default: 5

This e-value is used as a cut-off for the annotations from the internal Diamond alignment step (against the DRAMP database by default). An e-value above this cut-off removes only the DRAMP classification, not the entire hit.

Modifies tool parameter(s):

AMPCOMBI: --db_evalue

Retain HMM hits that have an e-value lower than this.

type: number

default: 0.06

Only HMM hits with an e-value lower than this cut-off are used in the final AMPcombi2 summary table. To change the prediction score cut-off for all other AMP prediction tools, use instead --amp_cutoff.

Modifies tool parameter(s):

AMPCOMBI: --hmm_evalue

Assign the number of codons used to look for stop codons, upstream and downstream of the AMP hit.

type: integer

default: 60

This assigns the length of the window size required to look for stop codons downstream and upstream of the CDS hits. In the default case, it looks 60 codons downstream and upstream of the AMP hit and reports whether a stop codon was found.

Modifies tool parameter(s):

AMPCOMBI: --window_size_stop_codon
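The window search can be pictured with the following Python sketch. It is a simplified illustration, not AMPcombi's implementation: it assumes a forward-strand CDS with 0-based, end-exclusive coordinates and only inspects codons that are in frame with the CDS. All function names and the toy contig are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(region):
    """Check the in-frame codons of a nucleotide string for a stop codon."""
    return any(region[i:i + 3] in STOP_CODONS for i in range(0, len(region) - 2, 3))


def stop_codons_around_cds(contig, cds_start, cds_end, window_codons=60):
    """Report (upstream_found, downstream_found) for a forward-strand CDS."""
    window_nt = 3 * window_codons
    up_start = max(0, cds_start - window_nt)
    # If the window was clipped at the contig start, shift it so that it
    # stays in frame with the CDS.
    up_start += (cds_start - up_start) % 3
    upstream = contig[up_start:cds_start]
    downstream = contig[cds_end:cds_end + window_nt]
    return has_stop_codon(upstream), has_stop_codon(downstream)


# Toy contig: stop codon, spacer, 9 nt CDS, spacer, stop codon.
contig = "TAA" + "GCT" * 5 + "ATGAAAGGG" + "GCT" * 2 + "TGA"
found = stop_codons_around_cds(contig, cds_start=18, cds_end=27)
found_narrow = stop_codons_around_cds(contig, cds_start=18, cds_end=27, window_codons=2)
```

In this toy example the default window reaches both stop codons, whereas a two-codon window reaches neither, which shows how the window size changes what is reported.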

Assign the number of CDSs upstream and downstream of the AMP to look for a transport protein.

type: integer

default: 11

This assigns the length of the window size required to look for a 'transporter' (e.g. ABC transporter) downstream and upstream of the CDS hits. This is done on CDS classification level.

Modifies tool parameter(s):

AMPCOMBI: --window_size_transporter

Remove hits that have no stop codon upstream and downstream of the AMP.

type: boolean

Removes any hits/CDSs for which no stop codon is found in the window downstream or upstream of the CDS, as assigned by --amp_ampcombi_parsetables_windowstopcodon. We recommend turning this on if the results will be used in downstream experiments.

Modifies tool parameter(s):

AMPCOMBI: --remove_stop_codons

Assigns the file extension used to identify AMPIR output.

type: string

default: .ampir.tsv

Assigns the file extension of the input files to allow AMPcombi2 to identify the tool output from the list of input files.

Modifies tool parameter(s):

AMPCOMBI: --ampir_file

Assigns the file extension used to identify AMPLIFY output.

type: string

default: .amplify.tsv

Assigns the file extension of the input files to allow AMPcombi2 to identify the tool output from the list of input files.

Modifies tool parameter(s):

AMPCOMBI: --amplify_file

Assigns the file extension used to identify MACREL output.

type: string

default: .macrel.prediction

Assigns the file extension of the input files to allow AMPcombi2 to identify the tool output from the list of input files.

Modifies tool parameter(s):

AMPCOMBI: --macrel_file

Assigns the file extension used to identify HMMER/HMMSEARCH output.

type: string

default: .hmmer_hmmsearch.txt

Assigns the file extension of the input files to allow AMPcombi2 to identify the tool output from the list of input files.

Modifies tool parameter(s):

AMPCOMBI: --hmmsearch_file

Clusters the AMP candidates identified with AMPcombi. More info: https://github.com/Darcy220606/AMPcombi

MMseqs2 coverage mode.

type: number

This assigns the coverage mode to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More details can be found in the MMseqs2 documentation.

Modifies tool parameter(s):

AMPCOMBI: --cluster_cov_mode

MMseqs2 alignment sensitivity.

type: number

default: 4

This assigns the sensitivity of alignment to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the MMseqs2 documentation.

Modifies tool parameter(s):

AMPCOMBI: --cluster_sensitivity

Remove clusters that don't have more AMP hits than this number.

type: integer

Removes all clusters with this number of AMP hits or fewer.

Modifies tool parameter(s):

AMPCOMBI: --cluster_min_member
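A minimal Python sketch of the filter described above, assuming clusters are held as a mapping from a cluster ID to its member hits. The data structure, the names and the default of 0 used here are hypothetical and only illustrate the "this number or fewer" rule.

```python
def drop_small_clusters(clusters, min_member=0):
    """Keep only clusters with strictly more than `min_member` AMP hits."""
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if len(members) > min_member
    }


# Hypothetical clustering result: cluster ID -> member AMP hits.
clusters = {
    "c1": ["amp1", "amp2", "amp3"],
    "c2": ["amp4"],
    "c3": ["amp5", "amp6"],
}
kept_ids = sorted(drop_small_clusters(clusters, min_member=1))
```

With a value of 1 the single-member cluster c2 is dropped, while c1 and c3 are kept.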

MMseqs2 clustering mode.

type: number

default: 1

This assigns the cluster mode to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the MMseqs2 documentation.

Modifies tool parameter(s):

AMPCOMBI: --cluster_mode

MMseqs2 alignment coverage.

type: number

default: 0.8

This assigns the coverage to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the MMseqs2 documentation.

Modifies tool parameter(s):

AMPCOMBI: --cluster_coverage

MMseqs2 sequence identity.

type: number

default: 0.4

This assigns the cluster sequence identity to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the MMseqs2 documentation.

Modifies tool parameter(s):

AMPCOMBI: --cluster_seq_id

Remove any hits that form a single member cluster.

type: boolean

Removes any AMP hits that form a single-member cluster.

Modifies tool parameter(s):

AMPCOMBI: --cluster_remove_singletons

Antimicrobial resistance gene detection based on NCBI's curated Reference Gene Database and curated collection of Hidden Markov Models. It identifies AMR genes, resistance-associated point mutations, and select other classes of genes using protein annotations and/or assembled nucleotide sequences. More info: https://github.com/ncbi/amr/wiki

Skip AMRFinderPlus during the ARG screening.

type: boolean

Specify the path to a local version of the AMRFinderPlus database.

type: string

Specify the path to a local version of the AMRFinderPlus database.

You must give the latest directory to the pipeline, and the contents of the directory should include files such as *.ndb, *.nhr, versions.txt etc. in the top level.

If no input is given, the pipeline will download the database for you.

See the nf-core/funcscan usage documentation for more information.

Modifies tool parameter(s):

AMRFinderPlus: --database

Minimum percent identity to reference sequence.

type: number

default: -1

Specify the minimum amino-acid identity to the reference protein, or nucleotide identity to the nucleotide reference, that a hit must have if a BLAST alignment (based on methods BLAST or PARTIAL) was detected, otherwise NA.

If you specify -1, this means use a curated threshold if it exists and 0.9 otherwise.

Setting this value to something other than -1 will override any curated similarity cutoffs. For BLAST: alignment is > 90% of length and > 90% identity to a protein in the AMRFinderPlus database. For PARTIAL: alignment is > 50% of length, but < 90% of length and > 90% identity to the reference, and does not end at a contig boundary.

For more information check the AMRFinderPlus documentation.

Modifies tool parameter(s):

AMRFinderPlus: --ident_min
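The meaning of the -1 default can be summarised in a short Python sketch. It only restates the rule given above and is not AMRFinderPlus code; the function name and the curated_threshold argument are hypothetical.

```python
def resolve_ident_min(ident_min=-1, curated_threshold=None):
    """Return the identity threshold that would effectively apply.

    -1 means: use the curated threshold if one exists, otherwise 0.9.
    Any other value overrides the curated threshold.
    """
    if ident_min == -1:
        return curated_threshold if curated_threshold is not None else 0.9
    return ident_min


with_curated = resolve_ident_min(-1, curated_threshold=0.95)
without_curated = resolve_ident_min(-1)
overridden = resolve_ident_min(0.8, curated_threshold=0.95)
```

The third call shows the override: a user-supplied 0.8 replaces a (made-up) curated value of 0.95.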

Minimum coverage of the reference protein.

type: number

default: 0.5

Minimum proportion of reference gene covered for a BLAST-based hit analysis if a BLAST alignment was detected, otherwise NA.

For BLAST-based hit analysis: alignment is > 90% of length and > 90% identity to a protein in the AMRFinderPlus database or for PARTIAL: alignment is > 50% of length, but < 90% of length and > 90% identity to the reference, and does not end at a contig boundary.

For more information check the AMRFinderPlus documentation.

Modifies tool parameter(s):

AMRFinderPlus: --coverage_min
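The BLAST and PARTIAL criteria quoted above can be written as a small decision function. This Python sketch covers only those two methods exactly as worded in this section, with proportions between 0 and 1; it is not how AMRFinderPlus itself is implemented, and the function and argument names are hypothetical.

```python
def classify_alignment(coverage, identity, ends_at_contig_boundary=False):
    """Return "BLAST", "PARTIAL" or None for one alignment.

    Both methods require > 90% identity to the reference.
    BLAST:   alignment covers > 90% of the reference length.
    PARTIAL: alignment covers > 50% but < 90% of the reference length
             and does not end at a contig boundary.
    """
    if identity <= 0.9:
        return None
    if coverage > 0.9:
        return "BLAST"
    if 0.5 < coverage < 0.9 and not ends_at_contig_boundary:
        return "PARTIAL"
    return None


full_hit = classify_alignment(coverage=0.98, identity=0.97)
partial_hit = classify_alignment(coverage=0.70, identity=0.95)
boundary_hit = classify_alignment(coverage=0.70, identity=0.95, ends_at_contig_boundary=True)
```

Cases that the text does not define, such as a coverage of exactly 90%, return None in this sketch.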

Specify which NCBI genetic code to use for translated BLAST.

type: integer

default: 11

NCBI genetic code for translated BLAST. Number from 1 to 33 to represent the translation table used for BLASTX.

See translation table for more details on which table to use.

For more information check the AMRFinderPlus documentation.

Modifies tool parameter(s):

AMRFinderPlus: --translation_table

Add the plus genes to the report.

type: boolean

Provide results from "Plus" genes in the output files.

Mostly the plus genes are an expanded set of genes that are of interest in pathogens. This set includes stress response (biocide, metal, and heat resistance), virulence factors, some antigens, and porins. These "plus" proteins have primarily been added to the database with curated BLAST cutoffs, and are generally identified by BLAST searches. Some of these may not be acquired genes or mutations, but may be intrinsic in some organisms. See AMRFinderPlus database for more details.

Modifies tool parameter(s):

AMRFinderPlus: --plus

Add identifier column to AMRFinderPlus output.

type: boolean

Prepend a column containing an identifier for this run of AMRFinderPlus. For example, this can be used to add a sample name column to the AMRFinderPlus results. If set to true, the sample name is passed as --name <identifier>.

Modifies tool parameter(s):

AMRFinderPlus: --name

Antimicrobial resistance gene detection using a deep learning model. DeepARG is composed of two models for two types of input: short sequence reads and gene-like sequences. In this pipeline we use the ls model, which is suitable for annotating full sequence genes and to discover novel antibiotic resistance genes from assembled samples. The tool Diamond is used as an aligner. More info: https://bitbucket.org/gusphdproj/deeparg-ss/src/master

Skip DeepARG during the ARG screening.

type: boolean

Specify the path to the DeepARG database.

type: string

Specify the path to a local version of the DeepARG database (see the pipeline's usage documentation).

The contents of the directory should include directories such as database, model, and files such as deeparg.gz etc. in the top level.

If no input is given, the module will download the database for you, however this is not recommended, as the database is large and this will take time.

Modifies tool parameter(s):

DeepARG: --data-path

Specify the numeric version number of a user-supplied DeepARG database.

type: integer

default: 2

The DeepARG tool itself does not report explicitly the database version it uses. We assume the latest version (as downloaded by the tool's database download module), however if you supply a different database, you must supply the version with this parameter for use with the downstream hAMRonization tool.

The version number must be without any leading v etc.

Specify which model to use (short or long sequences).

type: string

Specify which model to use: short sequences for reads (SS), or long sequences for genes (LS). In the vast majority of cases we recommend using the LS model when using funcscan.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --model

Specify minimum probability cutoff under which hits are discarded.

type: number

default: 0.8

Sets the minimum probability cutoff below which hits are discarded.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --min-prob

Specify E-value cutoff above which hits are discarded.

type: number

default: 1e-10

Sets the E-value cutoff above which hits are discarded.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --arg-alignment-evalue

Specify percent identity cutoff for sequence alignment under which hits are discarded.

type: integer

default: 50

Sets the value for Identity cutoff for sequence alignment.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --arg-alignment-identity

Specify alignment read overlap.

type: number

default: 0.8

Sets the value for the allowed alignment read overlap.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --arg-alignment-overlap

Specify minimum number of alignments per entry for DIAMOND step of DeepARG.

type: integer

default: 1000

Sets the value of minimum number of alignments per entry for DIAMOND.

For more information check the DeepARG documentation.

Modifies tool parameter(s):

DeepARG: --arg-num-alignments-per-entry

Antimicrobial resistance gene detection using profile hidden Markov models. The tool includes developed and optimised models for a number of resistance gene types, and the functionality to create and optimise models of your own choice of resistance genes. More info: https://github.com/fannyhb/fargene

Skip fARGene during the ARG screening.

type: boolean

Specify a comma-separated list of the pre-defined HMM models to screen against.

type: string

default: class_a,class_b_1_2,class_b_3,class_c,class_d_1,class_d_2,qnr,tet_efflux,tet_rpg,tet_enzyme

pattern:

^(class_a|class_b_1_2|class_b_3|class_c|class_d_1|class_d_2|qnr|tet_efflux|tet_rpg|tet_enzyme)(,(class_a|class_b_1_2|class_b_3|class_c|class_d_1|class_d_2|qnr|tet_efflux|tet_rpg|tet_enzyme))*$

Specify any of the pre-defined HMM models as a comma-separated list:

Class A beta-lactamases: class_a
Subclass B1 and B2 beta-lactamases: class_b_1_2
Subclass B3 beta-lactamases: class_b_3
Class C beta-lactamases: class_c
Class D beta-lactamases: class_d_1, class_d_2
qnr: qnr
Tetracycline resistance genes: tet_efflux, tet_rpg, tet_enzyme

For more information check the fARGene documentation.

For example: --arg_fargenemodel 'class_a,qnr,tet_enzyme'

Modifies tool parameter(s):

fARGene: --hmm-model
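To check a model list before launching a run, the validation pattern given above can be applied directly. The Python sketch below rebuilds that same pattern from the list of model names; the helper name is_valid_model_list is hypothetical.

```python
import re

# Model names exactly as listed in the validation pattern above.
MODELS = (
    "class_a|class_b_1_2|class_b_3|class_c|class_d_1|class_d_2|"
    "qnr|tet_efflux|tet_rpg|tet_enzyme"
)
MODEL_LIST_PATTERN = re.compile(rf"^({MODELS})(,({MODELS}))*$")


def is_valid_model_list(value):
    """Return True if `value` is a comma-separated list of known model names."""
    return MODEL_LIST_PATTERN.match(value) is not None


ok = is_valid_model_list("class_a,qnr,tet_enzyme")
has_space = is_valid_model_list("class_a, qnr")
unknown = is_valid_model_list("class_e")
```

Note that the pattern allows no whitespace after the commas, so "class_a, qnr" is rejected.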

Specify to save intermediate temporary files to results directory.

type: boolean

fARGene generates many additional temporary files which in most cases won't be useful and thus by default are not saved to the pipeline's result directory.

By specifying this parameter, the directories tmpdir/, hmmsearchresults/ and spades_assemblies/ will also be saved in the output directory for closer inspection by the user, if necessary.

The threshold score for a sequence to be classified as a (almost) complete gene.

type: number

The threshold score for a sequence to be classified as a (almost) complete gene. If not pre-assigned, it is assigned by the hmm_model used based on the trade-off between sensitivity and specificity.

For more details see code documentation.

Modifies tool parameter(s):

fARGene: --score

The minimum length of a predicted ORF retrieved from annotating the nucleotide sequences.

type: integer

default: 90

The minimum length of a predicted ORF retrieved from annotating the nucleotide sequences. By default the pipeline assigns this to 90% of the assigned hmm_model sequence length.

For more information check the fARGene documentation.

Modifies tool parameter(s):

fARGene: --min-orf-length

Defines which ORF finding algorithm to use.

type: boolean

By default, the pipeline uses prodigal/prokka for the prediction of ORFs from nucleotide sequences. Another option is the NCBI ORFfinder tool that is built into fARGene, the use of which is activated by this flag.

For more information check the fARGene documentation.

Modifies tool parameter(s):

fARGene: --orf-finder

The translation table/format to use for sequence annotation.

type: string

default: pearson

The translation format that transeq should use for amino acid annotation from the nucleotide sequences. More sequence formats can be found in transeq 'input sequence formats'.

For more information check the fARGene documentation.

Modifies tool parameter(s):

fARGene: --translation-format

Antimicrobial resistance gene detection based on alignment to the CARD database, using homology and SNP models. More info: https://github.com/arpcard/rgi

Skip RGI during the ARG screening.

type: boolean

Path to user-defined local CARD database.

type: string

You can pre-download the CARD database to your machine and pass the path of it to this parameter.

The contents of the directory should include files such as card.json, aro_index.tsv, snps.txt etc. in the top level.

See the pipeline documentation for details on how to download this.

Modifies tool parameter(s):

RGI_CARDANNOTATION: --input

Save RGI output .json file.

type: boolean

When activated, this flag saves the .json file in the RGI output directory. The .json file contains the ARG predictions in a format that can be uploaded to the CARD website for visualization. See RGI documentation for more details. By default, the .json file is generated in the working directory but not saved in the results directory to save disk space (the .json file is quite large and not required downstream in the pipeline).

Specify to save intermediate temporary files in the results directory.

type: boolean

RGI generates many additional temporary files which in most cases won't be useful, thus are not saved by default.

By specifying this parameter, files including temp in their name will also be saved in the output directory for closer inspection by the user.

Specify the alignment tool to be used.

type: string

Specifies the alignment tool to be used. By default RGI runs BLAST and this is also set as default in the nf-core/funcscan pipeline. With this flag the user can choose between BLAST and DIAMOND for the alignment step.

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --alignment_tool

Include all of loose, strict and perfect hits (i.e. ≥ 95% identity) found by RGI.

type: boolean

When activated RGI output will include 'Loose' hits in addition to 'Strict' and 'Perfect' hits. The 'Loose' algorithm works outside of the detection model cut-offs to provide detection of new, emergent threats and more distant homologs of AMR genes, but will also catalog homologous sequences and spurious partial matches that may not have a role in AMR.

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --include_loose

Suppresses the default behaviour of RGI with --arg_rgi_includeloose.

type: boolean

This flag suppresses the default behaviour of RGI, by listing all 'Loose' matches of ≥ 95% identity as 'Strict' or 'Perfect', regardless of alignment length.

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --include_nudge

Include screening of low quality contigs for partial genes.

type: boolean

This flag should be used only when the contigs are of poor quality (e.g. short) to predict partial genes.

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --low_quality

Specify a more specific data-type of input (e.g. plasmid, chromosome).

type: string

This flag is used to specify the data type used as input to RGI. By default this is set as 'NA', which makes no assumptions on input data.

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --data

Run multiple prodigal jobs simultaneously for contigs in a fasta file.

type: boolean

default: true

For more information check the RGI documentation.

Modifies tool parameter(s):

RGI_MAIN: --split_prodigal_jobs

Antimicrobial resistance gene detection based on alignment to NCBI, CARD, ARG-ANNOT, ResFinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB. More info: https://github.com/tseemann/abricate

Skip ABRicate during the ARG screening.

type: boolean

Specify the name of the ABRicate database to use. Names of non-default databases can be supplied if --arg_abricate_db is provided.

type: string

default: ncbi

Specifies which database to use from the list of databases provided with ABRicate.

The databases supported by default are: argannot, card, ecoh, ecoli_vf, megares, ncbi, plasmidfinder, resfinder, vfdb. Other options can be supplied if you have installed a custom one within the directory you have supplied to --arg_abricate_db.

For more information check the ABRicate documentation.

Modifies tool parameter(s):

ABRicate: --db

Path to user-defined local ABRicate database directory for using custom databases.

type: string

Supply this only if you want to use additional custom databases you yourself have added to your ABRicate installation following the instructions here.

The contents of the directory should have a directory named with the database name in the top level (e.g. bacmet2/).

You must also specify the name of the custom database with --arg_abricate_db_id.

Modifies tool parameter(s):

ABRicate: --datadir

Minimum percent identity of alignment required for a hit to be considered.

type: integer

default: 80

Specifies the minimum percent identity used to classify an ARG hit using BLAST alignment.

For more information check the ABRicate documentation.

Modifies tool parameter(s):

ABRicate: --minid

Minimum percent coverage of alignment required for a hit to be considered.

type: integer

default: 80

Specifies the minimum coverage of the nucleotide sequence to be assigned an ARG hit using BLAST alignment. In the ABRicate matrix, an absent gene is assigned (.) and if present, it is assigned the estimated coverage (#).

For more information check the ABRicate documentation.

Modifies tool parameter(s):

ABRicate: --mincov
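The combined effect of the identity and coverage thresholds can be sketched as a simple filter. This Python example is an illustration only and does not parse real ABRicate output; the record layout, the gene names and the numbers are made up.

```python
def filter_hits(hits, minid=80, mincov=80):
    """Keep hits whose percent identity and percent coverage both reach the minimum."""
    return [
        hit for hit in hits
        if hit["identity"] >= minid and hit["coverage"] >= mincov
    ]


# Made-up hits: percent identity and percent coverage of the reference gene.
hits = [
    {"gene": "gene_a", "identity": 99.8, "coverage": 100.0},
    {"gene": "gene_b", "identity": 85.0, "coverage": 62.5},  # coverage too low
    {"gene": "gene_c", "identity": 78.9, "coverage": 95.0},  # identity too low
]
kept = [hit["gene"] for hit in filter_hits(hits)]
relaxed = [hit["gene"] for hit in filter_hits(hits, minid=75, mincov=60)]
```

With the defaults of 80 only gene_a passes; relaxing both thresholds lets all three made-up hits through.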

Influences parameters required for the ARG summary by hAMRonization.

Specifies summary output format.

type: string

Specifies which summary report format to apply with hamronize summarize: tsv, json or interactive (html)

Modifies tool parameter(s):

hamronize summarize: -t, --summary_type

Influences parameters required for the normalization of ARG annotations by argNorm. More info: https://github.com/BigDataBiology/argNorm

Skip argNorm during ARG screening.

type: boolean

These parameters influence general BGC settings like minimum input sequence length.

Specify the minimum length of contigs that go into BGC screening.

type: integer

default: 3000

Specify the minimum length of contigs that go into BGC screening.

If BGC screening is turned on, nf-core/funcscan will generate for each input sample a second FASTA file containing only contigs that are longer than the specified minimum length. This reflects an (approximate) 'biological' minimum length that nucleotide sequences need in order to code for a valid BGC (e.g. not on the edge of a contig), and it also speeds up the BGC screening sections of the pipeline by screening only meaningful contigs.

Note this only affects BGCs. For ARG and AMPs no filtering is performed and all contigs are screened.
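The length filter amounts to the following, shown here as a minimal pure-Python sketch. It is not the code used by the pipeline; the function names and the toy sequences are hypothetical, and treating the minimum length as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=3000):
    """Keep records whose sequence has at least `min_length` nucleotides."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


# Toy input: one 3500 nt contig and one 1200 nt contig.
fasta_text = ">contig_1\n" + "ACGT" * 875 + "\n>contig_2\n" + "ACGT" * 300 + "\n"
kept = [header for header, _ in filter_by_length(read_fasta(fasta_text.splitlines()))]
```

Only contig_1 survives the default threshold of 3000, so only that contig would go into BGC screening in this toy case.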

Specify to save the length-filtered (unannotated) FASTAs used for BGC screening.

type: boolean

Biosynthetic gene cluster detection. More info: https://docs.antismash.secondarymetabolites.org

Skip antiSMASH during the BGC screening.

type: boolean

Path to user-defined local antiSMASH database.

type: string

It is recommended to pre-download the antiSMASH databases to your machine and pass their path to this parameter, as they can take a long time to download, particularly when executing many pipeline runs.

The contents of the database directory should include directories such as as-js/, clusterblast/, clustercompare/ etc. in the top level.

See the pipeline documentation for details on how to download this. If running with docker or singularity, please also check --bgc_antismash_installdir for important information.

Path to user-defined local antiSMASH directory. Only required when running with docker/singularity.

type: string

This is required when running with docker and singularity (not required for conda), due to attempted 'modifications' of files during database checks in the installation directory, something that cannot be done in immutable docker/singularity containers.

Therefore, a local installation directory needs to be mounted (including all modified files from the downloading step) to the container as a workaround.

The contents of the installation directory should include directories such as common/, config/ and files such as custom_typing.py, custom_typing.pyi etc. in the top level.

See the pipeline documentation for details on how to download this. If running with docker or singularity, please also check --bgc_antismash_installdir for important information.

Minimum length a contig must have to be screened with antiSMASH.

type: integer

default: 3000

This specifies the minimum length that a contig must have for the contig to be screened by antiSMASH.

For more information see the antiSMASH documentation.

This will only apply to samples that are screened with antiSMASH (i.e., those samples that have not been removed by --bgc_antismash_sampleminlength).

You may wish to increase this value compared to that of --bgc_antismash_sampleminlength, in cases where you wish to screen higher-quality (i.e. longer) contigs, or speed up runs by not screening lower quality/less informative contigs.

Modifies tool parameter(s):

antiSMASH: --minlength

Turn on clusterblast comparison against database of antiSMASH-predicted clusters.

type: boolean

Compare identified clusters against a database of antiSMASH-predicted clusters using the clusterblast algorithm.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --cb-general

Turn on clusterblast comparison against known gene clusters from the MIBiG database.

type: boolean

This will turn on comparing identified clusters against known gene clusters from the MIBiG database using the clusterblast algorithm.

MIBiG is a curated database of experimentally characterised gene clusters with rich associated metadata.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --cb-knownclusters

Turn on clusterblast comparison against known subclusters responsible for synthesising precursors.

type: boolean

Turn on additional screening for operons involved in the biosynthesis of early secondary metabolite components using the clusterblast algorithm.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --cb-subclusters

Turn on ClusterCompare comparison against known gene clusters from the MIBiG database.

type: boolean

Turn on comparison of detected genes against the MIBiG database using the ClusterCompare algorithm - an alternative to clusterblast.

Note that there will not be a dedicated ClusterCompare output in the antiSMASH results directory, but the results are present in the HTML.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --cc-mibig

Generate phylogenetic trees of secondary metabolite group orthologs.

type: boolean

Turning this on will activate the generation of additional functional and phylogenetic analysis of genes, via comparison against databases of protein orthologs.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --cb-smcog-trees

Defines which level of strictness to use for HMM-based cluster detection.

type: string

Levels of strictness correspond to screening different groups of 'how well-defined' clusters are. For example, loose will include screening for 'poorly defined' clusters (e.g. saccharides), relaxed for partially present clusters (e.g. certain types of NRPS), whereas strict will screen for well-defined clusters such as Ketosynthases.

You can see the rules for the levels of strictness here.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --hmmdetection-strictness

Run Pfam to Gene Ontology mapping module.

type: boolean

This maps the proteins to the Pfam database to annotate BGC modules with functional information based on the protein families they contain. For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --pfam2go

Run RREFinder precision mode on all RiPP gene clusters.

type: boolean

This enables the prediction of regulatory elements on the BGC that help in the control of protein expression. For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --rre

Specify which taxonomic classification of input sequence to use.

type: string

This specifies which set of secondary metabolites to screen for, based on the taxon type the secondary metabolites are from.

This will run different pipelines depending on whether the input sequences are from bacteria or fungi.

For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --taxon

Run TFBS finder on all gene clusters.

type: boolean

This enables the prediction of transcription factor binding sites which control the gene expression. For more information see the antiSMASH documentation.

Modifies tool parameter(s):

antiSMASH: --tfbs

A deep learning genome-mining strategy for biosynthetic gene cluster prediction. More info: https://github.com/Merck/deepbgc/tree/master/deepbgc

Skip DeepBGC during the BGC screening.

type: boolean

Path to local DeepBGC database folder.

type: string

The contents of the database directory should include directories such as common, 0.1.0 in the top level.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: environment variable DEEPBGC_DOWNLOADS_DIR

Average protein-wise DeepBGC score threshold for extracting BGC regions from Pfam sequences.

type: number

default: 0.5

The DeepBGC score threshold for extracting BGC regions from Pfam sequences based on average protein-wise value.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --score

Run DeepBGC's internal Prodigal step in single mode to restrict gene detection to long contigs.

type: boolean

By default DeepBGC's Prodigal runs in 'single genome' mode, which requires sequence lengths to be equal to or longer than 20,000 characters.

However, more fragmented assemblies such as MAGs often contain contigs shorter than this. Therefore, nf-core/funcscan runs Prodigal in meta mode by default; providing this parameter overrides this and runs Prodigal in single genome mode instead.

For more information check the Prodigal documentation.

Modifies tool parameter(s):

DeepBGC: --prodigal-meta-mode
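
To judge whether single genome mode suits a given input, it can help to check how many contigs fall below the 20,000-character limit described above. The following is a minimal sketch and not part of nf-core/funcscan; the function names and the example FASTA content are hypothetical.

```python
from typing import Dict, Iterable, List

# Minimum sequence length for Prodigal's single genome mode, as stated above.
SINGLE_MODE_MIN_LENGTH = 20000


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_name: sequence_length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def short_contigs(lengths: Dict[str, int],
                  minimum: int = SINGLE_MODE_MIN_LENGTH) -> List[str]:
    """List the contigs that are shorter than the single-mode minimum."""
    return [name for name, length in lengths.items() if length < minimum]


# Hypothetical input: one long contig and one short contig.
example = [">contig_1 example", "A" * 25000, ">contig_2", "ACGT" * 100]
lengths = contig_lengths(example)
too_short = short_contigs(lengths)
print(too_short)
```

If the resulting list is not empty, the default meta mode is likely the safer choice for that input.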

Merge detected BGCs within given number of proteins.

type: integer

Merge detected BGCs within given number of proteins.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --merge-max-protein-gap

Merge detected BGCs within given number of nucleotides.

type: integer

Merge detected BGCs within given number of nucleotides.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --merge-max-nucl-gap
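
The effect of a nucleotide gap threshold can be pictured as interval merging. The sketch below is a simplified illustration and not DeepBGC's actual implementation; the function name and coordinates are hypothetical, and it assumes that two regions are merged when the gap between them is at most the given number of nucleotides.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) nucleotide coordinates on one contig


def merge_regions(regions: List[Region], max_nucl_gap: int) -> List[Region]:
    """Merge regions whose separating gap is at most max_nucl_gap nucleotides."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_nucl_gap:
            # Close enough to the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical candidate regions on a single contig.
candidates = [(100, 500), (700, 1200), (5000, 6000)]
unmerged = merge_regions(candidates, max_nucl_gap=0)
merged = merge_regions(candidates, max_nucl_gap=300)
print(unmerged)
print(merged)
```

In this example, a gap threshold of 300 joins the first two regions (separated by 200 nucleotides) while the third stays separate.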

Minimum BGC nucleotide length.

type: integer

default: 1

Minimum length a BGC must have (in bp) to be reported as detected.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --min-nucl

Minimum number of proteins in a BGC.

type: integer

default: 1

Minimum number of proteins a BGC must have to be reported as 'detected'.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --min-proteins

Minimum number of protein domains in a BGC.

type: integer

default: 1

Minimum number of domains a BGC must have to be reported as 'detected'.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --min-domains

Minimum number of known biosynthetic (as defined by antiSMASH) protein domains in a BGC.

type: integer

Minimum number of biosynthetic protein domains a BGC must have to be reported as 'detected'. This is based on antiSMASH definitions.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --min-bio-domains
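
Taken together, the four minimum thresholds above act as filters on candidate BGCs. The following sketch illustrates that idea only; it is not DeepBGC's implementation. The Candidate class, the function name and the example values are hypothetical, and the value 0 used for the biosynthetic domain minimum is an assumption, since no default is listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical summary of one candidate BGC (not a DeepBGC data structure)."""
    name: str
    nucl_length: int
    num_proteins: int
    num_domains: int
    num_bio_domains: int


def passes_minimums(c: Candidate, min_nucl: int = 1, min_proteins: int = 1,
                    min_domains: int = 1, min_bio_domains: int = 0) -> bool:
    """Return True if the candidate meets every minimum threshold."""
    return (c.nucl_length >= min_nucl
            and c.num_proteins >= min_proteins
            and c.num_domains >= min_domains
            and c.num_bio_domains >= min_bio_domains)


# Hypothetical candidates.
candidates: List[Candidate] = [
    Candidate("bgc_a", nucl_length=12000, num_proteins=8, num_domains=15, num_bio_domains=3),
    Candidate("bgc_b", nucl_length=900, num_proteins=1, num_domains=2, num_bio_domains=0),
]

kept_default = [c.name for c in candidates if passes_minimums(c)]
kept_strict = [c.name for c in candidates
               if passes_minimums(c, min_nucl=1000, min_bio_domains=1)]
print(kept_default)
print(kept_strict)
```

Raising any of the minimums can only shrink the set of reported candidates, never enlarge it.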

DeepBGC classification score threshold for assigning classes to BGCs.

type: number

default: 0.5

DeepBGC classification score threshold for assigning classes to BGCs.

For more information see the DeepBGC documentation.

Modifies tool parameter(s):

DeepBGC: --classifier-score

Biosynthetic gene cluster detection using Conditional Random Fields (CRFs). More info: https://gecco.embl.de

Skip GECCO during the BGC screening.

type: boolean

Enable unknown region masking to prevent genes from stretching across unknown nucleotides.

type: boolean

Enable unknown region masking to prevent genes from stretching across unknown nucleotides during ORF detection based on P(y)rodigal.

For more information see the GECCO documentation.

Modifies tool parameter(s):

GECCO: --mask

The minimum number of coding sequences a valid cluster must contain.

type: integer

default: 3

Specify the number of consecutive genes a hit must have to be considered as part of a possible BGC region during BGC extraction.

For more information see the GECCO documentation.

Modifies tool parameter(s):

GECCO: --cds

The p-value cutoff for protein domains to be included.

type: number

default: 1e-9

For more information see the GECCO documentation.

Modifies tool parameter(s):

GECCO: --pfilter

The probability threshold for cluster detection.

type: number

default: 0.8

Specify the minimum probability a predicted gene must have to be considered as part of a BGC during BGC extraction.

Reducing this value may increase the number and length of hits, but will reduce the accuracy of the predictions.

For more information see the GECCO documentation.

Modifies tool parameter(s):

GECCO: --threshold
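
The interplay between the probability threshold and the minimum number of coding sequences can be illustrated with a simplified sketch. It is not GECCO's actual implementation: it assumes that per-gene probabilities are already available, and the function name and example values are hypothetical.

```python
from typing import List, Tuple


def extract_clusters(probabilities: List[float], threshold: float = 0.8,
                     min_cds: int = 3) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of consecutive genes whose
    probability is >= threshold and that contains at least min_cds genes."""
    clusters: List[Tuple[int, int]] = []
    run_start = None
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_cds:
                clusters.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the last gene.
    if run_start is not None and len(probabilities) - run_start >= min_cds:
        clusters.append((run_start, len(probabilities) - 1))
    return clusters


# Hypothetical per-gene probabilities along one contig.
probs = [0.1, 0.9, 0.95, 0.85, 0.2, 0.7, 0.75, 0.9, 0.3]
default_hits = extract_clusters(probs)
relaxed_hits = extract_clusters(probs, threshold=0.7)
print(default_hits)
print(relaxed_hits)
```

With these made-up values, lowering the threshold from 0.8 to 0.7 yields a second run of three genes, in line with the note above that reducing the value may increase the number of hits.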

The minimum number of annotated genes that must separate a cluster from the edge.

type: integer

The minimum number of annotated genes that must separate a possible BGC cluster from the edge. Edge clusters will still be included if they are longer. A lower number will increase the number of false positives on small contigs. Used during BGC extraction.

For more information see the GECCO documentation.

Modifies tool parameter(s):

GECCO: --edge-distance

Biosynthetic Gene Cluster detection based on predefined HMM models. This tool implements methods using probabilistic models called profile hidden Markov models (profile HMMs) to search against a sequence database. More info: http://eddylab.org/software/hmmer/Userguide.pdf

Run hmmsearch during BGC screening.

type: boolean

hmmsearch is not run by default because HMM model files must be provided by the user with the flag --bgc_hmmsearch_models.

Specify path to the BGC hmm model file(s) to search against. Must have quotes if wildcard used.

type: string

hmmsearch performs biosequence analysis using profile hidden Markov Models. The models are stored in .hmm files, whose location is specified with this parameter, e.g.:

--bgc_hmmsearch_models '/<path>/<to>/<models>/*.hmm'

You must wrap the path in quotes if you use a wildcard, to ensure that the wildcard is expanded by Nextflow and not by bash! When using quotes, the absolute path to the HMM file(s) has to be given.

For more information check the HMMER documentation.
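
The reason for quoting can be seen by comparing the argument lists that result with and without shell expansion. The sketch below uses Python's shlex and glob modules to imitate what a POSIX shell does; the temporary directory and file names are hypothetical and exist only for the demonstration.

```python
import glob
import os
import shlex
import tempfile

# Quoted: a POSIX shell passes the pattern through as one literal argument.
quoted = shlex.split("--bgc_hmmsearch_models '/<path>/<to>/<models>/*.hmm'")
print(quoted)

with tempfile.TemporaryDirectory() as models_dir:
    # Create two empty, hypothetical model files.
    for name in ("a.hmm", "b.hmm"):
        open(os.path.join(models_dir, name), "w").close()
    pattern = os.path.join(models_dir, "*.hmm")

    # Unquoted: the shell replaces the pattern with every matching path,
    # so the option is followed by several separate arguments.
    unquoted_argv = ["--bgc_hmmsearch_models"] + sorted(glob.glob(pattern))
    # Quoted: the option is followed by exactly one argument, the pattern itself.
    quoted_argv = ["--bgc_hmmsearch_models", pattern]

print(len(unquoted_argv), len(quoted_argv))
```

In the unquoted case the option receives only the first path as its value and the remaining paths become stray arguments, which is why the pattern should reach Nextflow unexpanded.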

Saves a multiple alignment of all significant hits to a file.

type: boolean

Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to a file.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: -A

Save a simple tabular file summarising the per-target output.

type: boolean

Save a simple tabular (space-delimited) file summarizing the per-target output, with one data line per homologous target sequence found.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: --tblout
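
The per-target table can be read with a few lines of code. The sketch below assumes the standard HMMER3 layout, in which comment lines start with # and each data line has 18 whitespace-separated fields followed by a free-text description; the function name and the example row are hypothetical, not real results.

```python
from typing import Dict, Iterable, List


def parse_tblout(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Extract target, query, full-sequence E-value and score from tblout lines."""
    hits: List[Dict[str, object]] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # The description (last column) may contain spaces, so limit the split.
        fields = line.split(maxsplit=18)
        hits.append({
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),
            "score": float(fields[5]),
        })
    return hits


# Made-up example content in the assumed layout.
example = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "contig_1_42 - example_model - 1.2e-30 105.3 0.1 2.0e-30 104.6 0.1 "
    "1.1 1 0 0 1 1 1 1 hypothetical protein",
]
hits = parse_tblout(example)
print(hits)
```

Each returned dictionary corresponds to one data line, that is, one homologous target sequence.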

Save a simple tabular file summarising the per-domain output.

type: boolean

Save a simple tabular (space-delimited) file summarizing the per-domain output, with one data line per homologous domain detected in a query sequence for each homologous model.

For more information check the HMMER documentation.

Modifies tool parameter(s):

hmmsearch: --domtblout

Parameters used to describe centralised config profiles. These should not be edited.

Git commit id for Institutional configs.

hidden

type: string

default: master

Base directory for Institutional configs.

hidden

type: string

default: https://raw.githubusercontent.com/nf-core/configs/master

If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.

Institutional config name.

hidden

type: string

Institutional config description.

hidden

type: string

Institutional config contact information.

hidden

type: string

Institutional config URL link.

hidden

type: string

Less common options for the pipeline, typically set in a config file.

Display version and exit.

hidden

type: boolean

Method used to save pipeline results to output directory.

hidden

type: string

The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.

Email address for completion summary, only when pipeline fails.

hidden

type: string

pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
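
To see which addresses the pattern above accepts, it can be tried out directly. This is a minimal sketch using Python's re module for illustration; the pipeline's own validation may use a different regular expression engine, and the addresses are made-up examples.

```python
import re

# The pattern exactly as listed above.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_valid_email(address: str) -> bool:
    """Return True if the address matches the documented pattern."""
    return EMAIL_PATTERN.match(address) is not None


for address in (
    "user@example.org",              # accepted
    "first.last@sub.example.co.uk",  # accepted: dots are allowed on both sides
    "user@example",                  # rejected: no top-level domain
    "user@example.museum",           # rejected: final part longer than 5 letters
    "user+tag@example.org",          # rejected: '+' is not in the allowed set
):
    print(address, is_valid_email(address))
```

Note that the pattern is stricter than what mail servers generally accept, so some otherwise working addresses are rejected.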

Send plain-text email instead of HTML.

hidden

type: boolean

File size limit when attaching MultiQC reports to summary emails.

hidden

type: string

default: 25.MB

pattern: ^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
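
The size pattern above can be checked against candidate values before a run. This is a minimal sketch using Python's re module for illustration; the pipeline's own validation may use a different regular expression engine, and the function name is hypothetical.

```python
import re

# The pattern exactly as listed above.
SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")


def is_valid_size(value: str) -> bool:
    """Return True if the value matches the documented size pattern."""
    return SIZE_PATTERN.match(value) is not None


for value in ("25.MB", "1.5 GB", "10B", "25", "25.mb"):
    print(value, is_valid_size(value))
```

The first three values match. The value 25 is rejected because the trailing B is mandatory, and 25.mb is rejected because the unit letters must be upper case.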

Do not use coloured log outputs.

hidden

type: boolean

Incoming hook URL for messaging service

hidden

type: string

Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.

Custom config file to supply to MultiQC.

hidden

type: string

Custom logo file to supply to MultiQC. The file name must also be set in the MultiQC config file.

hidden

type: string

Custom MultiQC yaml file containing HTML including a methods description.

type: string

Whether to validate parameters against the schema at runtime.

hidden

type: boolean

default: true

Base URL or local path to the location of pipeline test dataset files.

hidden

type: string

default: https://raw.githubusercontent.com/nf-core/test-datasets/

Suffix to add to the trace report filename. Default is the date and time in the format yyyy-MM-dd_HH-mm-ss.

hidden

type: string
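
For scripts that need to locate or generate trace report names, the default suffix layout yyyy-MM-dd_HH-mm-ss can be reproduced as follows. This is a minimal sketch; the function name is hypothetical, and the strftime format string is the Python equivalent of the date pattern given above.

```python
from datetime import datetime

# Python strftime equivalent of the pattern yyyy-MM-dd_HH-mm-ss.
TRACE_SUFFIX_FORMAT = "%Y-%m-%d_%H-%M-%S"


def trace_suffix(moment: datetime) -> str:
    """Format a timestamp in the same layout as the default trace suffix."""
    return moment.strftime(TRACE_SUFFIX_FORMAT)


example_suffix = trace_suffix(datetime(2024, 1, 2, 3, 4, 5))
print(example_suffix)  # 2024-01-02_03-04-05
```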