Introduction

The output of nf-core/funcscan provides reports for each of the functional groups:

As a general workflow, we recommend to first look at the summary reports (ARGs, AMPs, BGCs), to get a general overview of what hits have been found across all the tools of each functional group. After which, you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also includes the output from the functional annotation steps of either prokka, pyrodigal, prodigal, or Bakta if the --save_annotations flag was set. Additionally, taxonomic classifications from MMseqs2 are saved if the --taxa_classification_mmseqs_db_savetmp and --taxa_classification_mmseqs_taxonomy_savetmp flags are set.

Similarly, all downloaded databases are saved (i.e. from MMseqs2, antiSMASH, AMRFinderPlus, Bakta, DeepARG, RGI, and/or AMPcombi) into the output directory <outdir>/databases/ if the --save_db flag was set.

Furthermore, for reproducibility, versions of all software used in the run is presented in a MultiQC report.

The directories listed below will be created in the results directory (specified by the --outdir flag) after the pipeline has finished. All paths are relative to this top-level output directory. The default directory structure of nf-core/funcscan is:

results/
├── taxonomic_classification/
|   └── mmseqs_createtsv/
├── annotation/
|   ├── bakta/
|   ├── prodigal/
|   ├── prokka/
|   └── pyrodigal/
├── amp/
|   ├── ampir/
|   ├── amplify/
|   ├── hmmsearch/
|   └── macrel/
├── arg/
|   ├── abricate/
|   ├── amrfinderplus/
|   ├── deeparg/
|   ├── fargene/
|   ├── hamronization/
|   └── rgi/
├── bgc/
|   ├── antismash/
|   ├── deepbgc/
|   ├── gecco/
|   └── hmmsearch/
├── qc/
|   └── seqkit/
├── reports/
|   ├── ampcombi/
|   ├── combgc/
|   └── hamronization_summarize/
├── databases/
├── multiqc/
└── pipeline_info/
work/

Pipeline overview

The pipeline is built using Nextflow and processes prokaryotic sequence data through the following steps:

Input contig QC with:

  • SeqKit (default) - for separating into long- and short- categories

Taxonomy classification of nucleotide sequences with:

  • MMseqs2 (default) - for contig taxonomic classification using 2bLCA.

ORF prediction and annotation with any of:

  • Pyrodigal (default) – for open reading frame prediction.
  • Prodigal – for open reading frame prediction.
  • Prokka – open reading frame prediction and functional protein annotation.
  • Bakta – open reading frame prediction and functional protein annotation.

Antimicrobial Resistance Genes (ARGs):

  • ABRicate – antimicrobial resistance gene detection, based on alignment to one of several databases.
  • AMRFinderPlus – antimicrobial resistance gene detection, using NCBI’s curated Reference Gene Database and curated collection of Hidden Markov Models.
  • DeepARG – antimicrobial resistance gene detection, using a deep learning model.
  • fARGene – antimicrobial resistance gene detection, using Hidden Markov Models.
  • RGI – antimicrobial resistance gene detection, based on alignment to the CARD database.

Antimicrobial Peptides (AMPs):

  • ampir – antimicrobial peptide detection, based on a supervised statistical machine learning approach.
  • amplify – antimicrobial peptide detection, using a deep learning model.
  • hmmsearch – antimicrobial peptide detection, based on hidden Markov models.
  • Macrel – antimicrobial peptide detection, using a machine learning approach.

Biosynthetic Gene Clusters (BGCs):

  • antiSMASH – biosynthetic gene cluster detection.
  • deepBGC - biosynthetic gene cluster detection, using a deep learning model.
  • GECCO – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs).
  • hmmsearch – biosynthetic gene cluster detection, based on hidden Markov models.

Output Summaries:

  • AMPcombi – summary report of antimicrobial peptide gene output from various detection tools.
  • hAMRonization – summary of antimicrobial resistance gene output from various detection tools.
  • comBGC – summary of biosynthetic gene cluster output from various detection tools.
  • MultiQC – report of all software and versions used in the pipeline.
  • Pipeline information – report metrics generated during the workflow execution.

Tool details

Taxonomic classification tool

Output files
  • taxonomic_classification/mmseqs2_createtsv/
    • <samplename>/:
      • *.tsv: tab-separated table containing the taxonomic lineage of every contig. When a contig cannot be classified according to the database, it is assigned in the ‘lineage’ column as ‘no rank | unclassified’.
  • reports/<workflow>/<workflow>_complete_summary_taxonomy.tsv.gz: tab-separated table containing the concatenated results from the summary tables along with the taxonomic classification if the parameter --run_taxa_classification is called.

MMseqs2 classifies the taxonomic lineage of contigs based on the last common ancestor. The inferred taxonomic lineages are included in the final workflow summaries to annotate the potential source bacteria of the identified AMPs, ARGs, and/or BGCs.

Annotation tools

Pyrodigal, Prodigal, Prokka, Bakta

Annotation tools

Prodigal

Output files
  • prodigal/
    • category/: indicates whether annotation files are of all contigs or long-only contigs (BGC subworkflow only)
      • <samplename>/:
        • *.fna: nucleotide FASTA file of the input contig sequences
        • *.faa: protein FASTA file of the translated CDS sequences
        • *.gbk: annotation in GBK format, containing both sequences and annotations

Descriptions taken from the Prodigal documentation

Prodigal annotates whole (meta-)genomes by identifying ORFs in a set of genomic DNA sequences. The output is used by some of the functional screening tools.

Pyrodigal

Output files
  • pyrodigal/
    • category/: indicates whether annotation files are of all contigs or long-only contigs (BGC subworkflow only)
      • <samplename>/:
        • *.gbk: annotation in GBK format, containing both sequences and annotations
        • *.fna: nucleotide FASTA file of the annotated CDS sequences
        • *.faa: protein FASTA file of the translated CDS sequences

Descriptions taken from the Pyrodigal documentation

Pyrodigal annotates whole (meta-)genomes by identifying ORFs in a set of genomic DNA sequences. It produces the same results as Prodigal while being more resource-optimized, thus faster. Unlike Prodigal, Pyrodigal cannot produce output in GenBank format. The output is used by some of the functional screening tools.

Prokka

Output files
  • prokka/
    • category/: indicates whether annotation files are of all contigs or long-only contigs (BGC subworkflow only)
      • <samplename>/
        • *.gff: annotation in GFF3 format, containing both sequences and annotations
        • *.gbk: standard Genbank file derived from the master .gff
        • *.fna: nucleotide FASTA file of the input contig sequences
        • *.faa: protein FASTA file of the translated CDS sequences
        • *.ffn: nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
        • *.sqn: an ASN1 format “Sequin” file for submission to Genbank
        • *.fsa: nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file
        • *.tbl: feature Table file, used by “tbl2asn” to create the .sqn file
        • *.err: unacceptable annotations - the NCBI discrepancy report
        • *.log: logging output that Prokka produced during its run
        • *.txt: statistics relating to the annotated features found
        • *.tsv: tab-separated file of all features

Descriptions directly from the Prokka documentation

Prokka performs whole genome annotation to identify features of interest in a set of (meta-)genomic DNA sequences. The output is used by some of the functional screening tools.

Bakta

Output files
  • bakta/
    • category/: indicates whether annotation files are of all contigs or long-only contigs (BGC only)
      • <samplename>
        • <samplename>.gff3: annotations & sequences in GFF3 format
        • <samplename>.gbff: annotations & sequences in (multi) GenBank format
        • <samplename>.ffn: feature nucleotide sequences as FASTA
        • <samplename>.fna: replicon/contig DNA sequences as FASTA
        • <samplename>.embl: annotations & sequences in (multi) EMBL format
        • <samplename>.faa: CDS/sORF amino acid sequences as FASTA
        • <samplename>_hypothetical.faa: further information on hypothetical protein CDS as simple human readble tab separated values
        • <samplename>_hypothetical.tsv: hypothetical protein CDS amino acid sequences as FASTA
        • <samplename>.tsv: annotations as simple human readble TSV
        • <samplename>.txt: summary in TXT format

Descriptions taken from the Bakta documentation.

Bakta is a tool for the rapid & standardised annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis. The output is used by some of the functional screening tools.

AMP detection tools

ampir, AMPlify, hmmsearch, Macrel

ampir

Output files
  • ampir/
    • <samplename>.ampir.faa: predicted AMP sequences in FAA format
    • <samplename>.ampir.tsv: predicted AMP metadata in TSV format, contains contig name, sequence and probability score

ampir (antimicrobial peptide prediction in r) was designed to predict antimicrobial peptides (AMPs) from any given size protein dataset. ampir uses a supervised statistical machine learning approach to predict AMPs. It incorporates two support vector machine classification models, “precursor” and “mature” that have been trained on publicly available antimicrobial peptide data.

AMPlify

Output files
  • amplify/
    • *_results.tsv: table of contig amino-acid sequences with prediction result (AMP or non-AMP) and information on sequence length, charge, probability score, AMPlify log-scaled score)

AMPlify is an attentive deep learning model for antimicrobial peptide prediction. It takes in contig annotations (as protein sequences) and classifies them as either AMP or non-AMP.

hmmsearch

Output files
  • hmmersearch/
    • *.txt.gz: human readable output summarizing hmmsearch results
    • *.sto.gz: optional multiple sequence alignment (MSA) in Stockholm format
    • *.tbl.gz: optional tabular (space-delimited) summary of per-target output
    • *.domtbl.gz: optional tabular (space-delimited) summary of per-domain output

HMMER/hmmsearch is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs). hmmsearch is used to search one or more profiles against a sequence database.

Macrel

Output files
  • macrel_contigs/
    • *.smorfs.faa.gz: zipped fasta file containing amino acid sequences of small peptides (<100 aa, small open reading frames) showing the general gene prediction information in the contigs
    • *.all_orfs.faa.gz: zipped fasta file containing amino acid sequences showing the general gene prediction information in the contigs
    • prediction.gz: zipped file, with all predicted and non-predicted AMPs in a table format
    • *.md: readme file containing tool specific information (e.g. citations, details about the output, etc.)
    • *_log.txt: log file containing the information pertaining to the run

Macrel is a tool that mines antimicrobial peptides (AMPs) from (meta)genomes by predicting peptides from genomes (provided as contigs) and outputs predicted antimicrobial peptides that meet specific criteria/thresholds.

ARG detection tools

ABRicate, AMRFinderPlus, DeepARG, fARGene, RGI

ABRicate

Output files
  • abricate/
    • *.{csv,tsv}: search results in tabular format

ABRicate screens contigs for antimicrobial resistance or virulence genes. It comes bundled with multiple databases: NCBI, CARD, ARG-ANNOT, Resfinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB.

AMRFinderPlus

Output files
  • amrfinderplus/
    • *.tsv: search results in tabular format

AMRFinderPlus relies on NCBI’s curated Reference Gene Database and curated collection of Hidden Markov Models. It identifies antimicrobial resistance genes, resistance-associated point mutations, and select other classes of genes using protein annotations and/or assembled nucleotide sequences.

DeepARG

Output files
  • deeparg/
    • *.align.daa*: Intermediate DIAMOND alignment output
    • *.align.daa.tsv: DIAMOND alignment output as .tsv
    • *.mapping.ARG: ARG predictions with a probability >= —prob (0.8 default).
    • *.mapping.potential.ARG: ARG predictions with a probability < —prob (0.8 default)

deepARG uses deep learning to characterize and annotate antibiotic resistance genes in metagenomes. It is composed of two models for two types of input: short sequence reads and gene-like sequences. In this pipeline we use the ls model, which is suitable for annotating full sequence genes and to discover novel antibiotic resistance genes from assembled samples. The tool DIAMOND is used as an aligner.

fARGene

Output files
  • fargene/
    • fargene_analysis.log: logging output that fARGene produced during its run
    • <sample_name>/:
      • hmmsearchresults/: output from intermediate hmmsearch step (only if --arg_fargene_savetmpfiles supplied)
      • predictedGenes/:
        • *-filtered.fasta: nucleotide sequences of predicted ARGs
        • *-filtered-peptides.fasta: amino acid sequences of predicted ARGs
      • results_summary.txt: text summary of results, listing predicted genes and ORFs for each input file
      • tmpdir/: temporary output files and fasta files (only if --arg_fargene_savetmpfiles supplied)

fARGene (Fragmented Antibiotic Resistance Gene Identifier) is a tool that takes either fragmented metagenomic data or longer sequences as input and predicts and delivers full-length antibiotic resistance genes as output. The tool includes developed and optimised models for a number of resistance gene types. By default the pipeline runs all models, thus you will receive output for all models. If only a sub-list or single model is required, this can be specified with the --hmm-model flag. Available models are:

  • class_a: class A beta-lactamases
  • class_b_1_2: subclass B1 and B2 beta-lactamases
  • class_b3: subclass B3 beta-lactamases
  • class_c: class C beta-lactamases
  • class_d_1, class_d_2: class D beta-lactamases
  • qnr: quinolone resistance genes
  • tet_efflux, tet_rpg, tet_enzyme: tetracycline resistance genes

RGI

Output files
  • rgi/
    • <samplename>.txt: hit results table separated by ’#’
    • <samplename>.json: hit results in json format (only if --arg_rgi_savejson supplied)
    • temp/:
      • <samplename>.fasta.temp.*.json: temporary json files, ’*’ stands for ‘homolog’, ‘overexpression’, ‘predictedGenes’ and ‘predictedGenes.protein’ (only if --arg_rgi_savetmpfiles supplied).

RGI (Resistance Gene Identifier) predicts resistome(s) from protein or nucleotide data based on homology and SNP models. It uses reference data from the Comprehensive Antibiotic Resistance Database (CARD).

BGC detection tools

antiSMASH, deepBGC, GECCO, hmmsearch.

Note that the BGC tools are run on a set of annotations generated on only long contigs (3000 bp or longer) by default. These specific filtered FASTA files are under bgc/seqkit/, and annotations files are under annotation/<annotation_tool>/long/, if the corresponding saving flags are specified (see parameter docs). However the same annotations should also be annotation files in the sister all/ directory.

Input contig QC

Output files
  • seqkit/
    • <samplename>_long.fasta: FASTA file containing contigs equal or longer than the threshold set by --contig_qc_lengththreshold used in BGC subworkflow

SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

Note that filtered FASTA is only used for BGC workflow for run-time optimisation and biological reasons. All contigs are otherwise screened in ARG/AMP workflows.

antiSMASH

Output files
  • antismash/
    • css: accessory files for the HTML output
    • clusterblastoutput.txt (optional): raw BLAST output of known clusters previously predicted by antiSMASH using the built-in ClusterBlast algorithm
    • images: accessory files for the HTML output
    • index.html: interactive web view of results in HTML format
    • js: accessory files for the HTML output
    • knownclusterblast/: directory with MIBiG hits (optional)
      • *_c*.txt: tables with MIBiG hits
    • knownclusterblastoutput.txt (optional): raw BLAST output of known clusters of the MIBiG database.
    • regions.js: sideloaded annotations of protoclusters and/or subregions
    • *region*.gbk: nucleotide sequence + annotations in GenBank file format; one file per antiSMASH hit.
    • <sample name>.gbk: nucleotide sequence and annotations in GenBank format; converted from input file
    • <sample name>.json: nucleotide sequence and annotations in JSON format; converted from GenBank file
    • <sample name>.log: logging output that antiSMASH produced during its run
    • <sample name>.zip: compressed version of the output folder in zip format

antiSMASH (antibiotics & Secondary Metabolite Analysis SHell) is a tool for rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial genomes. It identifies biosynthetic loci covering all currently known secondary metabolite compound classes in a rule-based fashion using profile HMMs and aligns the identified regions at the gene cluster level to their nearest relatives from a database containing experimentally verified gene clusters (MIBiG).

deepBGC

Output files
  • deepbgc/
    • README.txt: Summary of output files generated
    • LOG.txt: Log output of DeepBGC
    • *.antismash.json: AntiSMASH JSON file for sideloading
    • *.bgc.gbk: Sequences and features of all detected BGCs in GenBank format
    • *.bgc.tsv: Table of detected BGCs and their properties
    • *.full.gbk: Fully annotated input sequence with proteins, Pfam domains (PFAM_domain features) and BGCs (cluster features)
    • *.pfam.tsv: Table of Pfam domains (pfam_id) from given sequence (sequence_id) in genomic order, with BGC detection scores
    • evaluation/
      • *.bgc.png: Detected BGCs plotted by their nucleotide coordinates
      • *.pr.png: Precision-Recall curve based on predicted per-Pfam BGC scores
      • *.roc.png: ROC curve based on predicted per-Pfam BGC scores
      • *.score.png: BGC detection scores of each Pfam domain in genomic order

deepBGC detects BGCs in bacterial and fungal genomes using deep learning. DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network and a word2vec-like vector embedding of Pfam protein domains. Product class and activity of detected BGCs is predicted using a Random Forest classifier.

GECCO

Output files
  • gecco/
    • *.genes.tsv/: TSV file containing detected/predicted genes with BGC probability scores
    • *.features.tsv: TSV file containing identified domains
    • *.clusters.tsv: TSV file containing coordinates of predicted clusters and BGC types
    • *_cluster_*.gbk: GenBank file (if clusters were found) containing sequence with annotations; one file per GECCO hit

GECCO is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).

Summary tools

AMPcombi, hAMRonization, comBGC, MultiQC, pipeline information

AMPcombi

Output files
  • ampcombi/
    • Ampcombi_summary.tsv: tab-separated table containing the concatenated and filtered results from each AMPcombi summary table. This is the output given when the taxonomic classification is not activated (pipeline default).
    • Ampcombi_parse_tables.log: log file containing the run information from AMPcombi submodule ampcombi2/parsetables
    • Ampcombi_complete.log: log file containing the run information from AMPcombi submodule ampcombi2/complete
    • Ampcombi_summary_cluster.tsv: tab-separated table containing the clustered AMP hits. This is the output given when the taxonomic classification is not activated (pipeline default).
    • Ampcombi_summary_cluster_representative_seq.tsv: tab-separated table containing the representative sequence of each cluster. This can be used in AMPcombi for constructing 3D structures using ColabFold. For more details on how to do this, please refer to the AMPcombi documentation.
    • Ampcombi_cluster.log: log file containing the run information from AMPcombi submodule ampcombi2/cluster
    • ampcombi_complete_summary_taxonomy.tsv.gz: summarised output from all AMP workflow tools with taxonomic assignment in compressed tsv format. This is the same output as Ampcombi_summary_cluster.tsv file but with taxonomic classification of the contig.
    • <sample>/contig_gbks: contains all the contigs in gbk format that an AMP was found on using the custom parameters
    • <sample>/*_ampcombi.log: a log file generated by AMPcombi
    • <sample>/*_ampcombi.tsv: summarised output in tsv format for each sample
    • <sample>/*_amp.faa*: fasta file containing the amino acid sequences for all AMP hits for each sample
    • <sample>/*_diamond_matches.txt*: alignment file generated by DIAMOND for each sample
    AMP summary table header descriptions
Table columnDescription
nameName of the sample
contig_idContig header
prob_macrelProbability associated with the AMP prediction using MACREL
prob_neubiProbability associated with the AMP prediction using NEUBI
prob_ampirProbability associated with the AMP prediction using AMPIR
prob_amplifyProbability associated with the AMP prediction using AMPLIFY
evalue_hmmerExpected number of false positives (nonhomologous sequences) with a similar of higher score. This stands for how significant the hit is, the lower the evalue, the more significant the hit
aa_sequenceAmino acid sequence that forms part of the contig and is AMP encoding
target_idDRAMP ID within the database found to be similar to the predicted AMP by DIAMOND alignment
pidentPercentage identity of amino acid residues that fully aligned between the DRAMP sequence and the predicted AMP sequence
evalueNumber of alignments of similar or better qualities that can be expected when searching a database of similar size with a random sequence distribution. This is generated by DIAMOND alignments using the DRAMP AMP database. The lower the value the more significant that the hit is positive. An e-value of < 0.001 means that the this hit will be found by chance once per 1,0000 queries
SequenceSequence corresponding to the DRAMP ID found to be similar to the predicted AMP sequence
Sequence_lengthNumber of amino acid residues in the DRAMP sequence
NameFull name of the peptide copied from the database it was uploaded to
Swiss_Prot_EntryEntry name of the peptide within the UniProtKB/Swiss-Prot database
FamilyName of the family, group or class of AMPs this peptide belongs to, e.g. bacteriocins
GeneName of the gene (if available in the database) that encodes the peptide
SourceName of the source organism (if available in the database) from which the peptide was extracted
ActivityPeptide activity, e.g. Antimicrobial, Antibacterial, Anti-Gram+, Anti-Gram-, Insecticidal or Antifungal
Protein_existencePeptide status, e.g. only a homology, protein level, predicted or transcript level
StructureType of peptide structure, e.g. alpha helix, bridge, etc.
Structure_DescriptionFurther description of the structure if available
PDB_IDThe ID of an equivalent peptide found in the protein data bank PDB
CommentsFurther details found in the database regarding the peptide
Target_OrganismName of the target organism to which the peptide is effective against
Hemolytic_activityType of hemolytic activity if any
Linear/Cyclic/BranchedWhether the hit is a linear, cyclic or branched peptide
N-terminal_ModificationWhether it contains N-terminal_Modification
C-termina_ModificationWhether it contains C-terminal_Modification
Other_ModificationsWhether there are any other modifications found in the peptide structure
StereochemistryType of peptide stereochemistry if available
CytotoxicityCytotoxicity mechanism of the peptide if available
Binding_TargetPeptide binding target, e.g. lipid, cell membrane or chitin binding
Pubmed_IDPubmed ID if a publication is associated with the peptide
ReferenceCitation of the associated publication if available
AuthorAuthors’ names associated with the publication or who have uploaded the peptide
TitlePublication title if available
...

AMPcombi summarizes the results of antimicrobial peptide (AMP) prediction tools (ampir, AMPlify, Macrel, and other non-nf-core tools) into a single table and aligns the hits against a reference AMP database for functional and taxonomic classification. It assigns the physiochemical properties (e.g. hydrophobicity, molecular weight) using the Biopython toolkit. Additionally, it clusters the resulting AMP hits from all samples using MMseqs2. For further filtering for AMPs with signaling peptides, the output file Ampcombi_summary_cluster.tsv or ampcombi_complete_summary_taxonomy.tsv.gz can be used downstream as detailed here.

hAMRonization

Output files
  • hamronization_summarize/ one of the following:
    • hamronization_combined_report.json: summarised output in .json format
    • hamronization_combined_report.tsv: summarised output in .tsv format when the taxonomic classification is turned off (pipeline default).
    • hamronization_combined_report.tsv.gz: summarised output in gzipped format when the taxonomic classification is turned on by --run_taxa_classification.
    • hamronization_combined_report.html: interactive output in .html format
ARG summary table headers
Table columnDescription
input_file_nameName of the file containing the sequence data to be analysed
gene_symbolShort name of a gene; a single word that does not contain white space characters. It is typically derived from the gene name
gene_nameName of a gene
reference_database_nameIdentifier of a biological or bioinformatics database
reference_database_versionVersion of the database containing the reference sequences used for analysis
reference_accessionIdentifier that specifies an individual sequence record in a public sequence repository
analysis_software_nameName of a computer package, application, method or function used for the analysis of data
analysis_software_versionVersion of software used to analyze data
genetic_variation_typeClass of genetic variation detected
antimicrobial_agent (optional)A substance that kills or slows the growth of microorganisms, including bacteria, viruses, fungi and protozoans
coverage_percentage (optional)Percentage of the reference sequence covered by the sequence of interest
coverage_depth (optional)Average number of reads representing a given nucleotide in the reconstructed sequence
coverage_ratio (optional)Ratio of the reference sequence covered by the sequence of interest.
drug_class (optional)Set of antibiotic molecules, with similar chemical structures, molecular targets, and/or modes and mechanisms of action
input_gene_length (optional)Length (number of positions) of a target gene sequence submitted by a user
input_gene_start (optional)Position of the first nucleotide in a gene sequence being analysed (input gene sequence)
input_gene_stop (optional)Position of the last nucleotide in a gene sequence being analysed (input gene sequence)
input_protein_length (optional)Length (number of positions) of a protein target sequence submitted by a user
input_protein_start (optional)Position of the first amino acid in a protein sequence being analysed (input protein sequence)
input_protein_stop (optional)Position of the last amino acid in a protein sequence being analysed (input protein sequence)
input_sequence_id (optional)Identifier of molecular sequence(s) or entries from a molecular sequence database
nucleotide_mutation (optional)Nucleotide sequence change(s) detected in the sequence being analysed compared to a reference
nucleotide_mutation_interpretation (optional)Description of identified nucleotide mutation(s) that facilitate clinical interpretation
predicted_phenotype (optional)Characteristic of an organism that is predicted rather than directly measured or observed
predicted_phenotype_confidence_level (optional)Confidence level in a predicted phenotype
amino_acid_mutation (optional)Amino acid sequence change(s) detected in the sequence being analysed compared to a reference
amino_acid_mutation_interpretation (optional)Description of identified amino acid mutation(s) that facilitate clinical interpretation.
reference_gene_length (optional)Length (number of positions) of a gene reference sequence retrieved from a database
reference_gene_start (optional)Position of the first nucleotide in a reference gene sequence
reference_gene_stop (optional)Position of the last nucleotide in a reference sequence
reference_protein_length (optional)Length (number of positions) of a protein reference sequence retrieved from a database
reference_protein_start (optional)Position of the first amino acid in a reference protein sequence
reference_protein_stop (optional)Position of the last amino acid in a reference protein sequence
resistance_mechanism (optional)Antibiotic resistance mechanisms evolve naturally via natural selection through random mutation, but it could also be engineered by applying an evolutionary stress on a population
strand_orientation (optional)Orientation of a genomic element on the double-stranded molecule
sequence_identity (optional)Sequence identity is the number (%) of matches (identical characters) in positions from an alignment of two molecular sequences

hAMRonization summarizes the outputs of the antimicrobial resistance gene detection tools (ABRicate, AMRFinderPlus, DeepARG, fARGene, RGI) into a single unified tabular format. It supports a variety of summary options including an interactive summary.

comBGC

Output files
  • comBGC/
    • combgc_complete_summary.tsv: summarised output from all BGC detection tools used in tsv format (all samples concatenated). This is the output given when the taxonomic classification is not activated (pipeline default).
    • combgc_complete_summary.tsv.gz: summarised output in gzipped format from all BGC detection tools used in tsv format (all samples concatenated) along with the taxonomic classification obtained when --run_taxa_classification is activated.
    • */combgc_summary.tsv: summarised output from all applied BGC detection tools in tsv format for each sample.
BGC summary table headers
Table columnDescription
Sample_IDID of the sample
Prediction_toolBGC prediction tool (antiSMASH, DeepBGC, and/or GECCO)
Contig_IDID of the contig containing the candidate BGC
Product_classPredicted BGC type/class
BGC_probabilityConfidence of BGC candidate as inferred by the respective tool
BGC_completeWhether BGC sequence is assumed to be complete or truncated by the edge of the contig
BGC_startPredicted BGC start position on the ontig
BGC_endPredicted BGC end position on the contig
BGC_lengthLength of the predicted BGC
CDS_IDID of the coding sequence(s) (CDS) from the annotation step (prodigal/prokka/bakta) if provided by the tool
CDS_countNumber of CDSs the BGC contains
PFAM_domainsInferred PFAM IDs or annotations if provided by the tool
MIBiG_IDInferred MIBiG IDs if provided by the tool
InterPro_IDInferred InterPro IDs if provided by the tool

comBGC is a tool built for nf-core/funcscan which summarizes the results of the Biosynthetic Gene Cluster (BGC) prediction tools (antiSMASH, deepBGC, GECCO) used in the pipeline into one comprehensive tabular summary with standardised headers.

ℹ️ comBGC does not feature hmmer_hmmsearch support. Please check the hmmsearch results directory.

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser
    • multiqc_data/: directory containing raw parsed data used for MultiQC report rendering
    • multiqc_plots/: directory containing any static images from the report in various formats

MultiQC is used in nf-core/funcscan to report the versions of all software used in the given pipeline run, and provides a suggested methods text. This allows for reproducible analysis and transparency in method reporting in publications.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameter’s are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.