Tutorials

This page provides a range of tutorials to give you a bit more guidance on how to set up nf-core/taxprofiler runs in the wild.

Simple Tutorial

In this tutorial we will walk you through a simple set-up of a small nf-core/taxprofiler run. It assumes that you have basic knowledge of metagenomic classification input and output files.

Preparation

Hardware

The datasets used should be small enough to run on your own laptop or a single server node.

If you wish to use an HPC cluster or cloud, and don’t wish to use an ‘interactive’ session submitted to your scheduler, please see the nf-core documentation on how to make a relevant config file.
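For reference, a minimal custom config for a scheduler-based run might look like the sketch below; the executor and queue names here are assumptions, so replace them with whatever your cluster actually provides. You would then pass the file to Nextflow with -c custom.config.

custom.config
process {
    executor = 'slurm'   // assumes a SLURM scheduler; use e.g. 'sge' or 'lsf' as appropriate
    queue    = 'short'   // hypothetical queue/partition name on your cluster
}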

You will need internet access and at least 1.5 GB of hard drive space.

Software

The tutorial assumes you are on a Unix-based operating system, and have already installed Nextflow as well as a software environment system such as Conda, Docker, or Singularity/Apptainer. The tutorial will use Docker, however you can simply replace references to docker with conda, singularity, or apptainer accordingly.
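You can quickly sanity-check your setup by printing the tool versions (swap the second command for conda --version, singularity --version, or apptainer --version depending on which system you use):

nextflow -version
docker --version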

Data

First we will make a directory to run the whole tutorial in.

mkdir taxprofiler-tutorial
cd taxprofiler-tutorial/

We will use very small, pre-subset short-read metagenomes that are used for pipeline testing. nf-core/taxprofiler accepts FASTQ or FASTA files as input formats, however we will use FASTQ here as it is the more common format in taxonomic classification. You can download these metagenomes with the following commands.

curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_1.fastq.gz
curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_2.fastq.gz
curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_B_1.fastq.gz
curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_B_2.fastq.gz

In this tutorial we will demonstrate running with three different profilers, and in one of those cases, running the same database twice but with different parameters. The databases consist of two genomes of species known to be present in the metagenomes. You can download the databases for Kraken2, Centrifuge, and Kaiju with the following commands.

curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/database/kraken2/testdb-kraken2.tar.gz
curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/database/centrifuge/test-db-centrifuge.tar.gz
curl -O https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/database/kaiju/kaiju.tar.gz

To demonstrate that nf-core/taxprofiler can also accept databases as uncompressed folders, we can extract one of them.

tar -xzf kaiju.tar.gz
Note

You must provide these databases pre-built to the pipeline; nf-core/taxprofiler neither comes with default databases nor can it generate databases for you. For guidance on how to build databases, see the Retrieving databases or building custom databases tutorial.

Finally, an important step of any metagenomic classification is to remove contamination. Contamination can come from many places, typically from the host of a host-associated sample, however it can also be introduced during laboratory processing of samples. A common contaminant in Illumina sequencing is a spike-in control of the PhiX virus genome, which we can download with the following command.

curl -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz

Preparing Input

Sample sheet

You provide the sequencing data FASTQ files to nf-core/taxprofiler via an input ‘sample sheet’ .csv file. This is a 6-column table that includes sample and library names, instrument platform, and paths to the sequencing data.

Open a text editor, and create a file called samplesheet.csv. Copy and paste the following lines into the file and save it.

samplesheet.csv
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
ERX5474932,ERR5766176,ILLUMINA,ERX5474932_ERR5766176_1.fastq.gz,ERX5474932_ERR5766176_2.fastq.gz,
ERX5474932,ERR5766176_B,ILLUMINA,ERX5474932_ERR5766176_B_1.fastq.gz,ERX5474932_ERR5766176_B_2.fastq.gz,

Here we have specified two libraries of the same sample, that they were sequenced on an Illumina platform, and the paths to the FASTQ files. If you had placed your FASTQ files elsewhere, you would give the full path (i.e., with relevant directories) in the fastq_1, fastq_2 and fasta columns.

Database sheet

For the database(s), you also supply these via a .csv file. This 4-column table contains the tool the database has been built for, a database name, the parameters you wish reads to be queried against the given database with, and a path to a .tar.gz archive file or a directory containing the database files.

Open a text editor, and create a file called database.csv. Copy and paste the following lines into the file and save it.

database.csv
tool,db_name,db_params,db_path
kraken2,db1,--quick,testdb-kraken2.tar.gz
centrifuge,db2,,test-db-centrifuge.tar.gz
centrifuge,db2_trimmed,--trim5 2 --trim3 2,test-db-centrifuge.tar.gz
kaiju,db3,,kaiju/

You can see here we have specified the Centrifuge database twice, to allow comparison of different settings. Note that each database of the same tool has a unique name. Furthermore, while the Kraken2 and Centrifuge databases have been supplied as .tar.gz archives, the Kaiju database has been supplied as a directory.

Running the pipeline

Now that we have the sequencing reads (in FASTQ format), the databases (directory or .tar.gz), and a reference genome (FASTA, optionally gzipped), we can run the pipeline. The following command will perform short-read quality control, remove contaminant reads, merge the multiple libraries of each sample, run the three profilers, and finally generate standardised profiles.

nextflow run nf-core/taxprofiler -r 1.1.0 -profile docker \
--input samplesheet.csv --databases database.csv --outdir ./results \
--perform_shortread_qc \
--perform_shortread_hostremoval --hostremoval_reference GCF_000819615.1_ViralProj14015_genomic.fna.gz \
--perform_runmerging --save_runmerged_reads \
--run_centrifuge --run_kaiju --run_kraken2 \
--run_profile_standardisation \
--max_cpus 2 --max_memory '6.GB'
Info

With all Docker containers pre-downloaded, this run took 2 minutes and 31 seconds on a laptop running Ubuntu 22.04.2 with 32 GB RAM and 16 CPUs. If you are running nf-core/taxprofiler for the first time, expect this command to take longer as Nextflow will have to download each software container for each step of the pipeline.
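If a run gets interrupted partway through (e.g. your machine goes to sleep or a download fails), you do not need to start from scratch: re-run the exact same command with Nextflow’s -resume flag appended, and all previously completed steps will be reused from the cache. For example:

nextflow run nf-core/taxprofiler -r 1.1.0 -profile docker -resume \
--input samplesheet.csv --databases database.csv --outdir ./results \
--perform_shortread_qc --perform_shortread_hostremoval --hostremoval_reference GCF_000819615.1_ViralProj14015_genomic.fna.gz \
--perform_runmerging --save_runmerged_reads --run_centrifuge --run_kaiju --run_kraken2 \
--run_profile_standardisation --max_cpus 2 --max_memory '6.GB'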

To break down each line of the command:

  • Tell Nextflow to run nf-core/taxprofiler with the particular version and using the Docker container system
  • Specify the input and outputs, i.e., paths to the samplesheet.csv, database.csv, and directory where to save the results
  • Turn on basic quality control of input reads: adapter clipping, length filtering, etc
  • Turn on the removal of host or contaminant reads, and specify the path to reference genome of this
  • Turn on run merging, i.e., combine the processed reads of the multiple libraries of each sample into one file per sample, and save these reads (e.g. for downstream use)
  • Turn on the different taxonomic profiling tools you wish to use
  • Turn on profile standardisation and multi-sample taxon tables
  • (Optional) provide a cap to the maximum amount of resources each step/job of the pipeline can use
Warning

The --max_cpus, --max_memory, and --max_time parameters do not increase the amount of resources a step of the pipeline uses! They simply prevent Nextflow from requesting more than this threshold, e.g. more than is available on your machine. To learn how to increase the computational resources available to the pipeline, see the central nf-core documentation.
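As a sketch of what such a config could look like, the snippet below raises the limits for a single step and would be supplied to the run with -c resources.config. The process name KRAKEN2_KRAKEN2 is an assumption for illustration; check the Nextflow log or error message for the name of the step that actually needs more resources.

resources.config
process {
    withName: 'KRAKEN2_KRAKEN2' {
        cpus   = 8       // illustrative values; adjust to your hardware
        memory = 32.GB
        time   = 12.h
    }
}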

The pipeline run can be represented (in a simplified format!) as follows

[Simplified diagram of the nf-core/taxprofiler pipeline run]

Tip

We hope you see the benefit of using pipelines for such a task!

Output

In the resulting directory results/ you will find a range of directories.

results/
├── bowtie2
├── centrifuge
├── fastp
├── fastqc
├── kaiju
├── kraken2
├── multiqc
├── pipeline_info
├── run_merging
├── samtools
└── taxpasta

To follow the same order as the command construction above:

  • Pipeline run report is found in multiqc/ and resource statistics in pipeline_info/
  • Short-read QC results are found in fastqc/ and fastp/
  • Host/contaminant removal results are found in bowtie2/ and samtools/
  • Lane merged preprocessed reads are found in run_merging/
  • Raw profiling results are found in kraken2/, centrifuge/, and kaiju/
  • Standardised profiles for all profiling tools and databases are found in taxpasta/
Info

Within each classifier results directory, there will be one directory and ‘combined samples table’ per database.

Info

For read-preprocessing steps, only log files are stored in the results/ directories by default. Refer to the parameters tab of the nf-core/taxprofiler documentation for more options.

The general ‘workflow’ for going through the results is typically to first review the multiqc/multiqc_report.html file to get general statistics of the entire run, particularly of the preprocessing. You would then use the taxon tables in the taxpasta/ directory for downstream analysis, referring to the classifier-specific results directories when you require more detailed information on each classification.
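For example, to get a quick look at one of the standardised tables on the command line (the file name below is illustrative; actual names depend on the profilers and db_name values you used):

ls taxpasta/
head -n 5 taxpasta/kraken2_db1.tsv | column -t -s $'\t'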

Detailed descriptions of all results files can be found in the output tab of the nf-core/taxprofiler documentation.

Clean up

Once you have completed the tutorial, you can run the following command to delete all downloaded and output files.

rm -r taxprofiler-tutorial/
Warning

Don’t forget to change out of the directory above before trying to delete it!

Retrieving databases or building custom databases

Not all taxonomic profilers provide ready-made or default databases. Here we will give brief guidance on how to build custom databases for each supported taxonomic profiler.

You should always consult the documentation of each tool for more information, as here we only provide short, minimal tutorials as quick reference guides (with no guarantee they are up to date).

The following tutorials assume you already have the tool available (e.g. installed locally, or via conda, Docker, etc.), and that you have already downloaded the FASTA files you wish to build into a database.

Bracken custom database

Bracken does not require an independent database but rather builds upon Kraken2 databases. The pre-built Kraken2 databases hosted by Ben Langmead already contain the required files to run Bracken.

However, to build custom databases, you will need a Kraken2 database, the (average) read lengths (in bp) of your sequencing experiment, the k-mer size used to build the Kraken2 database, and Kraken2 available on your machine.

bracken-build -d <KRAKEN_DB_DIR> -k <KRAKEN_DB_KMER_LENGTH> -l <READLENGTH>
Tip

You can speed up database construction by supplying the threads parameter (-t).

Tip

If you do not have Kraken2 in your $PATH you can point to the binary with -x /<path>/<to>/kraken2.

Expected files in database directory
  • bracken
    • hash.k2d
    • opts.k2d
    • taxo.k2d
    • database100mers.kmer_distrib
    • database150mers.kmer_distrib

You can follow the Bracken tutorial for more information.

Centrifuge custom database

To build a custom Centrifuge database, a user needs to download taxonomy files, make a custom seqid2taxid.map and combine the fasta files together.

In total, you need four components: a tab-separated file mapping sequence IDs to taxonomy IDs (--conversion-table), a tab-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree (--taxonomy-tree), a pipe-separated file mapping taxonomy IDs to a name (--name-table), and the reference sequences.

An example of custom seqid2taxid.map:

seqid2taxid.map
  NC_001133.9 4392
  NC_012920.1 9606
  NC_001134.8 4392
  NC_001135.5 4392

Then download the taxonomy files, concatenate your reference FASTA files, and build the index:

centrifuge-download -o taxonomy taxonomy
cat *.{fa,fna} > input-sequences.fna
centrifuge-build -p 4 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna taxprofiler_cf
Expected files in database directory
  • centrifuge
    • <database_name>.<number>.cf
    • <database_name>.<number>.cf
    • <database_name>.<number>.cf
    • <database_name>.<number>.cf

For the Centrifuge custom database documentation, see here.

DIAMOND custom database

To create a custom database for DIAMOND, the user should download and unzip the NCBI taxonomy files, and have the input FASTA files ready.

The download and build steps are as follows:

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
 
## warning: large file!
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
 
## warning: takes a long time!
cat ../raw/*.faa | diamond makedb -d testdb-diamond --taxonmap prot.accession2taxid.FULL.gz --taxonnodes nodes.dmp --taxonnames names.dmp
 
## clean up
rm *dmp *txt *gz *prt *zip
Expected files in database directory
  • diamond
    • <database_name>.dmnd

A detailed description can be found here.

Kaiju custom database

A number of kaiju pre-built indexes for reference datasets are maintained by the developers of kaiju and made available on the kaiju website. These databases can directly be used to run the workflow with Kaiju.

In case the databases above do not contain your desired libraries, you can build a custom kaiju database. To build a kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and the uppercase one-letter codes of the standard 20 amino acids you wish to include.

Warning

The headers of the protein fasta file must be numeric NCBI taxon identifiers of the protein sequences.
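For example, the first lines of a suitably formatted protein FASTA could look like this (the taxon IDs and truncated sequences are purely illustrative):

>1423
MSLLTEVETYVLSIIPSGPLK
>562
MKRISTTITTTITITTGNGAG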

To download the NCBI taxonomy files, please run the following commands:

wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
unzip new_taxdump.zip

To build the database, run the following command (the contents of taxdump must be in the same location where you run the command):

kaiju-mkbwt -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
kaiju-mkfmi proteins
Tip

You can speed up database construction by supplying the threads parameter (-t).

Expected files in database directory
  • kaiju
    • kaiju_db_*.fmi
    • nodes.dmp
    • names.dmp

For the Kaiju database construction documentation, see here.

Kraken2 custom database

A number of database indexes have already been generated and maintained by @BenLangmead Lab, see here. These databases can directly be used to run the workflow with Kraken2 as well as Bracken.

In case the databases above do not contain your desired libraries, you can build a custom Kraken2 database. This requires two components: a taxonomy (consisting of names.dmp, nodes.dmp, and *accession2taxid files), and the FASTA files you wish to include.

To pull the NCBI taxonomy, you can run the following:

kraken2-build --download-taxonomy --db <YOUR_DB_NAME>

You can then add your FASTA files with the following build command.

kraken2-build --add-to-library *.fna --db <YOUR_DB_NAME>

You can repeat this step multiple times to iteratively add more genomes prior to building.
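For example, a simple loop over a directory of genome FASTA files might look like this (the genomes/ directory is an assumption for illustration):

for fna in genomes/*.fna; do
    kraken2-build --add-to-library "$fna" --db <YOUR_DB_NAME>
done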

Once all genomes are added to the library, you can build the database (and optionally clean it up):

kraken2-build --build --db <YOUR_DB_NAME>
kraken2-build --clean --db <YOUR_DB_NAME>

You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database input sheet.
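For example, the corresponding row in your database .csv file could look like the following (mycustomdb is just a placeholder database name):

tool,db_name,db_params,db_path
kraken2,mycustomdb,,<YOUR_DB_NAME>/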

Expected files in database directory
  • kraken2
    • opts.k2d
    • hash.k2d
    • taxo.k2d

You can follow the Kraken2 tutorial for a more detailed description.

KrakenUniq custom database

For any KrakenUniq database, you require: taxonomy files, the FASTA files you wish to include, a seqid2taxid.map file, and a k-mer length.

First you must make a seqid2taxid.map file which is a two column text file containing the FASTA sequence header and the NCBI taxonomy ID for each sequence:

MT192765.1  2697049

Then make a directory (<DB_DIR_NAME>/), containing the seqid2taxid.map file, and your FASTA files in a subdirectory called library/ (these FASTA files can be symlinked). You must then run the taxonomy command on the <DB_DIR_NAME>/ directory, and then build it.

mkdir -p <DB_DIR_NAME>/library
mv seqid2taxid.map <DB_DIR_NAME>/
mv *.fna <DB_DIR_NAME>/library
krakenuniq-download --db <DB_DIR_NAME> taxonomy
krakenuniq-build --db <DB_DIR_NAME> --kmer-len 31
Tip

You can speed up database construction by supplying the threads parameter (--threads) to krakenuniq-build.

Expected files in database directory
  • krakenuniq
    • opts.k2d
    • hash.k2d
    • taxo.k2d
    • database.idx
    • taxDB

Please see the KrakenUniq documentation for more information.

MALT custom database

To build a MALT database, you need the FASTA files to include, and an (unzipped) MEGAN mapping ‘db’ file for your FASTA type. In addition to the input directory, output directory, and the mapping file database, you also need to specify the sequence type (DNA or Protein) with the -s flag.

malt-build -i <path>/<to>/<fasta>/*.{fna,fa,fasta} -a2t <path>/<to>/<map>.db -d <YOUR_DB_NAME>/  -s DNA

You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database input sheet.

Warning

MALT generates very large database files and requires large amounts of RAM. You can reduce both by increasing the step size -st (with a reduction in sensitivity).

Tip

MALT-build can be multi-threaded with -t to speed up building.

Expected files in database directory
  • malt
    • ref.idx
    • taxonomy.idx
    • taxonomy.map
    • index0.idx
    • table0.idx
    • table0.db
    • ref.inf
    • ref.db
    • taxonomy.tre

See the MALT manual for more information.

MetaPhlAn custom database

MetaPhlAn does not allow (easy) construction of custom databases. Therefore we recommend using the prebuilt database of marker genes that is provided by the developers.

To perform this task, ensure that you have installed MetaPhlAn on your machine. Keep in mind that each version of MetaPhlAn aligns with a specific version of the database. Therefore, if you download the MetaPhlAn3 database, remember to include --mpa3 as a parameter for the database in the --databases CSV file.

metaphlan --install --bowtie2db <YOUR_DB_NAME>/

You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database input sheet.

Warning

It is generally not recommended to modify this database yourself, thus this is currently not supported in the pipeline. However, it is possible to customise the existing database by adding your own marker genomes following the instructions here.

Note

If using your own database is relevant for you, please contact the nf-core/taxprofiler developers on the nf-core slack and we will investigate supporting this.

Expected files in database directory
  • metaphlan4
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.pkl
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.fna.bz2
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.1.bt2l
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.2.bt2l
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.3.bt2l
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.4.bt2l
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.rev.1.bt2l
    • mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.rev.2.bt2l
    • mpa_latest

More information on the MetaPhlAn database can be found here.

mOTUs custom database

mOTUs does not provide the ability to construct custom databases. Therefore we recommend using the prebuilt database of marker genes provided by the developers.

Warning

Do not change the directory name of the resulting database if moving it to a central location. The database name db_mOTU/ is hardcoded in the mOTUs tool.

To do this you need to have mOTUs installed on your machine.

motus downloadDB

Then supply the db_mOTU/ path to your nf-core/taxprofiler database input sheet.

Warning

The db_mOTU/ directory may be downloaded to somewhere in your Python’s site-packages directory. You will have to find this yourself, as the exact location varies depending on the installation method.
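One way to track it down is with find; the sketch below assumes either a conda-based install (first command) or a pip install into the currently active Python environment (second command):

find "$CONDA_PREFIX" -type d -name 'db_mOTU' 2>/dev/null
find "$(python -c 'import site; print(site.getsitepackages()[0])')" -type d -name 'db_mOTU' 2>/dev/null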

More information on the mOTUs database can be found here.

ganon custom database

To build a custom ganon database you need two components: the FASTA files you wish to include, and the file extension of those FASTA files.

Tip

You can also use ganon build to download and generate pre-defined databases for you.

You can optionally include your own taxonomy files, however ganon build-custom will download these for you if not provided.

ganon build-custom --threads 4 --input *.fa --db-prefix <YOUR_DB_NAME>

You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database input sheet.

Tip

ganon build-custom can be multi-threaded with -t to speed up building.

Expected files in database directory
  • ganon
    • *.ibf or *.hibf
    • *.tax

More information on custom ganon database construction can be found here.

KMCP custom database

To build a KMCP database you need four components: the FASTA files you wish to include (gzip-compressed, with one genome per file and the reference identifier in the file name), the taxid mapping file, the NCBI taxonomy dump files (names.dmp, nodes.dmp), and the range of k-mers to build the database with.

  1. Compute the k-mers with kmcp compute, providing as input the FASTA files you wish to include.
  2. Build the index for the k-mers with kmcp index, providing as input the output of kmcp compute.

For example:

kmcp compute -k 21 -n 10 -l 150 -O <OUTDIR_NAME>  <path>/<to>/<fasta>/*.{fna,fa,fasta}
kmcp index -I <OUTDIR_NAME>/ --threads 8 --num-hash 1 --false-positive-rate 0.3 --out-dir <YOUR_DB_NAME>/

You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database input sheet.

Expected files in database directory
  • kmcp
    • .unik
    • _info.txt
    • *.kmcp/
    • __db.yml

More information on custom KMCP database construction can be found here.