Tutorials
This page provides a range of tutorials to help give you a bit more guidance on how to set up nf-core/taxprofiler runs in the wild.
Simple Tutorial
In this tutorial we will run you through a simple setup of a small nf-core/taxprofiler run. It assumes that you have basic knowledge of metagenomic classification input and output files.
Preparation
Hardware
The datasets used should be small enough to run on your own laptop or a single server node.
If you wish to use an HPC cluster or cloud, and don’t wish to use an ‘interactive’ session submitted to your scheduler, please see the nf-core documentation on how to make a relevant config file.
You will need internet access and at least 1.5 GB of hard drive space.
Software
The tutorial assumes you are on a Unix-based operating system, and have already installed Nextflow as well as a software environment system such as Conda, Docker, or Singularity/Apptainer.
The tutorial will use Docker; however, you can simply replace references to `docker` with `conda`, `singularity`, or `apptainer` accordingly.
Data
First we will make a directory to run the whole tutorial in.
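A minimal sketch of this step (the directory name is arbitrary):

```bash
mkdir taxprofiler-tutorial
cd taxprofiler-tutorial/
```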
We will use very small short-read (pre-subset) metagenomes used for testing. nf-core/taxprofiler accepts FASTQ or FASTA files as input formats, however we will use FASTQ here as the more common format in taxonomic classification. You can download these metagenomes with the following command.
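As an illustration, something along these lines, where the URLs and library names are placeholders for wherever the test metagenomes are hosted (for example, the nf-core test-datasets repository):

```bash
# Two paired-end libraries of the same sample (placeholder URLs and names)
curl -O https://example.com/path/to/ERR0000001_1.fastq.gz
curl -O https://example.com/path/to/ERR0000001_2.fastq.gz
curl -O https://example.com/path/to/ERR0000002_1.fastq.gz
curl -O https://example.com/path/to/ERR0000002_2.fastq.gz
```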
In this tutorial we will demonstrate running with three different profilers, and in one of those cases, running the same database twice but with different parameters. The database consists of two genomes of species known to be present in the metagenomes. You can download the databases for Kraken2, Centrifuge, and Kaiju with the following commands.
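For example, again with placeholder URLs and file names standing in for the real download locations:

```bash
curl -O https://example.com/path/to/kraken2_db.tar.gz
curl -O https://example.com/path/to/centrifuge_db.tar.gz
curl -O https://example.com/path/to/kaiju_db.tar.gz
```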
To demonstrate that nf-core/taxprofiler can also accept databases as uncompressed folders, we can extract one of them.
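For example, extracting the Kaiju archive into a directory (file names as in the sketch above):

```bash
mkdir kaiju_db/
tar -xzf kaiju_db.tar.gz -C kaiju_db/
```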
You must provide these databases pre-built to the pipeline; nf-core/taxprofiler neither comes with default databases nor can it generate databases for you. For guidance on how to build databases, see the Retrieving databases or building custom databases tutorial.
Finally, an important step of any metagenomic classification is to remove contamination. Contamination can come from many places, typically from the host of a host-associated sample, however it can also be introduced during laboratory processing of samples. A common contaminant in Illumina sequencing is a spike-in control of the PhiX virus genome, which we can download with the following command.
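For example (the URL is a placeholder; any copy of the PhiX174 reference genome in FASTA format will do):

```bash
curl -o phix.fasta.gz https://example.com/path/to/phiX174_genome.fasta.gz
```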
Preparing Input
Sample sheet
You provide the sequencing data FASTQ files to nf-core/taxprofiler via an input ‘sample sheet’ `.csv` file.
This is a 6-column table that includes sample and library names, instrument platform, and paths to the sequencing data.
Open a text editor, and create a file called `samplesheet.csv`.
Copy and paste the following lines into the file and save it.
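A sketch of such a sample sheet, assuming the placeholder file names from the download step above (adjust the sample name, run accessions, and paths to your own data):

```csv
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
sample1,ERR0000001,ILLUMINA,ERR0000001_1.fastq.gz,ERR0000001_2.fastq.gz,
sample1,ERR0000002,ILLUMINA,ERR0000002_1.fastq.gz,ERR0000002_2.fastq.gz,
```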
Here we have specified two libraries of the same sample, that they were sequenced on Illumina platforms, and the paths to the FASTQ files.
If you had placed your FASTQ files elsewhere, you would give the full path (i.e., with relevant directories) in the `fastq_1`, `fastq_2`, and `fasta` columns.
Database sheet
For the database(s), you also supply these via a `.csv` file.
This 4 (or 5) column table contains the tool the database has been built for, a database name, the parameters you wish reads to be queried against the given database with, an optional column to distinguish between short- and long-read databases, and a path to a `.tar.gz` archive file or a directory containing the database files.
Open a text editor, and create a file called `database.csv`.
Copy and paste the following CSV content into the file and save it.
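A sketch of such a database sheet, using the placeholder database names from the download step above and assuming a `db_type` column for the short/long-read distinction (the exact header names and accepted values can differ between pipeline versions, so check the usage documentation):

```csv
tool,db_name,db_params,db_type,db_path
kraken2,kraken2_db1,,short;long,kraken2_db.tar.gz
centrifuge,centrifuge_db1,,short,centrifuge_db.tar.gz
centrifuge,centrifuge_db2,--min-hitlen 30,short,centrifuge_db.tar.gz
kaiju,kaiju_db1,,short;long,kaiju_db/
```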
You can see here we have specified the Centrifuge database twice, to allow comparison of different settings.
We have also specified different profiling parameters depending on whether a database is for short-read or long-read use.
If we don’t specify this, the pipeline will assume all databases (and their settings specified in `db_params`!) are applicable to both short- and long-read data.
Note that each database of the same tool has a unique name.
Furthermore, while the Kraken2 and Centrifuge databases have been supplied as `.tar.gz` archives, the Kaiju database has been supplied as a directory.
Running the pipeline
Now that we have the sequencing reads (in FASTQ format), the databases (directory or `.tar.gz`), and a reference genome (FASTA, optionally gzipped), we can now run them with the pipeline. The following command will perform short-read quality control, remove contaminant reads, merge multiple libraries for each sample, run the three profilers, and finally generate standardised profiles.
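A sketch of what such a command could look like, assuming the file names used above. The parameter names follow the option descriptions in the breakdown below, but can vary between releases (e.g. `--save_runmerged_reads`, `--max_cpus`), so check the parameters documentation for your pipeline version before running:

```bash
nextflow run nf-core/taxprofiler -r <PIPELINE_VERSION> -profile docker \
    --input samplesheet.csv --databases database.csv --outdir ./results \
    --perform_shortread_qc \
    --perform_shortread_hostremoval --hostremoval_reference phix.fasta.gz \
    --perform_runmerging --save_runmerged_reads \
    --run_kraken2 --run_centrifuge --run_kaiju \
    --run_profile_standardisation \
    --max_cpus 2 --max_memory '6.GB'
```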
With all Docker containers pre-downloaded, this run took 2 minutes and 31 seconds on a laptop running Ubuntu 22.04.2 with 32 GB RAM and 16 CPUs. If you are running nf-core/taxprofiler for the first time, expect this command to take longer as Nextflow will have to download each software container for each step of the pipeline.
To break down each line of the command:
- Tell Nextflow to run nf-core/taxprofiler with the particular version and using the Docker container system
- Specify the input and outputs, i.e., the paths to the `samplesheet.csv`, the `database.csv`, and the directory where to save the results
- Turn on basic quality control of input reads: adapter clipping, length filtering, etc.
- Turn on the removal of host or contaminant reads, and specify the path to the reference genome for this
- Turn on run merging, i.e., combine the processed input reads of the multiple libraries of each sample, and save these reads (e.g. for downstream use)
- Turn on the different taxonomic profiling tools you wish to use
- Turn on profile standardisation and multi-sample taxon tables
- (Optional) provide a cap on the maximum amount of resources each step/job of the pipeline can use
Pipeline runs occasionally fail because a particular step requests more resources than you have on your system. To avoid these failures, you can tell Nextflow to cap pipeline-step resource requests using a list called `resourceLimits` specified in a Nextflow config file. These should represent the maximum possible resources of a machine or node. To learn how to increase the computational resources given to the pipeline, see the central nf-core documentation.
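As an illustration, a minimal custom config along these lines (the values are placeholders for your machine’s actual maximums) could be supplied to the run with `-c`:

```groovy
// resources.config -- illustrative limits only
process {
    resourceLimits = [
        cpus: 8,
        memory: 32.GB,
        time: 24.h
    ]
}
```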
The pipeline run can be represented (in a simplified format!) as: read preprocessing → host/contaminant removal → run merging → taxonomic profiling → profile standardisation.
We hope you see the benefit of using pipelines for such a task!
Output
In the resulting directory `results/` you will find a range of directories.
To follow the same order as the command construction above:
- Pipeline run report is found in `multiqc/` and resource statistics in `pipeline_info/`
- Short-read QC results are found in `fastqc/` and `fastp/`
- Host/contaminant removal results are found in `bowtie2/` and `samtools/`
- Lane-merged preprocessed reads are found in `run_merging/`
- Raw profiling results are found in `kraken2/`, `centrifuge/`, and `kaiju/`
- Standardised profiles for all profiling tools and databases are found in `taxpasta/`
Within each classifier results directory, there will be one directory and ‘combined samples table’ per database.
For read-preprocessing steps, only log files are stored in the `results/` directories by default. Refer to the parameters tab of the nf-core/taxprofiler documentation for more options.
The general ‘workflow’ of going through the results will typically be to review the `multiqc/multiqc_report.html` file to get general statistics of the entire run, particularly of the preprocessing.
You would then use the taxon tables in the `taxpasta/` directory for downstream analysis, referring to the classifier-specific results directories when you require more detailed information on each classification.
Detailed descriptions of all results files can be found in the output tab of the nf-core/taxprofiler documentation.
Clean up
Once you have completed the tutorial, you can run the following command to delete all downloaded and output files.
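A sketch, assuming the arbitrary directory name used at the start of the tutorial:

```bash
cd ..
rm -rf taxprofiler-tutorial/
```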
Don’t forget to change out of the directory above before trying to delete it!
Retrieving databases or building custom databases
Not all taxonomic profilers provide ready-made or default databases. Here we will give brief guidance on how to build custom databases for each supported taxonomic profiler.
You should always consult the documentation of each tool for more information, as here we only provide short, minimal tutorials as quick reference guides (with no guarantee they are up to date).
The following tutorials assume you already have the tool available (e.g. installed locally, or via Conda, Docker, etc.), and that you have already downloaded the FASTA files you wish to build into a database.
Bracken custom database
Bracken does not require an independent database but rather builds upon Kraken2 databases. The pre-built Kraken2 databases hosted by Ben Langmead already contain the required files to run Bracken.
However, to build custom databases, you will need a Kraken2 database, the (average) read lengths (in bp) of your sequencing experiment, the k-mer size used to build the Kraken2 database, and Kraken2 available on your machine.
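As an illustration, a build command along these lines, where the database path, read length, and k-mer size are placeholders for your own values:

```bash
# -d: existing Kraken2 database, -k: k-mer size used to build it, -l: read length of your experiment
bracken-build -d <KRAKEN2_DB_DIR> -k 35 -l 150
```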
You can speed up database construction by supplying the threads parameter (`-t`).
If you do not have Kraken2 in your `$PATH`, you can point to the binary with `-x /<path>/<to>/kraken2`.
Expected files in database directory
bracken
hash.k2d
opts.k2d
taxo.k2d
database100mers.kmer_distrib
database150mers.kmer_distrib
You can follow the Bracken tutorial for more information.
Centrifuge custom database
To build a custom Centrifuge database, a user needs to download taxonomy files, make a custom `seqid2taxid.map`, and combine the FASTA files together.
In total, you need four components: a tab-separated file mapping sequence IDs to taxonomy IDs (`--conversion-table`), a tab-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree (`--taxonomy-tree`), a pipe-separated file mapping taxonomy IDs to a name (`--name-table`), and the reference sequences.
An example of a custom `seqid2taxid.map`:
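The sequence IDs and taxon IDs below are purely illustrative:

```text
NC_001422.1	10847
NZ_EXAMPLE01.1	1280
```

A sketch of the remaining steps, assuming your reference FASTA files end in `.fna` (here `centrifuge-download` fetches `nodes.dmp` and `names.dmp` into a `taxonomy/` directory; the database name is a placeholder):

```bash
# Download the NCBI taxonomy files
centrifuge-download -o taxonomy taxonomy

# Combine the reference FASTA files and build the index
cat *.fna > input-sequences.fna
centrifuge-build --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna <YOUR_DB_NAME>
```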
Expected files in database directory
centrifuge
<database_name>.<number>.cf
<database_name>.<number>.cf
<database_name>.<number>.cf
<database_name>.<number>.cf
For the Centrifuge custom database documentation, see here.
DIAMOND custom database
To create a custom database for DIAMOND, the user should download and unzip the NCBI taxonomy files and the input FASTA files.
The download and build steps are as follows:
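A sketch of those steps, assuming your protein FASTA files end in `.faa` (file and database names are placeholders):

```bash
# Download and unpack the NCBI taxonomy files
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

# Concatenate the input FASTA files and build the DIAMOND database with taxonomy information
cat *.faa > proteins.faa
diamond makedb --in proteins.faa --db <YOUR_DB_NAME> \
    --taxonmap prot.accession2taxid.gz \
    --taxonnodes nodes.dmp --taxonnames names.dmp
```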
Expected files in database directory
diamond
<database_name>.dmnd
A detailed description can be found here.
Kaiju custom database
A number of Kaiju pre-built indexes for reference datasets are maintained by the developers of Kaiju and made available on the Kaiju website. These databases can directly be used to run the workflow with Kaiju.
In case the databases above do not contain your desired libraries, you can build a custom Kaiju database. To build a Kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and the uppercase characters of the standard 20 amino acids you wish to include.
The headers of the protein FASTA file must be the numeric NCBI taxon identifiers of the protein sequences.
To download the NCBI taxonomy files, please run the following commands:
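A minimal sketch of this step:

```bash
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
```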
To build the database, run the following command (the contents of taxdump must be in the same location where you run the command):
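A sketch of the build step, assuming a hypothetical `proteins.faa` whose headers are NCBI taxon IDs:

```bash
# Build the Burrows-Wheeler transform, then the FM-index used by Kaiju
kaiju-mkbwt -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
kaiju-mkfmi proteins
```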
You can speed up database construction by supplying the threads parameter (`-t`).
Expected files in database directory
kaiju
kaiju_db_*.fmi
nodes.dmp
names.dmp
For the Kaiju database construction documentation, see here.
Kraken2 custom database
A number of database indexes have already been generated and maintained by @BenLangmead Lab, see here. These databases can directly be used to run the workflow with Kraken2 as well as Bracken.
In case the databases above do not contain your desired libraries, you can build a custom Kraken2 database. This requires two components: a taxonomy (consisting of `names.dmp`, `nodes.dmp`, and `*accession2taxid` files), and the FASTA files you wish to include.
To pull the NCBI taxonomy, you can run the following:
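For example (the database name is a placeholder):

```bash
kraken2-build --download-taxonomy --db <YOUR_DB_NAME>
```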
You can then add your FASTA files with the following build command.
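For example, assuming a hypothetical `genome.fna`:

```bash
kraken2-build --add-to-library genome.fna --db <YOUR_DB_NAME>
```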
You can repeat this step multiple times to iteratively add more genomes prior to building.
Once all genomes are added to the library, you can build the database (and optionally clean it up):
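For example:

```bash
kraken2-build --build --db <YOUR_DB_NAME>
# Optionally remove intermediate files to save space
kraken2-build --clean --db <YOUR_DB_NAME>
```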
You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database input sheet.
Expected files in database directory
kraken2
opts.k2d
hash.k2d
taxo.k2d
You can follow the Kraken2 tutorial for a more detailed description.
KrakenUniq custom database
For any KrakenUniq database, you require: taxonomy files, the FASTA files you wish to include, a `seqid2taxid.map` file, and a k-mer length.
First you must make a `seqid2taxid.map` file, which is a two-column text file containing the FASTA sequence header and the NCBI taxonomy ID for each sequence:
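For example (the sequence headers and taxon IDs are illustrative only):

```text
NC_001422.1	10847
NZ_EXAMPLE01.1	1280
```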
Then make a directory (`<DB_DIR_NAME>/`) containing the `seqid2taxid.map` file, and your FASTA files in a subdirectory called `library/` (these FASTA files can be symlinked). You must then run the `taxonomy` command on the `<DB_DIR_NAME>/` directory, and then build it.
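A sketch of these two steps, assuming `krakenuniq-download` is used to fetch the taxonomy and a k-mer length of 31:

```bash
# Fetch and preprocess the NCBI taxonomy into the database directory
krakenuniq-download --db <DB_DIR_NAME> taxonomy

# Build the database
krakenuniq-build --db <DB_DIR_NAME> --kmer-len 31
```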
You can speed up database construction by supplying the threads parameter (`--threads`) to `krakenuniq-build`.
Expected files in database directory
krakenuniq
opts.k2d
hash.k2d
taxo.k2d
database.idx
taxDB
Please see the KrakenUniq documentation for more information.
MALT custom database
To build a MALT database, you need the FASTA files to include, and an (unzipped) MEGAN mapping ‘db’ file for your FASTA type. In addition to the input directory, output directory, and the mapping file database, you also need to specify the sequence type (DNA or Protein) with the `-s` flag.
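A sketch of the build command, where `<MEGAN_MAPPING_FILE>.db` stands in for the unzipped MEGAN mapping file you downloaded and the database name is a placeholder:

```bash
malt-build -i *.fasta -s DNA -d <YOUR_DB_NAME>/ -mdb <MEGAN_MAPPING_FILE>.db
```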
You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database input sheet.
MALT generates very large database files and requires large amounts of RAM. You can reduce both by increasing the step size `-st` (with a reduction in sensitivity).
MALT-build can be multi-threaded with `-t` to speed up building.
Expected files in database directory
malt
ref.idx
taxonomy.idx
taxonomy.map
index0.idx
table0.idx
table0.db
ref.inf
ref.db
taxonomy.tre
See the MALT manual for more information.
MetaPhlAn custom database
MetaPhlAn does not allow (easy) construction of custom databases. Therefore we recommend using the prebuilt database of marker genes provided by the developers.
To perform this task, ensure that you have installed MetaPhlAn on your machine. Keep in mind that each version of MetaPhlAn aligns with a specific version of the database. Therefore, if you download the MetaPhlAn3 database, remember to include `--mpa3` as a parameter for the database in the `--databases` CSV file.
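A minimal sketch of the download step (the target directory is a placeholder; add `--index` if you need a specific database version):

```bash
metaphlan --install --bowtie2db <YOUR_DB_NAME>/
```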
You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database input sheet.
It is generally not recommended to modify this database yourself, and thus this is currently not supported in the pipeline. However, it is possible to customise the existing database by adding your own marker genomes, following the instructions here.
If using your own database is relevant for you, please contact the nf-core/taxprofiler developers on the nf-core slack and we will investigate supporting this.
Expected files in database directory
metaphlan4
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.pkl
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.fna.bz2
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.1.bt2l
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.2.bt2l
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.3.bt2l
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.4.bt2l
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.rev.1.bt2l
mpa_vJan21_TOY_CHOCOPhlAnSGB_202103.rev.2.bt2l
mpa_latest
More information on the MetaPhlAn database can be found here.
mOTUs custom database
mOTUs does not provide the ability to construct custom databases. Therefore we recommend using the prebuilt database of marker genes provided by the developers.
Do not change the directory name of the resulting database if moving it to a central location: the database name `db_mOTU/` is hardcoded in the mOTUs tool.
To do this, you need to have `mOTUs` installed on your machine.
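A minimal sketch of the download step:

```bash
motus downloadDB
```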
Then supply the `db_mOTU/` path to your nf-core/taxprofiler database input sheet.
The `db_mOTU/` directory may be downloaded to somewhere in your Python `site-packages` directory. You will have to find this yourself, as the exact location varies depending on the installation method.
More information on the mOTUs database can be found here.
ganon custom database
To build a custom ganon database you need two components: the FASTA files you wish to include, and the file extension of those FASTA files.
You can also use `ganon build` to download and generate pre-defined databases for you.
You can optionally include your own taxonomy files; however, `ganon build-custom` will download these for you if not provided.
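A sketch of the build command, assuming gzip-compressed FASTA files with the extension `fna.gz` in a hypothetical `genomes/` directory (flag names may differ between ganon versions, so check `ganon build-custom --help`):

```bash
ganon build-custom --input genomes/ --input-extension fna.gz \
    --db-prefix <YOUR_DB_NAME>
```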
You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database input sheet.
`ganon build-custom` can be multi-threaded with `-t` to speed up building.
Expected files in database directory
ganon
`*.ibf` or `*.hibf`
`*.tax`
More information on custom ganon database construction can be found here.
KMCP custom database
To build a KMCP database, you need four components: the FASTA files you wish to include (gzip-compressed, with one genome per file and the reference identifier in the file name), a taxid mapping file, the NCBI taxonomy dump files (`names.dmp`, `nodes.dmp`), and the range of k-mers to build the database with.
- You need to compute the k-mers with `kmcp compute`, providing as input the FASTA files you wish to include.
- You need to build an index of the k-mers with `kmcp index`, providing as input the output of `kmcp compute`.
For example:
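The directory names, k-mer size, and file names below are placeholders; consult the KMCP documentation for the full set of options:

```bash
# Compute k-mers from the gzip-compressed genome FASTA files (one genome per file)
kmcp compute --in-dir genomes/ --out-dir genomes-k21 --kmer 21

# Build the index from the computed k-mers
kmcp index --in-dir genomes-k21/ --out-dir <YOUR_DB_NAME>/

# Place the taxid mapping and taxonomy dump files alongside the index
cp taxid.map names.dmp nodes.dmp <YOUR_DB_NAME>/
```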
You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database input sheet.
Expected files in database directory
kmcp
.unik
_info.txt
*.kmcp/
__db.yml
More information on custom KMCP database construction can be found here.