nf-core/bactmap
A mapping-based pipeline for creating a phylogeny from bacterial whole genome sequences
Introduction
This pipeline maps short reads (usually Illumina) to a bacterial reference genome (usually of the same species, and the closest high-quality genome available). For this purpose you will require a set of paired read files (support for single read files is planned for a future version) and a reference genome in FASTA format. The read file pairs are specified in a samplesheet. Please note that although the pipeline can handle multiple contigs within the reference sequence, it is recommended that plasmid records are removed, leaving only the chromosomal records (usually one chromosome in most bacteria), since plasmids are often acquired horizontally and evolve at a different rate from the chromosome.
Samplesheet input
You will need to create a samplesheet file with information about the samples in your experiment before running the pipeline. Use the --input parameter to specify its location. It has to be a comma-separated file with 3 columns and a header row, as shown in the example below.
An example samplesheet has been provided with the pipeline.
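A minimal sketch of such a samplesheet, assuming the standard nf-core column names (sample, fastq_1, fastq_2):

```csv
sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
```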
Running the pipeline
The typical command for running the pipeline is as follows:
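A minimal sketch (check the exact parameter names against the schema of the release you are running, e.g. with nextflow run nf-core/bactmap --help):

```bash
nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile docker
```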
Please note that although the reference is named chromosome.fasta in this example command, it could be named anything; however, it is recommended that the reference only include chromosomal sequence (see the note in the introduction).
This will launch the pipeline with the docker configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
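A typical sketch (the results directory name is configurable):

```console
work            # Directory containing the Nextflow working files
results         # Finished results (configurable)
.nextflow_log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
```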
Optional parameters
By default the pipeline does not perform the following steps, but they can be enabled by adding the appropriate parameter to the command line:
- trim reads: --trim
- remove recombination using Gubbins: --remove_recombination
- build a RapidNJ tree: --rapidnj
- build a FastTree tree: --fasttree
- build an IQ-TREE tree: --iqtree
- build a RAxML-NG tree: --raxmlng
By default the pipeline subsamples reads to an estimated depth of 100x. This can be turned off using the --subsampling_off parameter. The desired depth of coverage can be changed using the --subsampling_depth_cutoff parameter.
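For example, to subsample to an estimated 50x instead of the default 100x (parameter names as in the example command above):

```bash
nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile docker \
    --subsampling_depth_cutoff 50
```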
The steps in the pipeline have sensible defaults, but complete control of the arguments passed to the software tools can be achieved by overriding the software arguments found in the modules.config file with a custom config. For example, the default args for IQ-TREE, which specify using ModelFinder to find the best-fit model, could be overridden to specify a particular model. In the modules.config file these are specified as:
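A sketch of the kind of entry involved (the exact keys and argument values vary between releases, and the build value here is hypothetical, so check your modules.config for the real contents):

```groovy
params {
    modules {
        'iqtree' {
            // SH-aLRT and ultrafast bootstrap support values, plus ModelFinder
            args        = '-alrt 1000 -bb 1000 -m MFP'
            build       = true      // hypothetical value, see modules.config
            publish_dir = 'iqtree'
        }
    }
}
```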
These could be overridden by specifying a custom config file on the command line with -c user.config.
Example contents:
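A sketch of what user.config might contain, assuming the params.modules structure shown above:

```groovy
params {
    modules {
        'iqtree' {
            // Replace the default args: fixed GTR+G model, no branch support values
            args = '-m GTR+G'
        }
    }
}
```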
This will specify that IQ-TREE should no longer produce SH-aLRT and ultrafast bootstrap branch support values and will use the GTR+G model. The other options (build and publish_dir) will remain the same. Note that overriding the args does not turn the step on: to build an IQ-TREE phylogeny this step would still need to be enabled, either by adding the --iqtree parameter to the Nextflow command line or by adding iqtree = true to the user config file within the params block.
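For example, in user.config:

```groovy
params {
    iqtree = true
}
```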
The steps are described in more detail below, along with their default parameters.
Comprehensive description of the steps
- Reference sequence indexing: the reference sequence is indexed using bwa index.
- Read trimming (optional, if params.trim is set): reads are trimmed using fastp. The default process configuration is found in the modules.config file and can be overridden as described above. Please note that the default adapters are found in the adapters.fas file. These can be supplemented by specifying the path to a different file and updating the adapter_fasta value in a config file specified using -c as described above (see the example config after this list).
- Subsample reads (runs by default; skipped if params.subsampling_off is set): reads are subsampled to the depth of coverage set by the --subsampling_depth_cutoff parameter using rasusa, based on the genome size estimated with mash sketch. The default process configurations are found in the modules.config file and can be overridden as described above.
- Map reads: reads are mapped to the indexed reference genome using bwa mem to produce a BAM file. The default process configuration is found in the modules.config file and can be overridden as described above.
- Sort reads: BAM files are sorted using samtools.
- Call variants: variants are called using bcftools mpileup. A minimum base quality of 20 is used for pre-filtering, and the following fields are included in the resulting VCF file: FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR. A haploid and multiallelic model is assumed. These defaults are found in the modules.config file and can be overridden as described above.
- Filter variants: variants in the VCF file are filtered using bcftools filter. The default process configuration is found in the modules.config file and can be overridden as described above. After this process a filtered VCF file will be produced with a row for each position in the reference genome, each with a value in the FILTER column of either PASS or LowQual.
- Create pseudogenome: the filtered VCF is used to create a pseudogenome based on the reference genome using the vcf2pseudogenome.py script. At a position where the VCF row has PASS in FILTER, the sample's base will be either ref or alt, and the appropriate base will be encoded at that position. At a position where the VCF row has LowQual in FILTER, the base is uncertain and will be encoded as an N character. Missing data will be encoded as a - character.
- Align pseudogenomes: the pseudogenomes of all samples and the original reference sequence are concatenated to produce a flush alignment in which the sequence for every sample at every position in the original reference is one of {G,A,T,C,N,-}. Only those sequences that are high quality (based on the number of non-GATC bases) will be included. The threshold for this is set in the default process configuration found in the modules.config file and can be overridden as described above. This alignment can be used for other downstream processes such as phylogeny generation.
- Determine number of constant sites: optionally, this alignment can be processed to produce a phylogeny as part of this pipeline. The number of constant sites in the alignment will be determined using snp-sites.
- Remove recombination (optional, if params.remove_recombination is set): regions likely to have been acquired by horizontal transfer and recombination, which would therefore perturb the true phylogeny, are removed using gubbins. This should only be run on sets of samples that are closely related, and not, for example, on a set of samples whose diversity spans that of the entire species. The default process configuration is found in the modules.config file and can be overridden as described above.
- Build tree(s): depending on the params set, run 0-4 tree-building algorithms:
  - --rapidnj: build a neighbour-joining phylogeny using rapidnj. The default process configuration is found in the modules.config file and can be overridden as described above.
  - --fasttree: build an approximately-maximum-likelihood phylogeny using FastTree. The default process configuration is found in the modules.config file and can be overridden as described above.
  - --iqtree: build a maximum-likelihood phylogeny using IQ-TREE. The default process configuration is found in the modules.config file and can be overridden as described above.
  - --raxmlng: build a maximum-likelihood phylogeny using RAxML Next Generation (RAxML-NG). The default process configuration is found in the modules.config file and can be overridden as described above.
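As an example of the kind of override described in the read trimming step above, a custom config passed with -c could, assuming the same params.modules structure as earlier, point fastp at a different adapter file:

```groovy
params {
    modules {
        'fastp' {
            adapter_fasta = '/path/to/extra_adapters.fas'  // hypothetical path
        }
    }
}
```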
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you’re running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
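```bash
nextflow pull nf-core/bactmap
```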
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the nf-core/bactmap releases page and find the latest version number - numeric only (e.g. 1.3.1). Then specify this when running the pipeline with -r (one hyphen), e.g. -r 1.3.1.
This version number will be logged in reports when you run the pipeline, so that you’ll know what you used when you look back in the future.
Core Nextflow arguments
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images via HTTPS hosted by the Galaxy project, and Conda, which downloads and installs software locally from Bioconda.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility; however, when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.
- docker - A generic configuration profile to be used with Docker
- singularity - A generic configuration profile to be used with Singularity
- podman - A generic configuration profile to be used with Podman
- shifter - A generic configuration profile to be used with Shifter
- charliecloud - A generic configuration profile to be used with Charliecloud
- conda - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- test - A profile with a complete configuration for automated testing. Includes links to test data so needs no other parameters.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
For example, if the nf-core/rnaseq pipeline is failing after multiple re-submissions of the STAR_ALIGN process due to an exit code of 137, this would indicate an out-of-memory issue. For an in-depth explanation of how to fix these types of errors please see here.
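For example, to give that process more memory, one could supply a custom config with -c such as the following (the process selector is standard Nextflow; the value is illustrative):

```groovy
process {
    withName: STAR_ALIGN {
        memory = 100.GB
    }
}
```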
Tool-specific options
For the ultimate flexibility, we have implemented and are using Nextflow DSL2 modules in a way where it is possible for both developers and users to change tool-specific command-line arguments (e.g. providing an additional command-line argument to a process) as well as publishing options (e.g. saving files produced by the process that aren’t saved by default by the pipeline). In the majority of instances, as a user you won’t have to change the default options set by the pipeline developer(s), however, there may be edge cases where creating a simple custom config file can improve the behaviour of the pipeline if for example it is failing due to a weird error that requires setting a tool-specific parameter to deal with smaller / larger genomes.
For an in-depth explanation of how to make tool-specific module changes see here
nf-core/configs
In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly with the same settings, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, please test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file and an associated documentation file (see examples in nf-core/configs/docs), amending nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs
channel.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg flag launches Nextflow in the background, detached from your terminal, so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
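```bash
NXF_OPTS='-Xms1g -Xmx4g'
```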