Reference Genomes
Many nf-core pipelines need a reference genome for alignment, annotation, or similar purposes.
Illumina AWS iGenomes
The transcriptome and GTF files in iGenomes are vastly out of date with respect to current annotations from Ensembl. For example, human iGenomes annotations are from Ensembl release 75, while the current Ensembl release is 108. Please consider downloading and using a more recent version of your reference genome, as outlined in the next section.
The GRCh38 iGenomes assembly is from NCBI, not Ensembl, so there are some discrepancies in the way the annotation is defined that may cause problems when running certain pipelines, e.g. nf-core/rnaseq#460. If you would like to use the latest soft-masked Ensembl assembly for GRCh38 instead, please see the next section.
To make the use of reference genomes easier, Illumina has developed a centralised resource called iGenomes. The most commonly used reference genome files are organised in a consistent structure for multiple genomes.
We have uploaded a copy of iGenomes onto AWS S3 and nf-core pipelines are configured to use this by default. AWS iGenomes is hosted by Amazon as part of the Registry of Open Data and is free to use. For more information about AWS iGenomes, see https://ewels.github.io/AWS-iGenomes.
All AWS iGenomes paths are specified in pipelines that support them in conf/igenomes.config. By default, the pipeline will automatically download the required reference files when you run the pipeline and supply an appropriate genome key (e.g. --genome GRCh37). The pipeline will only download what it requires, e.g. the nf-core/rnaseq pipeline will download the star indices specified in conf/igenomes.config but not the bismark index, because that is specific to the nf-core/methylseq pipeline.
Genome asset related parameters required by nf-core pipelines are typically defined in the main script for DSL2 pipelines. When using AWS iGenomes, for convenience, when a reference asset is available for direct download, these parameters are essentially auto-populated based on what is defined in conf/igenomes.config when you provide the --genome parameter.
Downloading reference genome files takes time and bandwidth so, if possible, we recommend storing a local copy of your relevant iGenomes references which is outlined here.
To use a local version of iGenomes, the variable params.igenomes_base must be set to the path of the local iGenomes folder to reflect what is defined in conf/igenomes.config. Additional information on how to set the local --igenomes_base parameter can be found here.
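As a minimal sketch of pointing a pipeline at a local copy, the snippet below writes a params file that sets both the genome key and igenomes_base. The file name local_igenomes.yaml and the path /data/igenomes are placeholders chosen for illustration, not values from the nf-core documentation.

```shell
# Hypothetical location of the local iGenomes copy; adjust to your system.
IGENOMES_BASE="${IGENOMES_BASE:-/data/igenomes}"

# Write a params file that selects a genome key and the local base path.
cat > local_igenomes.yaml <<EOF
genome: "GRCh37"
igenomes_base: "${IGENOMES_BASE}"
EOF

# Show what was written.
cat local_igenomes.yaml
```

The file could then be passed to a run with something like nextflow run nf-core/rnaseq -params-file local_igenomes.yaml <OTHER_PARAMETERS>, or the same value could be given directly as --igenomes_base on the command line.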
To get the version of the annotation used by AWS iGenomes, you can download the README file specified in conf/igenomes.config via the AWS CLI.
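A sketch of such a download is shown here. The bucket layout and README path are assumptions based on the usual GRCh37 entry, so copy the exact readme path from the conf/igenomes.config of your pipeline. The helper name fetch_igenomes_readme is hypothetical, and the download is wrapped in a function so that nothing is fetched until you call it.

```shell
# Assumed S3 layout; check conf/igenomes.config for the exact README path.
BUCKET="s3://ngi-igenomes/igenomes"
README_URI="${BUCKET}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt"

# Hypothetical helper: copy the README into the current directory.
fetch_igenomes_readme() {
    # --no-sign-request lets the AWS CLI read the public bucket anonymously.
    aws s3 cp --no-sign-request "$README_URI" .
}

echo "README location: ${README_URI}"
```

Calling fetch_igenomes_readme and then reading README.txt should show which annotation release the iGenomes files were built from.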
Custom genomes
As mentioned in the section above, most of the required genome assets will be defined in the main script for DSL2 nf-core pipelines. If you are unable to use the AWS iGenomes references, you can still supply reference genome parameters on the command line or via a -params-file in YAML or JSON format.
Using GRCh38 as an example, the latest FASTA and GTF files can be downloaded with a simple bash script.
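The download might look like the sketch below. It assumes the standard Ensembl FTP layout and uses release 108, the release named earlier in this document. The function name download_references is hypothetical, and the files are only fetched when the function is called, since they are large.

```shell
# Ensembl release to fetch; 108 is the release mentioned in this document.
VERSION=108
BASE_URL="ftp://ftp.ensembl.org/pub/release-${VERSION}"

# Soft-masked primary assembly and the matching GTF annotation.
FASTA_URL="${BASE_URL}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
GTF_URL="${BASE_URL}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${VERSION}.gtf.gz"

# Hypothetical helper: download both files into the current directory.
download_references() {
    wget "$FASTA_URL"
    wget "$GTF_URL"
}

echo "FASTA: ${FASTA_URL}"
echo "GTF:   ${GTF_URL}"
```

Run download_references once you are happy with the printed URLs. Changing VERSION selects a different Ensembl release.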
Most genomics nf-core pipelines are able to start from just a FASTA and GTF file and create any downstream reference assets, e.g. genome indices and interval files, as part of the pipeline execution. To avoid recreating these assets every time you run the pipeline, you can use the --save_reference parameter, which saves the indices, interval files etc. in the results directory for you to move to a more central location for re-use in future pipeline runs. The nf-core/rnaseq docs give an example of this.
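As a sketch of a first run that builds and saves the reference assets, the snippet below writes a params file for nf-core/rnaseq. The file name custom_genome.yaml is a placeholder, and the FASTA and GTF file names assume the Ensembl release 108 files for GRCh38 are in the working directory.

```shell
# Write a params file that supplies a custom FASTA and GTF and asks the
# pipeline to keep the reference assets it builds.
cat > custom_genome.yaml <<'EOF'
fasta: "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
gtf: "Homo_sapiens.GRCh38.108.gtf.gz"
save_reference: true
EOF

# Show what was written.
cat custom_genome.yaml
```

The run could then be launched with something like nextflow run nf-core/rnaseq -params-file custom_genome.yaml <OTHER_PARAMETERS>. The same three settings can also be given on the command line as --fasta, --gtf and --save_reference.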
Any downstream reference assets will be published in the results folder. For example, if you ran the nf-core/rnaseq pipeline in the step above with default options, the STAR index will be created and stored in the <RESULTS_DIR>/genome/index/star folder. Once you have moved the reference files to a central location so they are persistently available, you can remove the --save_reference parameter and explicitly override the relevant parameters via the CLI or a -params-file in YAML or JSON format. This avoids re-creating the genome indices and other assets over and over again, which is costly and time-consuming.
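The move-and-reuse step could be sketched as follows for the STAR index. RESULTS_DIR, REF_STORE and the file name reuse_reference.yaml are placeholders, and the FASTA and GTF names assume the Ensembl release 108 files. The copy only happens if a saved index is actually present.

```shell
# Placeholder locations; adjust both to your own setup.
RESULTS_DIR="${RESULTS_DIR:-results}"
REF_STORE="${REF_STORE:-$HOME/references/GRCh38}"

# Copy the saved STAR index to the central store, if a previous run made one.
if [ -d "$RESULTS_DIR/genome/index/star" ]; then
    mkdir -p "$REF_STORE"
    cp -r "$RESULTS_DIR/genome/index/star" "$REF_STORE/"
fi

# Params file for later runs: no save_reference, explicit index path instead.
cat > reuse_reference.yaml <<EOF
fasta: "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
gtf: "Homo_sapiens.GRCh38.108.gtf.gz"
star_index: "${REF_STORE}/star"
EOF

cat reuse_reference.yaml
```

Later runs can pass -params-file reuse_reference.yaml so the existing index is used rather than rebuilt.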
Using Refgenie for genome management
You can also use the reference genome manager Refgenie with nf-core pipelines.
- Install and initialize refgenie following the official documentation.
A file required by nf-core containing refgenie genome assets will be automatically created at ~/.nextflow/nf-core/refgenie_genomes.config. An includeConfig statement pointing to this new config file will be appended to the ~/.nextflow/config file. This file is automatically loaded by Nextflow during every pipeline run (see the Nextflow documentation).
To use a new reference genome or asset, fetch it via normal refgenie usage (refgenie pull); the nf-core plugin will automatically update the refgenie_genomes.config configuration file.
This file should never be edited manually, as it is overwritten during each refgenie command.
- Pull all the reference assets that you may need to run the pipeline.
Asset paths are automatically added to ~/.nextflow/nf-core/refgenie_genomes.config, included in ~/.nextflow/config and available to every pipeline run.
The file format mimics the igenomes.config file that comes with many nf-core pipelines, with asset paths grouped under a genome key.
For example, after pulling assets for a genome named t7, the genome key that you’ll use to launch the pipeline is t7.
You can also use custom assets.
- Run your pipeline, specifying the required genome.
Please refer to Refgenie documentation for further information.
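To make the steps above more concrete, the sketch below writes an illustrative copy of what the generated config might look like for the t7 genome. The asset names and paths are assumptions for illustration. The real file is produced by refgenie at ~/.nextflow/nf-core/refgenie_genomes.config, so the example deliberately writes to a different file name.

```shell
# Illustrative copy only: never write to the real refgenie_genomes.config.
cat > refgenie_genomes.example.config <<'EOF'
// Example of the generated layout; paths are hypothetical placeholders.
params {
  genomes {
    't7' {
      bowtie2_index = "/path/to/refgenie/t7/bowtie2_index/default"
      fasta         = "/path/to/refgenie/t7/fasta/default"
    }
  }
}
EOF

cat refgenie_genomes.example.config
```

In this scenario the assets would have been fetched with a command such as refgenie pull t7/bowtie2_index, and the pipeline would be launched with --genome t7.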
How to handle Refgenie assets having different aliases than nf-core
A Refgenie server contains assets with established aliases, which can differ from the ones required by an nf-core pipeline. For example, the Ensembl GTF asset on the default Refgenie server is called ensembl_gtf, while the same asset is called gtf in nf-core pipelines.
To address this, you can create a file with translations that will be used to generate the genomes configuration file with the appropriate names.
- When you init Refgenie, you provide a path to a genomes_config.yml file with the argument -c or by setting the environment variable $REFGENIE. In the same directory, create a file called alias_translations.yaml.
- alias_translations.yaml must contain the equivalences of asset aliases in YAML format. Keys correspond to the names of Refgenie server aliases, while values correspond to the names of the respective nf-core pipeline aliases. For example, the entry ensembl_gtf: gtf maps the server alias ensembl_gtf to the nf-core alias gtf.
- Pull your assets as usual. The asset aliases will be translated automatically in your ~/.nextflow/nf-core/refgenie_genomes.config file.
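The translation file described above could be created as in this sketch. It assumes $REFGENIE points at your genomes_config.yml and falls back to the current directory if the variable is unset. The star_index to star mapping is an additional illustrative entry and should be checked against the aliases your pipeline expects.

```shell
# Location of the Refgenie genome config; falls back to the current directory.
REFGENIE="${REFGENIE:-./genomes_config.yml}"

# The translation file must sit in the same directory as the genome config.
ALIAS_FILE="$(dirname "$REFGENIE")/alias_translations.yaml"

# Keys are Refgenie server aliases, values are nf-core aliases.
cat > "$ALIAS_FILE" <<'EOF'
ensembl_gtf: gtf
star_index: star
EOF

cat "$ALIAS_FILE"
```

After this, pulling assets with refgenie pull should write the translated names into the generated config.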