nf-core/differentialabundance
Differential abundance analysis for feature/ observation matrices from platforms such as RNA-seq
Introduction
Differential analysis is a common task in a variety of use cases. In essence, all these use cases entail taking an input matrix containing features (e.g. genes) and observations (e.g. samples), and comparing groups of observations in all or a subset of the features. The feature/ observation language here reflects our hope that this workflow will extend in future to encompass a variety of applications where an assumption of gene vs sample may not be a valid one- though that is the application to which the first release will apply.
With the above in mind, running this workflow requires:
- a set of abundance values. This can be:
- (for RNA-seq or MaxQuant proteomics measurements): a matrix of quantifications with observations by column and features by row
- (for Affymetrix microarrays): a tar’d archive of CEL files
- a description of the observations such as a sample sheet from RNA-seq analysis
- a description of the features, for our initial RNA-seq application this can be simply the GTF file from which gene annotations can be derived. For Affymetrix arrays this can be derived from the array platform annotation package automatically. Skip for MaxQuant. You can also supply your own table.
- a specification of how the matrix should be split, and how the resulting groups should be compared
Observations (samplesheet) input
This may well be the same sample sheet used to generate the input matrix. For example, in RNA-seq this might be the same sample sheet, perhaps derived from fetchngs, that was input to the RNA-seq workflow. It may be necessary to add columns that describe the groups you want to compare. The columns that the pipeline requires are:
- a column listing the sample IDs (must be the same IDs as in the abundance matrix), in the example below it is called ‘sample’. For some study_types, this column might need to be filled in with file names, e.g. when doing an affymetrix analysis.
- one or more columns describing conditions for the differential analysis. In the example below it is called ‘condition’
- optionally one or more columns describing sample batches or similar which you want to be considered in the analysis. In the example below it is called ‘batch’
For example:
The file can be tab or comma separated.
Affymetrix arrays
Abundances for Affy arrays are provided in CEL files within an archive. When creating sample sheets for Affy arrays, it’s crucial to include a column that specifies which file corresponds to each sample. This file column is essential for linking each sample to its corresponding data file, as shown in the example below:
The “file” column in this example is used to specify the data file associated with each sample, which is essential for data analysis and interpretation.
Abundance values
RNA-seq and similar
This is a numeric square matrix file, comma or tab-separated, with a column for every observation, and features corresponding to the supplied feature set. The parameters --observations_id_col
and --features_id_col
define which of the associated fields should be matched in those inputs.
Outputs from nf-core/rnaseq and other tximport-processed results
The nf-core RNAseq workflow incorporates tximport for producing quantification matrices. From version 3.12.2, it additionally provides transcript length matrices which can be directly consumed by DESeq2 to model length bias across samples.
To use this approach, include the transcript lengths file with the raw counts:
Without the transcript lengths, for instance in earlier rnaseq workflow versions, follow the second recommendation in the tximport documentation:
“Use the tximport argument
countsFromAbundance='lengthScaledTPM'
or'scaledTPM'
, then employ the gene-level count matrixtxi$counts
directly in downstream software, a method we call ‘bias corrected counts without an offset’”
This aligns with the gene_counts_length_scaled.tsv or gene_counts_scaled.tsv matrices in the rnaseq workflow.
It is important to note that the documentation advises:
“Do not manually pass the original gene-level counts to downstream methods without an offset.”
So we do not recommend raw counts files such as salmon.merged.gene_counts.tsv
as input for this workflow except where the transcript lengths are also provided.
MaxQuant intensities
This is the proteinGroups.txt file produced by MaxQuant. It is a tab-separated matrix file with a column for every observation (plus additional columns for other types of measurements and information); each row contains these data for a set of proteins. The parameters --observations_id_col
and --features_id_col
define which of the associated fields should be matched in those inputs. The parameter --proteus_measurecol_prefix
defines which prefix is used to extract those matrix columns which contain the measurements to be used. For example, the default LFQ intensity
will indicate that columns like LFQ intensity S1, LFQ intensity S2, LFQ intensity S3 etc. are used (one whitespace is automatically added if necessary).
Affymetrix microarrays
This is an archive of CEL files as frequently found in GEO.
Use SOFT matrices
Alternatively, the user may want to work with SOFT matrices. In this case, setting
--study_type geo_soft_file
and --querygse [GSE study ID]
enables the pipeline to download normalised SOFT matrices automatically (note that even though Affymetrix arrays are also supported in the SOFT matrix track, it is recommended to work from CEL files in this case).
As for other platforms You may subset the metadata features used in reporting etc. e.g. for GPL570 (Affymetrix Plus 2.0 arrays) this could be done with
Full list of features metadata are available on GEO platform pages.
Contrasts file
The contrasts file references the observations file to define groups of samples to compare. For example, based on the sample sheet above we could define contrasts like:
The necessary fields in order are:
id
- an arbitrary identifier, will be used to name contrast-wise output filesvariable
- which column from the observations information will be used to define groupsreference
- the base/ reference level for the comparison. If features have higher values in this group than target they will generate negative fold changestarget
- the target/ non-reference level for the comparison. If features have higher values in this group than the reference they will generate positive fold changes
You can optionally supply:
blocking
- semicolon-delimited, any additional variables (also observation columns) that should be modelled alongside the contrast variableexclude_samples_col
andexclude_samples_values
- the former being a valid column in the samples sheet, the latter a semicolon-delimited list of values in that column which should be used to select samples prior to differential modelling. This is helpful where certain samples need to be exluded prior to analysis of a given contrast.
The file can be tab or comma separated.
Feature annotations
GTF file
This is usually the easiest way to supply annotations for RNA-seq features. It should match the GTF used in nf-core/rnaseq if that workflow was used to produce the input expression matrix. Skip for MaxQuant.
Annotation package identifiers for Affymetrix arrays
For -profile affy
, default behaviour is to derive an annotation table while running the affy/justrma module based on the CDF name discovered there.
Your own features, or no features
To override the above options, you may also supply your own features table as a TSV:
By default, if you don’t provide features, for non-array data the workflow will fall back to attempting to use the matrix itself as a source of feature annotations. For this to work you must make sure to set the features_id_col
, features_name_col
and features_metadata_cols
parameters to the appropriate values, for example by setting them to ‘gene_id’ if that is the identifier column on the matrix. This will cause the gene ID to be used everywhere rather than more accessible gene symbols (as can be derived from the GTF), but the workflow should run. Please use this option for MaxQuant analysis, i.e. do not provide features.
Working with the output R markdown file
The pipeline produces an R markdown file which, if you’re proficient in R, you can use to tweak the report after it’s generated (note- if you need the same customisations repeatedly we would recommend you supply your own template using the report_file
parameter).
To work with R markdown files you will need Rstudio. You will also need to have the ShinyNGS R module installed, since it supplies a lot of the accessory plotting functions etc that you will need. The exact way you will do this may depend on your exact systems, but for example
1. Create a conda environment with Shinyngs and activate it
2. Open RStudio from this environment
For example, on a Mac Terminal:
Now, unzip the report archive, and in RStudio change directory to that location:
Now open the R Markdown file from the RStudio UI, and you should have everything you need to run the various code segments and render the whole document to HTML again if you wish.
Shiny app generation
The pipeline is capable of building, and even deploying (to shinyapps.io) for you a Shiny app built with ShinyNGS. There is a basic example running here which shows what this might look like.
This is enabled with:
… which is the default. By default the app is not deployed, but just output to the output folder under shinyngs_app/[study_name]
.
You have 3 choices in running that application:
- Run locally
- Have shinyapps.io host it for you
- Host on a Shiny server
1. Run locally
You can start the application locally (in an environment where ShinyNGS is installed) like:
This will give you a local URI to access in your browser:
2. Shinyapps.io deployment
shinyapps.io is a hosting solution supplied by Posit (formerly RStudio) which gives you quick and easy access to hosting for Shiny applications. There is a free tier, though you’ll have to pay for features such as authentication and improved resources.
You can upload your app to shinyapps.io youself, or deploy directly to shinyapps.io with this workflow, for which a few things need to happen:
Account and app setup
At https://www.shinyapps.io/, create an account, add a token (via Account -> Tokens) and note your secret and token.
You let Nextflow know about these via secrets:
Configuration
You then need to activate the deployment in your parameters, and supply both your account name and an app name:
With this configuration in place deployment should happen automatically every time you run your workflow.
3. Run your own Shiny server
There is also a Shiny server application, which you can install on your own infrastruture and use to host applications yourself.
Gene set enrichment analysis
Currently, two tools can be used to do gene set enrichment analysis.
GSEA
GSEA tests for differential genes from within a user-provided set of genes; this requires a GMT or GMX file. The following example shows how to enable this:
g
The gprofiler2 package can be used to test which pathways are enriched in the sets of differential genes produced by the the DESeq2 or limma modules. It is an R interface for the g
webtool. In the simplest form, this feature can be enabled with the parameters from the following example:If gene sets have been specified to the workflow via --gene_sets_files
these are used by default. Specifying --gprofiler2_organism
(mmusculus for Mus musculus, hsapiens for Homo sapiens etc.) will override those gene sets with g
--gprofiler2_token
will override both options and use gene sets from a previous g run.
By default the analysis will be run with a background list of genes that passed the abundance filter (i.e. those genes that actually had some expression); see for example https://doi.org/10.1186/s13059-015-0761-7 for why this is advisable. You can provide your own background list with --gprofiler2_background_file background.txt
or if you want to not use any background, set --gprofiler2_background_file false
.
Check the pipeline webpage for a full listing of the relevant parameters.
Running the pipeline
The typical command for running the pipeline is as follows:
This will launch the pipeline with the docker
configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
Hints and tips
- If you don’t like the colors used in the report, try a different
RColorBrewer
palette by changing theexploratory_palette_name
and/ordifferential_palette_name
parameters. - In rare cases, some users have reported issues with DESeq2 using all available cores on a machine, rather than those specified in the process configuration. This can be prevented by setting the
OPENBLAS_NUM_THREADS
environment variable.
Scaling up to large sample numbers
Deactivating reporting processes
A number of workflow steps are not optimised to deal with large sample numbers and will cause the overall workflow to fail. If you have sample numbers on the order of 100s or more, you should disable these processes like:
You will not get the final reporting outcomes of the workflow, but you will get the differential tables produced by DESeq2 or Limma, and the results of any gene seta analysis you have enabled.
Restricting samples considered by DESeq2 or Limma
By default, the DESeq2 or Limma differential modules model all samples at once, rather than just the samples involved in the contrast. This is usually the correct thing to do, but when there are are large numbers of samples involved in each contrast it may be unnecessary, and things can be sped up significantly by setting --differential_subset_to_contrast_samples
. This will remove any samples not relevant to the contrast before the main differential analysis routines are called.
Params files
If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
Pipeline settings can be provided in a yaml
or json
file via -params-file <file>
.
Do not use -c <file>
to specify parameters as this will result in errors. Custom config files specified with -c
must only be used for tuning process resource specifications, other infrastructural tweaks (such as output directories), or module arguments (args).
The above pipeline run specified with a params file in yaml format:
with params.yaml
containing:
You can also generate such YAML
/JSON
files via nf-core/launch.
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you’re running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the nf-core/differentialabundance releases page and find the latest pipeline version - numeric only (eg. 1.3.1
). Then specify this when running the pipeline with -r
(one hyphen) - eg. -r 1.3.1
. Of course, you can switch to another version by changing the number after the -r
flag.
This version number will be logged in reports when you run the pipeline, so that you’ll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports.
To further assist in reproducibility, you can use share and re-use parameter files to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles.
Core Nextflow arguments
These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker
- the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile
is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH
. This is not recommended, since it can lead to different results on different machines dependent on the computer enviroment.
test
- A profile with a complete configuration for automated testing
- Includes links to test data so needs no other parameters
docker
- A generic configuration profile to be used with Docker
singularity
- A generic configuration profile to be used with Singularity
podman
- A generic configuration profile to be used with Podman
shifter
- A generic configuration profile to be used with Shifter
charliecloud
- A generic configuration profile to be used with Charliecloud
apptainer
- A generic configuration profile to be used with Apptainer
conda
- A generic configuration profile to be used with Conda. Please only use Conda as a last resort i.e. when it’s not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files’ contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]
. Use the nextflow log
command to show previous run names.
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
For example, if the nf-core/differentialabundance pipeline is failing after multiple re-submissions of the DESEQ2_DIFFERENTIAL
process due to an exit code of 137
this would indicate that there is an out of memory issue:
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
Custom Containers
In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the biocontainers or bioconda projects. However in some cases the pipeline specified version maybe out of date.
To use a different container from the default container or conda environment specified in a pipeline, please see the updating tool versions section of the nf-core website.
Custom Tool Arguments
A pipeline might not always support every possible argument or option of a particular tool used in pipeline. Fortunately, nf-core pipelines provide some freedom to users to insert additional parameters that the pipeline does not include by default.
To learn how to provide additional arguments to a particular tool of the pipeline, please see the customising tool arguments section of the nf-core website.
nf-core/configs
In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the nf-core/configs
git repository. Before you do this please can you test that the config file works with your pipeline of choice using the -c
parameter. You can then create a pull request to the nf-core/configs
repository with the addition of your config file, associated documentation file (see examples in nf-core/configs/docs
), and amending nfcore_custom.config
to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs
channel.
Azure Resource Requests
To be used with the azurebatch
profile by specifying the -profile azurebatch
.
We recommend providing a compute params.vm_type
of Standard_D16_v3
VMs by default but these options can be changed if required.
Note that the choice of VM size depends on your quota and the overall workload during the analysis. For a thorough list, please refer the Azure Sizes for virtual machines in Azure.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg
flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use screen
/ tmux
or similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted your job scheduler (from where it submits more jobs).
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc
or ~./bash_profile
):