nf-core/phaseimpute
A bioinformatics pipeline to phase and impute genetic data
Introduction
The nf-core/phaseimpute pipeline is designed to phase and impute genetic data. Its key functionalities include chromosome checking, panel preparation, imputation, simulation, and concordance.
Samplesheet input
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use the --input
parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
Structure
The samplesheet can have as many columns as you desire. However, there is a strict requirement for at least 3 columns to match those defined in the table below.
A final samplesheet file may look something like the one below. This is for 6 samples.
Column | Description |
---|---|
sample | Custom sample name. Spaces in sample names are automatically converted to underscores (_). |
file | Full path to an alignment or variant file. The file must have the extension “.bam” or “.cram” (alignment), or “.vcf” or “.bcf” (variant); variant files may optionally be compressed with bgzip (“.gz”). All files need to have the same extension. |
index | Full path to the index file. The file must have the extension “.bai”, “.crai”, “.csi” or “.tbi”. All files need to have the same extension. |
An example samplesheet has been provided with the pipeline.
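As a minimal sketch (not part of the pipeline), the rules in the table above can be checked before launching a run. The helper name check_samplesheet, the sample names and the paths are all hypothetical, and the accepted extensions are simply transcribed from the table.

```python
import csv
import io

REQUIRED = ["sample", "file", "index"]
FILE_EXT = (".bam", ".cram", ".vcf", ".vcf.gz", ".bcf", ".bcf.gz")
INDEX_EXT = (".bai", ".crai", ".csi", ".tbi")


def extension(path, allowed):
    """Return the allowed extension that the path ends with, or None."""
    return next((ext for ext in allowed if path.endswith(ext)), None)


def check_samplesheet(text):
    """Return a list of problems found in a samplesheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems
    rows = list(reader)
    file_exts = {extension(r["file"], FILE_EXT) for r in rows}
    index_exts = {extension(r["index"], INDEX_EXT) for r in rows}
    if None in file_exts or None in index_exts:
        problems.append("unsupported file or index extension")
    if len(file_exts) > 1 or len(index_exts) > 1:
        problems.append("all files (and all indexes) must share one extension")
    return problems


# Hypothetical samplesheet with two samples.
sheet = """sample,file,index
SAMPLE_1,/path/to/sample1.bam,/path/to/sample1.bam.bai
SAMPLE_2,/path/to/sample2.bam,/path/to/sample2.bam.bai
"""
print(check_samplesheet(sheet))  # an empty list means no problem was found
```

Extra columns are ignored by this check, in line with the statement that the samplesheet can have as many columns as you desire.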
Samplesheet reference panel
You will need to create a samplesheet with information about the reference panel you would like to use. Use the --panel
parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
Structure
A final samplesheet file for the reference panel may look something like the one below. This is for 3 chromosomes.
Column | Description |
---|---|
panel | Name of the reference panel used. |
chr | Name of the chromosome. Use the prefix ‘chr’ if the panel uses the prefix. |
vcf | Full path to a VCF file for that chromosome. File has to be gzipped and have the extension “.vcf.gz”. |
index | Full path to the index for VCF file for that chromosome. File has to be gzipped and have the extension “.tbi”. |
An example samplesheet has been provided with the pipeline.
Samplesheet posfile
You will need a samplesheet with information about the reference panel sites to use --steps [impute,validate]. You can generate this samplesheet with --steps panelprep. Use the --posfile parameter to specify its location. It has to be a comma-separated file with at least 5 columns, and a header row as shown in the examples below.
Structure
A final samplesheet file for the posfile may look something like the one below. This is for 2 chromosomes.
Column | Description |
---|---|
panel | Name of the reference panel used. |
chr | Name of the chromosome. Use the prefix ‘chr’ if the panel uses the prefix. |
vcf | Full path to a VCF containing the sites for that chromosome. File has to be gzipped and have the extension “.vcf.gz”. (Required for validation step) |
index | Full path to the index for the VCF file for that chromosome. File has to be gzipped and have the extension “.tbi”. (Necessary for validation step) |
hap | Full path to “.hap.gz” compressed file containing the reference panel haplotypes in “haps” format. (Required by QUILT) |
legend | Full path to “.legend.gz” compressed file containing the reference panel sites in “legend” format. (Required by QUILT, GLIMPSE1 and STITCH) |
The legend file should have the following structure, similar to that described in the bcftools convert documentation for the --haplegendsample option: the file is space-separated with a header (“id,position,a0,a1”) and one row per SNP, with the following columns:
- Column 1: variant ID, in the form chromosome:position_reference allele_alternate allele
- Column 2: physical position (sorted from smallest to largest)
- Column 3: reference base
- Column 4: alternate base
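The column layout above can be illustrated with a short parser. This is only a sketch under assumptions: the header and rows are taken to be space-separated, the ID is assumed to follow the chromosome:position_ref_alt form, and the two variants are made-up examples rather than real panel sites.

```python
from typing import NamedTuple


class LegendRow(NamedTuple):
    id: str
    position: int
    a0: str  # reference base
    a1: str  # alternate base


def parse_legend(lines):
    """Parse space-separated legend lines (header first) into LegendRow objects."""
    rows = []
    for line in list(lines)[1:]:  # skip the header line
        variant_id, position, a0, a1 = line.split()
        rows.append(LegendRow(variant_id, int(position), a0, a1))
    positions = [r.position for r in rows]
    if positions != sorted(positions):
        raise ValueError("positions must be sorted from smallest to largest")
    return rows


# Hypothetical content of an uncompressed legend file.
example = [
    "id position a0 a1",
    "chr21:16570065_A_G 16570065 A G",
    "chr21:16570211_C_T 16570211 C T",
]
rows = parse_legend(example)
print(rows[0].position, rows[0].a0, rows[0].a1)
```

For a real .legend.gz file, the lines could be read with gzip.open(path, "rt") from the standard library before being passed to parse_legend.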
Reference genome
Remember to use the same reference genome for all the files. You can specify the reference genome using the --genome parameter, or you can specify a custom genome using the --fasta parameter.
Running the pipeline
A quick running example only with the imputation step can be performed as follows:
The typical command for running the pre-processing of the panel and imputation of samples is shown below:
This will launch the pipeline, preparing the reference panel and performing imputation, with the docker
configuration profile. See below for more information about profiles.
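As an illustration only, a command of this kind can be assembled and printed from Python without being executed. The file names and the genome key are hypothetical, build_command is an illustrative helper, and --outdir is assumed to be the standard nf-core output option (it is not described in this section).

```python
import shlex


def build_command(steps, params, profile="docker", revision=None):
    """Assemble a `nextflow run` command line as a list of arguments."""
    cmd = ["nextflow", "run", "nf-core/phaseimpute"]
    if revision:
        cmd += ["-r", revision]          # Nextflow option: single hyphen
    cmd += ["-profile", profile]
    cmd += ["--steps", ",".join(steps)]  # pipeline parameter: double hyphen
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_command(
    steps=["panelprep", "impute"],
    params={
        "input": "samplesheet.csv",  # hypothetical file names
        "panel": "panel.csv",
        "tools": "glimpse1",
        "genome": "GRCh38",          # hypothetical genome key
        "outdir": "results",         # assumed: standard nf-core output option
    },
)
print(shlex.join(cmd))  # prints the command; nothing is launched
```

Building the command as a list keeps each argument separate, so paths containing spaces are quoted correctly when the line is printed.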
Note that the pipeline will create the following files in your working directory:
To facilitate multiple runs of the pipeline with consistent settings without specifying each parameter in the command line, you can use a parameter file. This allows for setting parameters once and reusing them across different executions.
You can provide pipeline settings in a yaml or json file, which can be specified using the -params-file option:
Example of a params.yaml
file:
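A parameter file can also be written programmatically. The sketch below writes a JSON file, which -params-file accepts alongside YAML. The values are hypothetical, and the keys are assumed to be the pipeline parameter names without the leading double hyphen.

```python
import json

# Hypothetical settings; keys are pipeline parameters without the leading "--".
params = {
    "input": "samplesheet.csv",
    "panel": "panel.csv",
    "steps": "panelprep,impute",
    "tools": "glimpse1",
    "genome": "GRCh38",
}

with open("params.json", "w") as handle:
    json.dump(params, handle, indent=2)

# The file can then be passed to Nextflow, for example:
#   nextflow run nf-core/phaseimpute -profile docker -params-file params.json
```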
Do not use -c <file>
to specify parameters as this will result in errors. Custom config files specified with -c
must only be used for tuning process resource specifications, other infrastructural tweaks (such as output directories), or module arguments (args).
You can also generate YAML or JSON files easily using the nf-core/launch tool, which guides you through creating files that can be used directly with -params-file.
Check of the contig names
The pipeline parallelizes the imputation process across contigs. To do so, it uses either the --regions samplesheet or the .fai to extract the genomic regions to process.
Some of those contigs might not be present in the --panel, --posfile, --chunks, --map (column chr) or in the --fasta. In this case the pipeline will warn you that some of the contigs are absent from some of the specified files and will only parallelize over the intersection of all contigs.
Afterwards, the presence of the remaining contigs is checked with the CHECKCHR pipeline to ensure that they are present in each --input and --input_truth file, as well as in the individual reference panel files.
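The intersection logic described above can be sketched in a few lines. This is not the pipeline's implementation; the contig names and the helper common_contigs are hypothetical and only illustrate how contigs missing from any file are dropped with a warning.

```python
def common_contigs(regions, sources):
    """Return the contigs from `regions` that are present in every source."""
    for contig in regions:
        missing = [name for name, contigs in sources.items() if contig not in contigs]
        if missing:
            print(f"warning: {contig} absent from {', '.join(missing)}")
    return [c for c in regions if all(c in contigs for contigs in sources.values())]


# Hypothetical contig names found in each input file.
regions = ["chr20", "chr21", "chr22"]
sources = {
    "--panel": {"chr21", "chr22"},
    "--posfile": {"chr21", "chr22"},
    "--fasta": {"chr20", "chr21", "chr22"},
}
print(common_contigs(regions, sources))
```

Here chr20 triggers a warning because it is absent from two of the files, and only chr21 and chr22 remain for parallelization.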
Running the pipeline
nf-core/phaseimpute can be started at different points in the analysis by setting the --steps flag to one of the available options [simulate, panelprep, impute, validate, all]. You can also run several steps together by listing them, as in --steps panelprep,impute, or you can choose to run all steps sequentially by using --steps all.
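To make the accepted values concrete, here is a small sketch of how a --steps value could be interpreted. The pipeline's own parsing may differ; parse_steps is a hypothetical helper that only mirrors the options listed above.

```python
AVAILABLE = ["simulate", "panelprep", "impute", "validate"]


def parse_steps(value):
    """Turn a --steps value such as "panelprep,impute" into a list of steps."""
    steps = [s.strip() for s in value.split(",") if s.strip()]
    if "all" in steps:
        return list(AVAILABLE)  # "all" runs every step sequentially
    unknown = [s for s in steps if s not in AVAILABLE]
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    return steps


print(parse_steps("panelprep,impute"))
print(parse_steps("all"))
```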
Start with simulation --steps simulate
This step of the pipeline creates synthetic low-coverage input files by downsampling high-density input data. A typical use case is to obtain low-coverage input data from a sequenced sample. This is useful for comparing the imputation results to the truth and evaluating the quality of the imputation. You can skip this step if you already have low-pass genome sequencing data. A sample command for this step is:
The required flags for this mode are:
- --steps simulate: The steps to run.
- --input samplesheet.csv: The samplesheet containing the input sample files in bam or cram format.
- --depth: The final depth of the file [default: 1].
- --genome or --fasta: The reference genome of the samples.
You can find an overview of the results produced by this step in the Output.
Start with panel preparation --steps panelprep
This step pre-processes the reference panel so that it is ready for imputation. A few quality control steps are applied to reference panels. These include actions such as removing multiallelic SNPs and indels and removing certain samples from the reference panel (such as related samples). In addition, chunks are produced which are then used in the imputation step. It is recommended that this step is run once and the produced files are saved, to minimize the cost of reading the reference panel each time. Then, the output files from --steps panelprep can be used as input in the subsequent imputation steps, such as --steps impute.
For starting from panel preparation, the required flags are --steps panelprep and --panel samplesheet_reference.csv.
The required flags for this mode are:
- --steps panelprep: The steps to run.
- --panel reference.csv: The samplesheet containing the reference panel files in vcf.gz format.
- --phase: (optional) Whether the reference panel should be phased (true|false).
- --normalize: (optional) Whether the reference panel needs to be normalized or not (true|false). Default is true.
- --remove_samples: (optional) A comma-separated list of samples to remove from the reference during the normalization process.
- --compute_freq: (optional) Whether the frequency (AC/AN field) for each variant needs to be computed or not (true|false). This can be the case if the frequency is absent from the reference panel or if individuals have been removed.
You can find an overview of the results produced by this step in the Output.
Start with imputation --steps impute
To start from the imputation step, the required flags are:
- --steps impute
- --input input.csv: The samplesheet containing the input sample files in bam, cram, vcf or bcf format.
- --genome or --fasta: The reference genome of the samples.
- --tools [glimpse1, glimpse2, quilt, stitch]: A selection of one or more of the available imputation tools. Each imputation tool has its own set of specific flags and input files. These required files are produced by --steps panelprep and used as input in:
  - --chunks chunks.csv: A samplesheet containing chunks per chromosome. These are produced by --steps panelprep using GLIMPSE1.
  - --posfile posfile.csv: A samplesheet containing a .legend.gz file with the list of positions to genotype per chromosome. These are required by QUILT, STITCH and GLIMPSE1. It can also contain the hap.gz files (required by QUILT). The posfile can be generated with --steps panelprep.
  - --panel panel.csv: A samplesheet containing the post-processed reference panel VCF (required by GLIMPSE1, GLIMPSE2). These files can be obtained with --steps panelprep.
Summary table of required parameters in --steps impute
Tool | --steps impute | --input | --genome or --fasta | --panel | --chunks | --posfile |
---|---|---|---|---|---|---|
GLIMPSE1 | ✅ | ✅ ¹ | ✅ | ✅ | ✅ | ✅ ³ |
GLIMPSE2 | ✅ | ✅ ¹ | ✅ | ✅ | ✅ | ❌ |
QUILT | ✅ | ✅ ² | ✅ | ❌ | ✅ | ✅ ⁴ |
STITCH | ✅ | ✅ ² | ✅ | ❌ | ❌ | ✅ ³ |
¹ Alignment files as well as variant calling format files (i.e. BAM, CRAM, VCF or BCF)
² Alignment files only (i.e. BAM or CRAM)
³ GLIMPSE1 and STITCH: Should be a CSV with columns [panel id, chr, legend]
⁴ QUILT: Should be a CSV with columns [panel id, chr, hap, legend]
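The summary table can be turned into a simple pre-flight check. The sketch below assumes that the first column of the table holds the tool name and that the remaining columns line up with the listed parameters; missing_parameters is a hypothetical helper, not a pipeline function.

```python
# Required parameters per tool, transcribed from the summary table.
REQUIRED = {
    "glimpse1": {"input", "genome_or_fasta", "panel", "chunks", "posfile"},
    "glimpse2": {"input", "genome_or_fasta", "panel", "chunks"},
    "quilt": {"input", "genome_or_fasta", "chunks", "posfile"},
    "stitch": {"input", "genome_or_fasta", "posfile"},
}


def missing_parameters(tools, provided):
    """Return, per selected tool, the required parameters that were not provided."""
    given = set(provided)
    if given & {"genome", "fasta"}:
        given.add("genome_or_fasta")  # either flag satisfies the requirement
    return {
        tool: sorted(REQUIRED[tool] - given)
        for tool in tools
        if REQUIRED[tool] - given
    }


print(missing_parameters(["glimpse1", "quilt"], ["input", "fasta", "posfile", "chunks"]))
```

In this example only GLIMPSE1 is reported, because it additionally needs --panel.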
Here is a representation of how the input files will be processed depending on the input file type and the selected imputation tool.
Argument --batch_size
The --batch_size argument is used to specify the number of samples to be processed at once. This is useful when the number of samples is large and the memory is limited. The default value is 100 but it might need to be adapted to the size of each individual’s data, the number of samples to be processed in parallel and the available memory.
Imputation algorithms are time-consuming. The computational load depends on the number of individuals, the region size and the panel size. Some steps have a fixed computational cost, meaning they run similarly whether you are imputing 2 individuals or 200. By grouping individuals into larger batches, these fixed-cost steps are shared among more samples, reducing the per-individual computational overhead and improving overall efficiency. Batching is therefore recommended. On the other hand, memory usage must be limited when a single process handles a very large number of individuals. The batch_size should therefore be large enough to spread the fixed-cost steps over many individuals, but not so large that memory usage becomes unsustainable.
When the number of samples exceeds the batch size, the pipeline will split the samples into batches and process them sequentially. The files used in each batch are stored in the ${outputdir}/imputation/batch folder.
STITCH and GLIMPSE1 do not support a batch size smaller than the number of samples. This limit avoids introducing a batch effect into the imputation process, as these two tools take the information in the target file into account to perform the imputation. On the other hand, this enhances the accuracy of phasing and imputation, as the target individuals might provide more informative genetic context (e.g. when the target contains related individuals).
To summarize:
- If you have Variant Calling Format files, you should join them into one file and choose either GLIMPSE1 or GLIMPSE2
- If you have alignment files, all the tools are available and the files will be processed in batches of batch_size
- GLIMPSE1 and STITCH might induce a batch effect, so all the samples need to be imputed together
- GLIMPSE2 and QUILT can process the samples in different batches
- If you want to disable this option and run each sample separately you can set --batch_size 1
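The batching rules summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: make_batches and the sample names are hypothetical, and samples are simply grouped in their given order.

```python
def make_batches(samples, batch_size, tool):
    """Split samples into consecutive batches of at most `batch_size` samples."""
    if tool in {"glimpse1", "stitch"} and batch_size < len(samples):
        # These tools use information from all target samples together.
        raise ValueError(f"{tool} needs all samples in one batch")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


samples = [f"sample_{i}" for i in range(1, 6)]  # five hypothetical samples
print(make_batches(samples, 2, "quilt"))   # batches of 2, 2 and 1 samples
print(make_batches(samples, 1, "quilt"))   # one sample per batch
```

With the five samples above, requesting a batch size of 2 for glimpse1 or stitch raises an error, which mirrors the restriction described for those two tools.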
Imputation tools --steps impute --tools [glimpse1, glimpse2, quilt, stitch]
You can choose different software to perform the imputation. In the following sections, the typical commands for running the pipeline with each software are included. Multiple tools can be selected by separating them with a comma (e.g. --tools glimpse1,quilt).
QUILT
QUILT is an R and C++ program for rapid genotype imputation from low-coverage sequence using a large reference panel. The required inputs for this program are bam samples provided in the input samplesheet (--input) and a csv file with the genomic chunks (--chunks).
The csv provided in --posfile must contain at least four columns [panel, chr, hap, legend]. The first column is the name of the panel, the second is the chromosome, followed by the hap and legend files produced by --steps panelprep, unique to each chromosome. The hap and legend files are mandatory to use QUILT.
The csv provided in --chunks must contain two columns [chr, file]. The first column is the chromosome and the file column is a txt file with the chunks produced by GLIMPSE1, unique to each chromosome.
The file column should contain a TXT/TSV obtained from GLIMPSE1 with the following structure.
If you do not have a csv with chunks, you can provide a reference panel and run --steps panelprep, which produces a csv with these chunks that is then used as input for QUILT. You can choose to run both steps sequentially as --steps panelprep,impute or simply collect the files produced by --steps panelprep.
STITCH
STITCH is an R program for low coverage sequencing genotype imputation without using a reference panel. The required inputs for this program are bam samples provided in the input samplesheet (--input) and a .legend.gz file with the list of positions to genotype (--posfile). See Posfile section for more information.
If you do not have a list of positions to genotype, you can provide a reference panel and run --steps panelprep, which produces a tsv with this list.
Otherwise, you can provide your own position file to --steps impute with STITCH using the --posfile parameter.
The csv provided in --posfile must contain three columns [panel, chr, legend]. See Posfile section for more information.
STITCH only handles bi-allelic SNPs.
If you do not have a reference panel and you would like to obtain the posfile you can use the following command:
GLIMPSE1
GLIMPSE1 is a set of tools for phasing and imputation for low-coverage sequencing datasets. Recommended for many samples at >0.5x coverage and small reference panels. GLIMPSE1 works with alignment (i.e. BAM or CRAM) as well as variant (i.e. VCF or BCF) files as input. This is an example command to run this tool from --steps impute:
The csv provided in --posfile must contain three columns [panel, chr, legend]. See Posfile section for more information.
The csv provided in --panel must be prepared with --steps panelprep and must contain four columns [panel, chr, vcf, index].
GLIMPSE2
GLIMPSE2 is a set of tools for phasing and imputation for low-coverage sequencing datasets. This is an example command to run this tool from --steps impute:
Make sure the csv with the input panel is the output from --steps panelprep or has been previously prepared.
Start with validation --steps validate
This step compares a truth VCF to an imputed VCF in order to compute imputation accuracy.
This also needs the frequency of the alleles. They can be computed from the reference panel by running --steps panelprep with --panel and the --compute_freq flag, or provided with --posfile samplesheet.csv.
The required flags for this mode only are:
- --steps validate: The steps to run.
- --input input.csv: The samplesheet containing the input sample files in vcf or bcf format.
- --input_truth input_truth.csv: The samplesheet containing the truth VCF files in vcf format. This can also accept bam or cram files as input but will need the additional legend file in the --posfile to call the variants. The structure of the input_truth.csv is the same as the input.csv file. See Samplesheet input for more information.
- --posfile posfile.csv: A samplesheet containing the panel site information in vcf format for each chromosome.
The csv provided in --posfile must contain four columns [panel, chr, vcf, index]. See Posfile section for more information.
Run all steps sequentially --steps all
This mode runs all the previous steps. This requires several flags:
- --steps all: The steps to run.
- --input input.csv: The samplesheet containing the input sample files in bam or cram format.
- --depth: The final depth of the input file [default: 1].
- --genome or --fasta: The reference genome of the samples.
- --tools [glimpse1, glimpse2, quilt, stitch]: A selection of one or more of the available imputation tools.
- --panel panel.csv: The samplesheet containing the reference panel files in vcf.gz format.
- --remove_samples: (optional) A comma-separated list of samples to remove from the reference.
- --input_truth input_truth.csv: The samplesheet containing the truth VCF files in vcf format. This can also accept bam or cram files as input but will need the additional legend file in the --posfile to call the variants. The structure of the input_truth.csv is the same as the input.csv file. See Samplesheet input for more information.
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you’re running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the nf-core/phaseimpute releases page and find the latest pipeline version - numeric only (e.g. 1.3.1). Then specify this when running the pipeline with -r (one hyphen), e.g. -r 1.3.1. Of course, you can switch to another version by changing the number after the -r flag.
This version number will be logged in reports when you run the pipeline, so that you’ll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports.
To further assist in reproducibility, you can share and re-use parameter files to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
If you wish to share such a profile (for example, by uploading it as supplementary material for an academic publication), make sure NOT to include cluster-specific paths to files or institution-specific profiles.
Core Nextflow arguments
These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker. The order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
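The effect of loading profiles in sequence can be pictured as successive dictionary updates. This is only a conceptual sketch of the statement above, not how Nextflow is implemented; the settings and the custom profile are hypothetical.

```python
# Hypothetical settings carried by three profiles; real profiles set many more options.
PROFILES = {
    "test": {"input": "test_samplesheet.csv", "max_cpus": 2},
    "docker": {"docker.enabled": True},
    "custom": {"max_cpus": 16},
}


def resolve(profile_arg):
    """Apply profiles left to right; later profiles overwrite earlier ones."""
    settings = {}
    for name in profile_arg.split(","):
        settings.update(PROFILES[name])
    return settings


print(resolve("test,custom")["max_cpus"])  # value from "custom", loaded last
print(resolve("custom,test")["max_cpus"])  # value from "test", loaded last
```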
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended, since it can lead to different results on different machines depending on the computer environment.
test
- A profile with a complete configuration for automated testing
- Includes links to test data so needs no other parameters
docker
- A generic configuration profile to be used with Docker
singularity
- A generic configuration profile to be used with Singularity
podman
- A generic configuration profile to be used with Podman
shifter
- A generic configuration profile to be used with Shifter
charliecloud
- A generic configuration profile to be used with Charliecloud
apptainer
- A generic configuration profile to be used with Apptainer
wave
- A generic configuration profile to enable Wave containers. Use together with one of the above (requires Nextflow 24.03.0-edge or later).
conda
- A generic configuration profile to be used with Conda. Please only use Conda as a last resort i.e. when it’s not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files’ contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
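The retry behaviour described above (original request, then 2 x, then 3 x) can be worked through with a short calculation. The base values are hypothetical, and the sketch assumes that CPUs and memory scale in the same way on each attempt.

```python
def requests_per_attempt(base_cpus, base_memory_gb, max_attempts=3):
    """List the (cpus, memory) requested on each attempt: 1x, 2x, then 3x the base."""
    return [(base_cpus * n, base_memory_gb * n) for n in range(1, max_attempts + 1)]


# Hypothetical defaults for one step: 2 CPUs and 6 GB of memory.
for attempt, (cpus, mem) in enumerate(requests_per_attempt(2, 6), start=1):
    print(f"attempt {attempt}: {cpus} CPUs, {mem} GB")
```

If the third attempt in this list also fails, the pipeline execution stops, as stated above.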
Custom Containers
In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the biocontainers or bioconda projects. However, in some cases the version specified by the pipeline may be out of date.
To use a different container from the default container or conda environment specified in a pipeline, please see the updating tool versions section of the nf-core website.
Custom Tool Arguments
A pipeline might not always support every possible argument or option of a particular tool used in the pipeline. Fortunately, nf-core pipelines provide some freedom to users to insert additional parameters that the pipeline does not include by default.
One of the parameters that you might want to modify could be specific to each imputation software. For example, when running the pipeline you may find that you need to lower the coverage to reduce the impact of individual reads (for example in QUILT). This can be achieved by passing the modification to the Nextflow process as an external argument using ext.args. You would customize the run by providing:
To learn how to provide additional arguments to a particular tool of the pipeline, please see the customising tool arguments section of the nf-core website.
nf-core/configs
In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this please can you test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file, associated documentation file (see examples in nf-core/configs/docs), and amending nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs channel.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):