nf-core/sarek
Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
Version 2.7.1. The latest stable release is 3.5.0.
Define where the pipeline should find input data and save output data.
Path to input file(s).
string
Use this to specify the location of your input TSV file on mapping, prepare_recalibration, recalibrate, variant_calling and Control-FREEC steps (multiple files can be specified with quotes).
It can also be used to specify the path to a directory on mapping step with a single germline sample only.
Alternatively, it can be used to specify the path to VCF input file on annotate step (multiple files can be specified with quotes).
Starting step.
string
Only one step.
NB step can be specified with no concern for case, or the presence of - or _.
The output directory where the results will be saved.
string
./results
Option used for most of the pipeline
Tools to use for variant calling and/or for annotation.
string
null
Multiple tools can be specified, separated by commas.
Germline variant calling can currently only be performed with the following variant callers:
- FreeBayes, HaplotypeCaller, Manta, mpileup, Strelka, TIDDIT
Somatic variant calling can currently only be performed with the following variant callers:
- ASCAT, Control-FREEC, FreeBayes, Manta, MSIsensor, Mutect2, Strelka
Tumor-only somatic variant calling can currently only be performed with the following variant callers:
- Control-FREEC, Manta, mpileup, Mutect2, TIDDIT
Annotation is done using snpEff, VEP, or even both consecutively.
NB As Sarek uses bgzip and tabix to compress and index annotated VCF files, it expects VCF files to be sorted.
DNAseq, DNAscope and TNscope are only available with --sentieon.
NB tools can be specified with no concern for case, or the presence of - or _.
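The naming rule above (case is ignored, as are - and _) can be read as a simple normalization step. The following is only a sketch of that reading, not Sarek's actual code; the function names normalize_name and parse_tools are hypothetical.

```python
def normalize_name(name):
    """Lower-case a tool or step name and drop '-' and '_' characters."""
    return name.strip().lower().replace("-", "").replace("_", "")


def parse_tools(value):
    """Split a comma-separated --tools value into normalized tool names."""
    return [normalize_name(item) for item in value.split(",") if item.strip()]


# Different spellings of the same tool collapse to one key.
print(parse_tools("HaplotypeCaller,Control-FREEC,snpEff"))
print(normalize_name("Control_FREEC") == normalize_name("controlfreec"))
```

Under this reading, Control-FREEC, control_freec and ControlFREEC all select the same tool.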
Disable usage of intervals.
boolean
Intervals are parts of the genome, chopped up to speed up preprocessing and variant calling.
Estimate interval size.
number
1000
Intervals are parts of the genome, chopped up to speed up preprocessing and variant calling.
Enable Sentieon if available.
boolean
Sentieon is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.
NB Adds the following tools for the --tools option: DNAseq, DNAscope and TNscope.
Disable specified QC and Reporting tools.
string
null
Multiple tools can be specified, separated by commas.
NB --skip_qc BaseRecalibrator only skips saving the reports.
NB --skip_qc MarkDuplicates does not skip MarkDuplicates, but prevents the collection of duplicate metrics, which slows down performance.
NB tools can be specified with no concern for case, or the presence of - or _.
Target BED file for whole exome or targeted sequencing.
string
This parameter does not imply that the workflow is running alignment or variant calling only for the supplied targets.
Instead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file.
Adding every exon as an interval in case of WES can generate >200K processes or jobs, many more forks, and a similar number of directories in the Nextflow work directory.
Furthermore, primers and/or baits are not 100% specific (certainly not for MHC, KIR, etc.), so there are quite likely going to be reads mapping to multiple locations.
If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is overkill, it is actually better to change the reference itself.
The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a BED file containing targets for all steps using the --target_bed option.
The workflow will pick up these intervals, and activate any --exome flag in any tools that allow it to process deeper coverage.
It is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow.
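As an illustration of the padding advice, here is a minimal sketch that widens BED-style targets by a fixed number of bases and merges any targets that overlap afterwards. The function name pad_and_merge and the 100 bp pad are assumptions made for the example; the text does not recommend a specific pad size, and dedicated tools can do the same job.

```python
def pad_and_merge(intervals, pad):
    """Pad 0-based, end-exclusive intervals by `pad` bp and merge overlaps.

    `intervals` is a list of (chrom, start, end) tuples. Starts are clamped
    at 0; ends are NOT clamped to the chromosome length in this sketch.
    """
    padded = sorted(
        (chrom, max(0, start - pad), end + pad) for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


targets = [("chr1", 50, 200), ("chr1", 350, 500), ("chr2", 1000, 1100)]
for chrom, start, end in pad_and_merge(targets, pad=100):
    print(f"{chrom}\t{start}\t{end}")
```

The padded targets would then be written to the BED file passed with --target_bed.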
Run Trim Galore.
boolean
Use this to perform adapter trimming with Trim Galore.
cf Trim Galore User Guide
Remove bp from the 5' end of read 1.
integer
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
Remove bp from the 5' end of read 2.
integer
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Apply the --nextseq=X option, to trim based on quality after removing poly-G tails.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Save trimmed FastQ file intermediates
boolean
Specify how many reads should be contained in the split FastQ file
number
Use the Nextflow splitFastq operator to specify how many reads should be contained in the split FASTQ file.
cf splitfastq documentation
Specify aligner to be used to map reads to reference genome.
string
WARNING Current indices for bwa in AWS iGenomes are not compatible with bwa-mem2.
Use --bwa=false to have Sarek build them automatically.
WARNING BWA-mem2 is in active development.
Sarek might not be able to request the right amount of resources for it at the moment.
We recommend using pre-built indexes.
Establish values for GATK MarkDuplicates memory consumption
string
-Xms4000m -Xmx7g
Enable usage of GATK Spark implementation
boolean
Save Mapped BAMs
boolean
Skip GATK MarkDuplicates
boolean
This parameter will also save the mapped BAMs, to enable restarting from the prepare_recalibration step.
Overwrite ASCAT ploidy
string
null
Requires that --ascat_purity is set.
Overwrite ASCAT purity
string
null
Requires that --ascat_ploidy is set.
Overwrite Control-FREEC coefficientOfVariation
number
0.05
Overwrite Control-FREEC contaminationAdjustment
boolean
Define known contamination value for Control-FREEC
string
null
Overwrite Control-FREEC ploidy
number
2
Overwrite Control-FREEC window size
number
It is recommended to use a window size of 0 for exome data
Generate g.vcf output from GATK HaplotypeCaller
boolean
Will not use Manta candidateSmallIndels for Strelka
boolean
Not recommended by Best Practices
Panel-of-normals VCF (bgzipped) for GATK Mutect2 / Sentieon TNscope
string
Without a PON, there will be no calls marked PASS in the FILTER field; only an unfiltered VCF is written.
It is recommended to make your own PON, as it depends on sequencer and library preparation.
For tests in iGenomes there is a dummy PON file in the Annotation/GermlineResource directory, but it should not be used as a real PON file.
NB PON file should be bgzipped.
Index of panel-of-normals (PON) VCF
string
If none provided, will be generated automatically from the PON bgzipped VCF file.
Do not analyze soft clipped bases in the reads for GATK Mutect2
boolean
Uses the --dont-use-soft-clipped-bases parameter with GATK.
If provided, UMIs steps will be run to extract and annotate the reads with UMI and create consensus reads
boolean
This part of the pipeline uses fgbio to convert the FASTQ files into an unmapped BAM, where reads are tagged with the UMIs extracted from the FASTQ sequences.
In order to allow the correct tagging, the UMI sequence must be contained in the read sequence itself, and not in the FASTQ filename.
Following this step, the unmapped BAM is aligned and reads are then grouped based on mapping position and UMI tag.
Finally, reads in the same groups are collapsed to create a consensus read.
To create consensus, we have chosen to use the adjacency method.
cf fgbio
cf UMIs, the problem, the solution and the proof
NB In order for the correct tagging to be performed, a read structure needs to be specified with --read_structure1 and --read_structure2.
When processing UMIs, a read structure should always be provided for each of the fastq files.
string
null
If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
The read structure follows a format adopted by different tools and described in the fgbio documentation
When processing UMIs, a read structure should always be provided for each of the fastq files.
string
null
If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
The read structure follows a format adopted by different tools and described in the fgbio documentation
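To make the read structure format more concrete, the sketch below parses a string such as +T into its segments. It assumes the convention described in the fgbio documentation: each segment is a length followed by an operator (T for template, B for sample barcode, M for molecular barcode, S for skip), and only the last segment may use + for "all remaining bases". The function parse_read_structure is hypothetical and is not part of fgbio or Sarek.

```python
import re

# One segment: a length (digits) or '+', followed by an operator letter.
# The operator set T/B/M/S is an assumption taken from the fgbio docs.
_SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Return a list of (length, operator) pairs; length is None for '+'."""
    segments = []
    pos = 0
    while pos < len(structure):
        match = _SEGMENT.match(structure, pos)
        if match is None:
            raise ValueError(f"invalid read structure at position {pos}: {structure!r}")
        length, operator = match.groups()
        pos = match.end()
        if length == "+":
            if pos != len(structure):
                raise ValueError("'+' is only allowed on the last segment")
            segments.append((None, operator))
        else:
            segments.append((int(length), operator))
    if not segments:
        raise ValueError("empty read structure")
    return segments


print(parse_read_structure("+T"))    # a read with no UMI: template only
print(parse_read_structure("8M+T"))  # 8 UMI bases, then template
```

For a read carrying an 8-base UMI at its start, the value passed to --read_structure1 would be 8M+T under this convention.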
Specify from which tools Sarek should look for VCF files to annotate
string
null
Only for step annotate
Enable the use of cache for annotation
boolean
This also disables the use of the Sarek snpEff- and VEP-specific containers for annotation.
To be used with --snpeff_cache and/or --vep_cache.
Enable CADD cache.
boolean
Path to CADD InDels file.
string
null
Path to CADD InDels index.
string
null
Path to CADD SNVs file.
string
null
Path to CADD SNVs index.
string
null
Enable the use of the VEP GeneSplicer plugin.
boolean
Path to snpEff cache
string
null
To be used with --annotation_cache
Path to VEP cache
string
null
To be used with --annotation_cache
Options for the reference genome files
Name of iGenomes reference.
string
If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. --genome GRCh38.
See the nf-core website docs for more details.
Path to ASCAT loci file.
string
Path to ASCAT GC correction file.
string
Path to BWA mem indices.
string
NB If none provided, will be generated automatically from the FASTA reference.
Path to chromosomes folder.
string
Path to chromosomes length file.
string
Path to dbsnp file.
string
Path to dbsnp index.
string
NB If none provided, will be generated automatically from the dbsnp file.
Path to FASTA dictionary file.
string
NB If none provided, will be generated automatically from the FASTA reference.
Path to FASTA genome file.
string
If you have no genome reference available, the pipeline can build one using a FASTA file. This requires additional time and resources, so it's better to use a pre-built index if possible.
Path to FASTA reference index.
string
NB If none provided, will be generated automatically from the FASTA reference
Path to GATK Mutect2 Germline Resource File
string
The germline resource VCF file (bgzipped and tabixed) needed by GATK4 Mutect2 is a collection of calls that are likely present in the sample, with allele frequencies.
The AF info field must be present.
You can find a smaller, stripped gnomAD VCF file (most of the annotation is removed and only calls marked PASS are kept) in the AWS iGenomes Annotation/GermlineResource folder.
Path to GATK Mutect2 Germline Resource Index
string
NB If none provided, will be generated automatically from the Germline Resource file, if provided
Path to intervals file
string
To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces.
The intervals are chromosomes cut at their centromeres (so each chromosome arm is processed separately), plus additional unassigned contigs.
We are ignoring the hs37d5 contig that contains concatenated decoy sequences.
Parts of preprocessing and variant calling are done by these intervals, and the different resulting files are then merged.
This can parallelize processes, and reduce wall-clock time significantly.
The calling intervals can be defined using a .list or a BED file.
A .list file contains one interval per line in the format chromosome:start-end (1-based coordinates).
A BED file must be a tab-separated text file with one interval per line.
There must be at least three columns: chromosome, start, and end (0-based coordinates).
Additionally, the score column (the fifth column) of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval.
The fourth column remains unused.
|chr1|10000|207666|NA|47.3|
This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
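The relation between the two coordinate conventions can be sketched as follows: a BED start is 0-based and the end is exclusive, so only the start shifts by one when writing the chromosome:start-end form. The function name bed_line_to_region is hypothetical.

```python
def bed_line_to_region(line):
    """Convert one tab-separated BED line to a 1-based 'chrom:start-end' string."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chromosome, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # 0-based, end-exclusive -> 1-based, end-inclusive: only the start moves.
    return f"{chrom}:{start + 1}-{end}"


print(bed_line_to_region("chr1\t10000\t207666\tNA\t47.3"))  # chr1:10001-207666
```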
The runtime estimate is used in two different ways.
First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that need to be spawned.
Second, the jobs with largest processing time are started first, which reduces wall-clock time.
If no runtime is given, a time of 1000 nucleotides per second is assumed.
Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
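The two uses of the runtime estimate can be sketched roughly as below. This is not Sarek's implementation: the grouping threshold max_chunk_seconds and the function names are assumptions made for the example, and only the default of 1000 nucleotides per second comes from the text.

```python
NUCLEOTIDES_PER_SECOND = 1000  # default rate stated in the text


def estimated_seconds(start, end, score=None):
    """Use the BED score if given, else fall back to the default rate."""
    if score is not None:
        return float(score)
    return (end - start) / NUCLEOTIDES_PER_SECOND


def schedule(intervals, max_chunk_seconds):
    """Group consecutive short intervals, then order the groups longest-first.

    `intervals` is a list of (chrom, start, end, score_or_None) in file order.
    `max_chunk_seconds` is a hypothetical cap on the runtime of one group.
    """
    chunks = []
    current, current_time = [], 0.0
    for chrom, start, end, score in intervals:
        seconds = estimated_seconds(start, end, score)
        if current and current_time + seconds > max_chunk_seconds:
            chunks.append((current_time, current))
            current, current_time = [], 0.0
        current.append((chrom, start, end))
        current_time += seconds
    if current:
        chunks.append((current_time, current))
    # Start the most expensive groups first.
    chunks.sort(key=lambda chunk: chunk[0], reverse=True)
    return chunks


intervals = [
    ("chr1", 0, 10000, None),
    ("chr1", 10000, 30000, None),
    ("chr1", 30000, 207666, 47.3),
    ("chr2", 0, 5000, None),
]
for seconds, group in schedule(intervals, max_chunk_seconds=40):
    print(seconds, group)
```

Here the first two short intervals end up in one job, and the interval with the 47.3 second estimate is launched first.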
If you prefer, you can specify the full path to your reference genome when you run the pipeline:
NB If none provided, will be generated automatically from the FASTA reference
NB Use --no_intervals to disable automatic generation
Path to known indels file
string
Path to known indels file index
string
NB If none provided, will be generated automatically from the known indels file, if provided
Path to Control-FREEC mappability file
string
snpEff DB version
string
snpEff species
string
If you use AWS iGenomes or a local resource with genomes.conf, this has already been set for you appropriately.
VEP cache version
string
Save built references
boolean
Directory / URL base for iGenomes references.
string
s3://ngi-igenomes/igenomes
Directory / URL base for genomes references.
string
null
All files are supposed to be in the same folder
Do not load the iGenomes reference config.
boolean
Do not load igenomes.config when running the pipeline.
You may choose this option if you observe clashes between custom parameters and those supplied in igenomes.config.
This option will load the genomes.config file instead.
NB You can then set the genome to custom and specify at least a FASTA genome file.
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
You're reading it.
Method used to save pipeline results to output directory.
string
The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
Boolean whether to validate parameters against the schema at runtime
boolean
true
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
This works exactly as with --email, except emails are only sent if the workflow is not successful.
Send plain-text email instead of HTML.
boolean
Set to receive plain-text e-mails instead of HTML-formatted ones.
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
If a file generated by the pipeline exceeds the threshold, it will not be attached.
Do not use coloured log outputs.
boolean
Set to disable colourful command line output and live life in monochrome.
Path to MultiQC custom config file.
string
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Name of sequencing center to be displayed in BAM file
string
null
It will be in the CN field
Show all params when using --help
boolean
Set the top limit for requested resources for any single job.
integer
8
Should be an integer e.g. --cpus 7
Use to set memory for a single CPU.
string
7 GB
Should be a string in the format integer-unit e.g. --single_cpu_mem '8.GB'
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
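The two patterns listed for --max_memory and --max_time can be tried out directly. The sketch below copies them verbatim from this page and checks a few values against them; the helper names is_valid_memory and is_valid_time are hypothetical, and the pipeline's own schema validation may behave differently in detail.

```python
import re

# Patterns copied verbatim from the parameter descriptions above.
MEMORY_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME_PATTERN = re.compile(r"^(\d+\.?\s*(s|m|h|day)\s*)+$")


def is_valid_memory(value):
    """True if `value` looks like '128.GB', '8GB' or '7 GB'."""
    return MEMORY_PATTERN.match(value) is not None


def is_valid_time(value):
    """True if `value` looks like '240.h' or '2.h'."""
    return TIME_PATTERN.match(value) is not None


for candidate in ["128.GB", "7 GB", "8 gigabytes"]:
    print(candidate, is_valid_memory(candidate))
for candidate in ["240.h", "2.h", "two hours"]:
    print(candidate, is_valid_time(candidate))
```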
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Provide git commit id for custom Institutional configs hosted at nf-core/configs. This was implemented for reproducibility purposes. Default: master.
## Download and use config file with following git commit id
--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the custom_config_base option. For example:
## Download and unzip the config files
cd /path/to/my/configs
wget https://github.com/nf-core/configs/archive/master.zip
unzip master.zip
## Run the pipeline
cd /path/to/my/data
nextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/
Note that the nf-core/tools helper package has a download command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.
Institutional configs hostname.
string
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string