nf-core/sarek
Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
Version 2.7
Define where the pipeline should find input data and save output data.
Path to input file(s).
string
Use this to specify the location of your input TSV file for the mapping, prepare_recalibration, recalibrate, variant_calling and Control-FREEC steps (multiple files can be specified with quotes).
It can also be used to specify the path to a directory for the mapping step with a single germline sample only.
Alternatively, it can be used to specify the path to VCF input file(s) for the annotate step (multiple files can be specified with quotes).
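For example, a minimal sketch of typical --input usage (profile and file paths are hypothetical):
## Start from the mapping step with a TSV samplesheet
nextflow run nf-core/sarek -profile docker --input samples.tsv
## Annotate existing VCF files
nextflow run nf-core/sarek -profile docker --step annotate --input "results/VariantCalling/*/*.vcf.gz"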
Starting step.
string
Only one step can be specified.
NB step can be specified with no concern for case, or the presence of - or _.
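For instance, restarting a previous run at a later step might look like the sketch below (the TSV path is hypothetical; Sarek writes such files under the results directory):
## Restart from the recalibrate step using a TSV written by a previous run
nextflow run nf-core/sarek -profile docker --step recalibrate --input results/Preprocessing/TSV/duplicates_marked_no_table.tsv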
The output directory where the results will be saved.
string
./results
Option used for most of the pipeline
Tools to use for variant calling and/or for annotation.
string
Multiple separated with commas.
Germline variant calling can currently only be performed with the following variant callers:
- FreeBayes, HaplotypeCaller, Manta, mpileup, Strelka, TIDDIT
Somatic variant calling can currently only be performed with the following variant callers:
- ASCAT, Control-FREEC, FreeBayes, Manta, MSIsensor, Mutect2, Strelka
Tumor-only somatic variant calling can currently only be performed with the following variant callers:
- Control-FREEC, Manta, mpileup, Mutect2, TIDDIT
Annotation is done using snpEff, VEP, or even both consecutively.
NB As Sarek will use bgzip and tabix to compress and index the annotated VCF files, it expects input VCF files to be sorted.
DNAseq, DNAscope and TNscope are only available with --sentieon.
NB tools can be specified with no concern for case, or the presence of - or _.
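As an illustration (the tool selection below is an assumption, not a recommendation):
## Germline small variants plus structural variants, annotated with VEP
nextflow run nf-core/sarek -profile docker --input samples.tsv --tools HaplotypeCaller,Manta,VEP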
Disable usage of intervals.
boolean
Intervals are chunks of the genome, used to speed up preprocessing and variant calling.
Estimate interval size.
number
1000
Intervals are chunks of the genome, used to speed up preprocessing and variant calling.
Enable Sentieon if available.
boolean
Sentieon is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.
NB Adds the following tools for the --tools option: DNAseq, DNAscope and TNscope.
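A minimal sketch, assuming a working Sentieon installation and licence is available on the compute environment:
## Use Sentieon DNAseq for germline calling
nextflow run nf-core/sarek -profile docker --input samples.tsv --sentieon --tools DNAseq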
Disable specified QC and Reporting tools.
string
Multiple tools can be specified, separated by commas.
NB --skip_qc BaseRecalibrator is actually just not saving the reports.
NB --skip_qc MarkDuplicates does not skip MarkDuplicates, but prevents the collection of duplicate metrics that slows down performance.
NB tools can be specified with no concern for case, or the presence of - or _.
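For example (the tools listed are illustrative):
## Skip FastQC and do not save BaseRecalibrator reports
nextflow run nf-core/sarek -profile docker --input samples.tsv --skip_qc FastQC,BaseRecalibrator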
Target BED file for whole exome or targeted sequencing.
string
This parameter does not imply that the workflow is running alignment or variant calling only for the supplied targets.
Instead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file.
Adding every exon as an interval in case of WES can generate >200K processes or jobs, many more forks, and a similar number of directories in the Nextflow work directory.
Furthermore, primers and/or baits are not 100% specific (certainly not for MHC and KIR, etc.), so it is quite likely that there will be reads mapping to multiple locations.
If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is overkill, it is actually better to change the reference itself.
The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a BED file containing targets for all steps using the --target_bed option.
The workflow will pick up these intervals, and activate the --exome flag in any tools that allow it to process deeper coverage.
It is advised to pad the variant calling regions (exons or targets) to some extent before submitting to the workflow.
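A minimal WES sketch (the BED path is hypothetical and should point to padded capture targets):
## Whole-exome run restricted to the capture targets
nextflow run nf-core/sarek -profile docker --input samples.tsv --tools HaplotypeCaller --target_bed targets.padded.bed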
Run Trim Galore.
boolean
Use this to perform adapter trimming with Trim Galore.
cf Trim Galore User Guide
Remove bp from the 5' end of read 1.
integer
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
Remove bp from the 5' end of read 2.
integer
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Apply the --nextseq=X option, to trim based on quality after removing poly-G tails.
integer
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
Save trimmed FastQ file intermediates
boolean
Specify how many reads should be contained in the split FastQ file
number
Use the Nextflow splitFastq operator to specify how many reads should be contained in the split FASTQ file.
cf splitFastq documentation
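For instance, combining adapter trimming with FASTQ splitting might look like this sketch (flag names are assumed from the parameter descriptions above; values are illustrative):
## Trim adapters, clip 3 bp from the 5' end of both reads, and split FASTQ files into 50000-read chunks
nextflow run nf-core/sarek -profile docker --input samples.tsv --trim_fastq --clip_r1 3 --clip_r2 3 --split_fastq 50000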
Specify aligner to be used to map reads to reference genome.
string
WARNING Current indices for bwa in AWS iGenomes are not compatible with bwa-mem2.
Use --bwa=false to have Sarek build them automatically.
WARNING BWA-mem2 is in active development.
Sarek might not be able to request the right amount of resources for it at the moment.
We recommend using pre-built indexes.
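For example, a sketch switching to bwa-mem2 and letting Sarek build the index (the aligner flag spelling is assumed):
## Align with bwa-mem2, building the index from the FASTA reference
nextflow run nf-core/sarek -profile docker --input samples.tsv --aligner bwa-mem2 --bwa=false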
Establish values for GATK MarkDuplicates memory consumption
string
-Xms4000m -Xmx7g
Enable usage of GATK Spark implementation
boolean
Save Mapped BAMs
boolean
Skip GATK MarkDuplicates
boolean
This param will also save the mapped BAMs, to enable restarting from the prepare_recalibration step.
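As an illustration (parameter names are taken from the descriptions above; values are assumptions):
## Use the Spark MarkDuplicates implementation, give it more Java heap, and keep the mapped BAMs
nextflow run nf-core/sarek -profile docker --input samples.tsv --use_gatk_spark --markdup_java_options "-Xms8000m -Xmx16g" --save_bam_mapped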
Overwrite ASCAT ploidy
string
null
Requires that --ascat_purity is set.
Overwrite ASCAT purity
string
null
Requires that --ascat_ploidy is set.
Overwrite Control-FREEC coefficientOfVariation
number
0.05
Overwrite Control-FREEC ploidy
string
2
Overwrite Control-FREEC window size
number
It is recommended to use a window size of 0 for exome data
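A hedged example of overriding the copy-number caller defaults (parameter spellings follow the descriptions above; values are purely illustrative):
## Fix ASCAT ploidy and purity, and use a 50 kb Control-FREEC window
nextflow run nf-core/sarek -profile docker --input samples.tsv --tools ASCAT,Control-FREEC --ascat_ploidy 2 --ascat_purity 0.8 --cf_ploidy 2 --cf_window 50000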
Generate g.vcf output from GATK HaplotypeCaller
boolean
Will not use Manta candidateSmallIndels for Strelka
boolean
Not recommended by Best Practices
Panel-of-normals VCF (bgzipped) for GATK Mutect2 / Sentieon TNscope
string
Without a PON, there will be no calls flagged PASS in the FILTER field; only an unfiltered VCF is written.
It is recommended to make your own PON, as it depends on sequencer and library preparation.
For tests in iGenomes there is a dummy PON file in the Annotation/GermlineResource directory, but it should not be used as a real PON file.
NB PON file should be bgzipped.
Index of PON panel-of-normals VCF
string
If none provided, will be generated automatically from the PON bgzipped VCF file.
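For instance (paths are hypothetical; the PON must be bgzipped):
## Somatic calling with Mutect2 using a custom panel of normals
nextflow run nf-core/sarek -profile docker --input samples.tsv --tools Mutect2 --pon pon.vcf.gz --pon_index pon.vcf.gz.tbi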
Do not analyze soft clipped bases in the reads for GATK Mutect2
boolean
Uses the --dont-use-soft-clipped-bases option with GATK Mutect2.
If provided, UMIs steps will be run to extract and annotate the reads with UMI and create consensus reads
boolean
This part of the pipeline uses fgbio to convert the FASTQ files into an unmapped BAM, where reads are tagged with the UMIs extracted from the FASTQ sequences.
In order to allow the correct tagging, the UMI sequence must be contained in the read sequence itself, and not in the FASTQ filename.
Following this step, the unmapped BAM is aligned and reads are then grouped based on mapping position and UMI tag.
Finally, reads in the same groups are collapsed to create a consensus read.
To create consensus, we have chosen to use the adjacency method
cf fgbio
cf UMIs, the problem, the solution and the proof
NB In order for the correct tagging to be performed, a read structure needs to be specified with --read_structure1 and --read_structure2.
When processing UMIs, a read structure should always be provided for each of the fastq files.
string
null
If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
The read structure follows a format adopted by different tools and described in the fgbio documentation
When processing UMIs, a read structure should always be provided for each of the fastq files.
string
null
If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
The read structure follows a format adopted by different tools and described in the fgbio documentation
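A sketch of a UMI-aware run, assuming an 8 bp UMI at the start of both reads (the read structures shown are only an example and must match your library design):
## Extract UMIs and build consensus reads before mapping
nextflow run nf-core/sarek -profile docker --input samples.tsv --umi --read_structure1 "8M+T" --read_structure2 "8M+T"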
Specify from which tools Sarek should look for VCF files to annotate
string
Only for step annotate
Enable the use of cache for annotation
boolean
Also disables usage of the Sarek snpEff- and VEP-specific containers for annotation.
To be used with --snpeff_cache and/or --vep_cache.
Enable CADD cache.
string
null
Path to CADD InDels file.
string
null
Path to CADD InDels index.
string
null
Path to CADD SNVs file.
string
null
Path to CADD SNVs index.
string
null
Enable the use of the VEP GeneSplicer plugin.
boolean
Path to snpEff cache
string
null
To be used with --annotation_cache
Path to VEP cache
string
null
To be used with --annotation_cache
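For example, a sketch of annotation against a local cache (the cache path is hypothetical):
## Annotate previously called VCFs with VEP using a locally downloaded cache
nextflow run nf-core/sarek -profile docker --step annotate --input "results/VariantCalling/*/*.vcf.gz" --tools VEP --annotation_cache --vep_cache /path/to/vep_cache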
Options for the reference genome files
Name of iGenomes reference.
string
If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. --genome GRCh38.
See the nf-core website docs for more details.
Path to ASCAT loci file.
string
Path to ASCAT GC correction file.
string
Path to BWA mem indices.
string
NB If none provided, will be generated automatically from the FASTA reference.
Path to chromosomes folder.
string
Path to chromosomes length file.
string
Path to dbsnp file.
string
Path to dbsnp index.
string
NB If none provided, will be generated automatically from the dbsnp file.
Path to FASTA dictionary file.
string
NB If none provided, will be generated automatically from the FASTA reference.
Path to FASTA genome file.
string
If you have no genome reference available, the pipeline can build one using a FASTA file. This requires additional time and resources, so it's better to use a pre-built index if possible.
Path to FASTA reference index.
string
NB If none provided, will be generated automatically from the FASTA reference
Path to GATK Mutect2 Germline Resource File
string
The germline resource VCF file (bgzipped and tabixed) needed by GATK4 Mutect2 is a collection of calls that are likely present in the sample, with allele frequencies.
The AF info field must be present.
You can find a smaller, stripped gnomAD VCF file (most of the annotation is removed and only calls flagged PASS are stored) in the AWS iGenomes Annotation/GermlineResource folder.
Path to GATK Mutect2 Germline Resource Index
string
NB If none provided, will be generated automatically from the Germline Resource file, if provided
Path to intervals file
string
To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces.
The intervals are chromosomes cut at their centromeres (so each chromosome arm is processed separately), plus additional unassigned contigs.
We are ignoring the hs37d5 contig that contains concatenated decoy sequences.
Parts of preprocessing and variant calling are done by these intervals, and the different resulting files are then merged.
This can parallelize processes, and push down wall clock time significantly.
The calling intervals can be defined using a .list or a BED file.
A .list file contains one interval per line in the format chromosome:start-end (1-based coordinates).
A BED file must be a tab-separated text file with one interval per line.
There must be at least three columns: chromosome, start, and end (0-based coordinates).
Additionally, the score column of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval.
The fourth column remains unused.
For example, the BED line: chr1	10000	207666	NA	47.3
This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
The runtime estimate is used in two different ways.
First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that needs to be spawned.
Second, the jobs with largest processing time are started first, which reduces wall-clock time.
If no runtime is given, a time of 1000 nucleotides per second is assumed.
Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
If you prefer, you can specify the full path to your reference genome when you run the pipeline:
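For instance (the path shown is hypothetical):
nextflow run nf-core/sarek -profile docker --input samples.tsv --fasta /path/to/reference.fasta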
NB If none provided, will be generated automatically from the FASTA reference
NB Use --no_intervals to disable automatic generation
Path to known indels file
string
Path to known indels file index
string
NB If none provided, will be generated automatically from the known indels file, if provided
Path to Control-FREEC mappability file
string
snpEff DB version
string
snpEff species
string
If you use AWS iGenomes or a local resource with genomes.conf, this has already been set for you appropriately.
VEP cache version
string
Save built references
boolean
Directory / URL base for iGenomes references.
string
s3://ngi-igenomes/igenomes/
Directory / URL base for genomes references.
string
null
All files are supposed to be in the same folder
Do not load the iGenomes reference config.
boolean
Do not load igenomes.config when running the pipeline.
You may choose this option if you observe clashes between custom parameters and those supplied in igenomes.config.
This option will load the genomes.config file instead.
NB You can then specify a custom genome and must specify at least a FASTA genome file.
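A minimal sketch of running with a custom genome instead of iGenomes (paths are hypothetical; the flag combination follows the NB above):
## Ignore iGenomes and provide a local FASTA reference
nextflow run nf-core/sarek -profile docker --input samples.tsv --igenomes_ignore --genome custom --fasta /path/to/genome.fasta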
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
You're reading it.
Method used to save pipeline results to output directory.
string
The Nextflow publishDir
option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Workflow name.
string
A custom name for the pipeline run. Unlike the core Nextflow -name option with one hyphen, this parameter can be reused multiple times, for example if using -resume. Passed through to steps such as MultiQC and used for things like report filenames and titles.
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config
) then you don't need to specify this on the command line for every run.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
This works exactly as with --email
, except emails are only sent if the workflow is not successful.
Send plain-text email instead of HTML.
boolean
Set to receive plain-text e-mails instead of HTML formatted.
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
If the file generated by the pipeline exceeds the threshold, it will not be attached.
Do not use coloured log outputs.
boolean
Set to disable colourful command line output and live life in monochrome.
Path to MultiQC custom config file.
string
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Name of sequencing center to be displayed in BAM file
string
null
It will be written to the CN field of the BAM read group (@RG) header.
Set the top limit for requested resources for any single job.
integer
8
Should be an integer e.g. --cpus 7
Use to set memory for a single CPU.
string
7.GB
Should be a string in the format integer-unit e.g. --single_cpu_mem '8.GB'
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
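For example, capping resources on a smaller machine might look like this (values are illustrative):
## Limit every single process to at most 8 CPUs, 32 GB of memory and 24 hours
nextflow run nf-core/sarek -profile docker --input samples.tsv --max_cpus 8 --max_memory '32.GB' --max_time '24.h'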
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Provide git commit id for custom Institutional configs hosted at nf-core/configs. This was implemented for reproducibility purposes. Default: master.
## Download and use config file with the following git commit id
--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the custom_config_base
option. For example:
## Download and unzip the config files
cd /path/to/my/configs
wget https://github.com/nf-core/configs/archive/master.zip
unzip master.zip
## Run the pipeline
cd /path/to/my/data
nextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/
Note that the nf-core/tools helper package has a download command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.
Institutional configs hostname.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string