To speed up preprocessing and variant calling processes, the execution is parallelized across a reference chopped into smaller pieces.
Parts of preprocessing and variant calling are done by these intervals, the different resulting files are then merged.
This can parallelize processes, and push down wall clock time significantly.
We are aligning to the whole genome, and then run Base Quality Score Recalibration and Variant Calling on the supplied regions.
*Whole Genome Sequencing:*
The (provided) intervals are chromosomes cut at their centromeres (so each chromosome arm processed separately) also additional unassigned contigs.
We are ignoring the
hs37d5 contig that contains concatenated decoy sequences.
The calling intervals can be defined using a .list or a BED file.
A .list file contains one interval per line in the format
chromosome:start-end (1-based coordinates).
A BED file must be a tab-separated text file with one interval per line.
There must be at least three columns: chromosome, start, and end (0-based coordinates).
Additionally, the score column of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval.
The fourth column remains unused.
This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
The runtime estimate is used in two different ways.
First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that needs to be spawned.
Second, the jobs with largest processing time are started first, which reduces wall-clock time.
If no runtime is given, a time of 200000 nucleotides per second is assumed. See
--nucleotides_per_second on how to customize this.
Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
If you prefer, you can specify the full path to your reference genome when you run the pipeline:
*NB* If none provided, will be generated automatically from the FASTA reference
*NB* Use --no_intervals to disable automatic generation.
The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a
BED file containing targets for all steps using the
--intervals option. In addition, the parameter
--wes should be set.
It is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow.
The procedure is similar to whole genome sequencing, except that only BED file are accepted. See above for formatting description.
Adding every exon as an interval in case of
WES can generate >200K processes or jobs, much more forks, and similar number of directories in the Nextflow work directory. These are appropriately grouped together to reduce number of processes run in parallel (see above and
--nucleotides_per_second for details).
Furthermore, primers and/or baits are not 100% specific, (certainly not for MHC and KIR, etc.), quite likely there going to be reads mapping to multiple locations.
If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is an overkill, it is actually better to change the reference itself.