To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces.
The intervals are chromosomes cut at their centromeres (so that each chromosome arm is processed separately), plus additional unassigned contigs.
We ignore the hs37d5 contig, which contains concatenated decoy sequences.
Parts of preprocessing and variant calling are run per interval, and the resulting files are then merged.
This lets those steps run in parallel, which can reduce wall-clock time significantly.
The calling intervals can be defined using a .list or a BED file.
A .list file contains one interval per line in the format chromosome:start-end (1-based coordinates).
A BED file must be a tab-separated text file with one interval per line.
There must be at least three columns: chromosome, start, and end (0-based coordinates).
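The two coordinate conventions are easy to mix up, so here is a minimal sketch of converting between them. It assumes the usual BED convention (0-based start, exclusive end) and a 1-based, inclusive .list interval; the function names are hypothetical and not part of the pipeline.

```python
from typing import Tuple


def list_to_bed(interval: str) -> Tuple[str, int, int]:
    """Convert 'chromosome:start-end' (1-based, inclusive) to BED fields."""
    # Split on the last colon, in case the contig name itself contains one.
    chrom, span = interval.rsplit(":", 1)
    start, end = span.split("-")
    # Only the start shifts: BED starts are 0-based and BED ends are exclusive.
    return chrom, int(start) - 1, int(end)


def bed_to_list(chrom: str, start: int, end: int) -> str:
    """Convert BED fields (0-based start, exclusive end) to a .list interval."""
    return f"{chrom}:{start + 1}-{end}"
```

Under these assumptions, the .list interval chr1:10001-207666 corresponds to the BED fields chr1, 10000, 207666.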
Additionally, the score column of the BED file (the fifth column) can be used to provide an estimate of how many seconds it will take to call variants on that interval.
The fourth column remains unused.
For example, a line with chromosome chr1, start 10000, end 207666, and a score of 47.3 indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
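As an illustration of the column layout, the following sketch parses one BED line and picks up the optional runtime estimate from the fifth column. The `Interval` type and `parse_bed_line` function are hypothetical names, and the `NA` placeholder used for the unused fourth column in the usage note is an assumption, not something the pipeline requires.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, as in BED
    end: int
    seconds: Optional[float]  # runtime estimate, None if not given


def parse_bed_line(line: str) -> Interval:
    """Parse a tab-separated BED line with an optional fifth (score) column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chromosome, start, and end")
    # The fourth column (fields[3]) is ignored; the fifth holds the estimate.
    seconds = float(fields[4]) if len(fields) >= 5 else None
    return Interval(fields[0], int(fields[1]), int(fields[2]), seconds)
```

For instance, `parse_bed_line("chr1\t10000\t207666\tNA\t47.3")` yields an interval for chr1 with a 47.3 second estimate, while a three-column line yields `seconds=None`.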
The runtime estimate is used in two different ways.
First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that need to be spawned.
Second, the jobs with largest processing time are started first, which reduces wall-clock time.
If no runtime is given, a rate of 1000 nucleotides per second is assumed.
Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
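The two uses of the runtime estimate can be sketched roughly as follows. This is not the pipeline's actual implementation: the `max_job_seconds` budget and the function names are hypothetical, and the grouping rule (accumulate consecutive intervals until the budget would be exceeded) is just one plausible reading of the description above. Only the 1000 nucleotides per second default comes from the text.

```python
DEFAULT_RATE = 1000.0  # nucleotides per second, the default stated above


def estimate_seconds(start, end, seconds=None):
    """Use the given estimate, or fall back to interval length / default rate."""
    if seconds is not None:
        return seconds
    return (end - start) / DEFAULT_RATE


def plan_jobs(intervals, max_job_seconds):
    """Group consecutive short intervals into jobs, then order longest first.

    intervals: list of (chrom, start, end, seconds_or_None) in file order.
    Returns a list of (estimated_seconds, [(chrom, start, end), ...]).
    """
    jobs = []
    current, current_time = [], 0.0
    for chrom, start, end, seconds in intervals:
        t = estimate_seconds(start, end, seconds)
        # Close the current job if adding this interval would exceed the budget.
        if current and current_time + t > max_job_seconds:
            jobs.append((current_time, current))
            current, current_time = [], 0.0
        current.append((chrom, start, end))
        current_time += t
    if current:
        jobs.append((current_time, current))
    # Start the longest jobs first to reduce overall wall-clock time.
    jobs.sort(key=lambda job: job[0], reverse=True)
    return jobs
```

With a budget of 10 seconds, two short neighbouring intervals end up in one job, while an interval with a 50 second estimate forms its own job and is scheduled first.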
If you prefer, you can specify the full path to your reference genome when you run the pipeline:
NB: If none is provided, intervals will be generated automatically from the FASTA reference.
NB: Use --no_intervals to disable automatic generation.