DeepVariant as a Nextflow pipeline
A Nextflow pipeline for running the Google DeepVariant variant caller.
What is DeepVariant and why in Nextflow?
In December 2017, the Google Brain team released DeepVariant, a variant caller based on deep learning.
In practice, DeepVariant first builds images from the BAM file, then uses a deep learning image recognition approach to call the variants, and finally converts the prediction output into the standard VCF format.
Running DeepVariant as a Nextflow pipeline offers several advantages. Preprocessing steps automatically create the additional indexed and compressed files that DeepVariant requires as input and that users would otherwise have to produce manually. Variant calling can be performed on multiple BAM files at the same time, and thanks to Nextflow’s internal parallelization no resources are wasted. Nextflow’s Docker support makes the results computationally reproducible and keeps the environment clean by running every step inside a Docker container.
Warning: DeepVariant can be very computationally intensive to run.
To test the pipeline you can run:
A typical run on whole genome data looks like this:
In this case, variants are called on the BAM files contained in the testdata directory, using the hg19 version of the reference genome. One VCF file is produced and can be found in the “results” folder.
A typical run on whole exome data looks like this:
The nf-core/deepvariant documentation is split into the following files:
- Running the pipeline
- Pipeline configuration
- Output and how to interpret the results
- More about DeepVariant
More about the pipeline
As shown in the following picture, the workflow contains both preprocessing steps (light blue) and the actual variant calling steps (darker blue).
Some input files are optional. If they are not given, they are created automatically during the preprocessing steps; if they are given, the corresponding preprocessing steps are skipped. For more information about preprocessing, please refer to the “INPUT PARAMETERS” section.
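The "use it if provided, otherwise create it" rule for optional inputs can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the pipeline's actual implementation: the function name `resolve_optional_input` and the `create` callback are hypothetical, and in the real workflow the creation step is a Nextflow preprocessing process.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_optional_input(
    provided: Optional[Path],
    create: Callable[[], Path],
) -> Tuple[Path, bool]:
    """Return (path, was_created) for an optional input file.

    Hypothetical helper: `create` stands in for a preprocessing step
    (for example, building an index) and is only called when the user
    did not supply the file.
    """
    if provided is not None:
        # The user supplied the file, so the preprocessing step is skipped.
        return provided, False
    # No file was given, so the preprocessing step produces it.
    return create(), True
```

Passing the creation step as a callback keeps it lazy, so no work is done when the user already has the file.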
The workflow accepts one reference genome and multiple BAM files as input. Each input BAM file is processed completely independently and produces its own VCF result file. The advantage of this approach is that Nextflow can parallelize the variant calling of the different BAM files internally and use all the cores of the machine to deliver the results as quickly as possible.
This pipeline was originally developed at Lifebit, by @luisas, to make variant calling analyses easier and cheaper.
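As a rough illustration of this "one reference, many independent BAM files" pattern, the sketch below maps a per-file function over the inputs with a thread pool. It is not how Nextflow schedules work internally; the names `call_variants_independently` and `placeholder_caller` are hypothetical, and the placeholder only derives an output file name instead of running DeepVariant.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def call_variants_independently(
    bam_files: Iterable[Path],
    reference: Path,
    call_one: Callable[[Path, Path], Path],
    max_workers: Optional[int] = None,
) -> Dict[Path, Path]:
    """Run `call_one(reference, bam)` for every BAM file in parallel.

    Each BAM file is handled on its own, so no call depends on another
    and each one yields its own VCF path.
    """
    bams = list(bam_files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `bams`.
        vcfs = pool.map(lambda bam: call_one(reference, bam), bams)
        return dict(zip(bams, vcfs))


def placeholder_caller(reference: Path, bam: Path) -> Path:
    """Stand-in for a real variant calling step (assumption, not DeepVariant).

    It only computes the name of the VCF that a real step would write.
    """
    return bam.with_suffix(".vcf")
```

For example, `call_variants_independently([Path("a.bam"), Path("b.bam")], Path("hg19.fa"), placeholder_caller)` returns a mapping from each BAM path to its own VCF path. Threads are sufficient in this sketch because a real per-file step would spend its time in an external process.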
Many thanks to nf-core and those who have helped out along the way too, including (but not limited to): @ewels, @MaxUlysse, @apeltzer, @sven1103 & @pditommaso