Mount Skierfe outlined showing the inspiration for the Sarek logo

We are extremely happy to see this paper out. It describes the changes and updates made to nf-core/sarek, the DNA variant calling pipeline, over the last several years.

In 2020, we embarked on the journey of rewriting the whole pipeline in DSL2. One of the major motivations was to bring down cloud computing costs and generally reduce storage space and computational resources.

Overview

No nf-core pipeline without a metro map 🚇:

Metromap of nf-core/sarek

New tools

We added new tools: BwaMem2 and DragMap for alignment, more variant callers (DeepVariant, GATK HaplotypeCaller Joint Calling & Single Sample variant recalibration, CNVKit, Tiddit), and more annotation possibilities.

Some tools were replaced: trimming is now done with FastP, and CRAM quality control with Mosdepth. For convenience, we added more quality control options: when starting directly from variant calling, all input files can now run through the alignment QC steps.

Thor throwing a cup and screaming 'Another'

Resource optimization

Use CRAM files

We ditched the BAM format where possible and switched to CRAM, which cut our work storage space by a factor of four.
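As a rough illustration of what such a conversion looks like outside the pipeline, here is a minimal sketch that builds a `samtools view` command for turning a BAM file into a CRAM file. It is not how Sarek does it internally. The function name `bam_to_cram_command` and all file names are hypothetical. The sketch only assembles the command, so it runs without samtools installed.

```python
import shlex


def bam_to_cram_command(bam_path, reference_fasta, cram_path, threads=1):
    """Build (but do not run) a samtools command that converts BAM to CRAM.

    Hypothetical helper for illustration only, not part of nf-core/sarek.
    """
    return [
        "samtools", "view",
        "-C",                   # write CRAM instead of BAM
        "-T", reference_fasta,  # reference FASTA used for CRAM compression
        "-@", str(threads),     # additional compression threads
        "-o", cram_path,        # output file
        bam_path,               # input file
    ]


# Placeholder file names, assumed for the example.
cmd = bam_to_cram_command("sample.bam", "genome.fasta", "sample.cram", threads=4)
print(shlex.join(cmd))
```

To actually run the conversion, the list could be passed to `subprocess.run(cmd, check=True)`. Keep in mind that reading the CRAM back later requires access to the same reference.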

Toy Story meme

Bar chart showing reduction in storage space usage

Split files (but not too much)

The splitFastq() operator was replaced by a FastP process that splits the FASTQ files before alignment (into 12 chunks by default) and also replaces Trim Galore! for read trimming. We also changed the default number of interval groups for BQSR from 124 to 21, which reduced storage space by another factor of four and sped up processing.
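To make the idea of interval grouping more concrete, here is a minimal sketch of one way to pack genomic intervals into a fixed number of groups of similar total length. It is a generic greedy heuristic chosen for illustration, not Sarek's actual implementation. The function `group_intervals` and the toy intervals are hypothetical, and coordinates are assumed to be half-open, as in BED files.

```python
import heapq


def group_intervals(intervals, n_groups):
    """Distribute (chrom, start, end) intervals over at most n_groups groups.

    Greedy heuristic: take the longest interval first and always put it
    into the group with the smallest total length so far.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups = [[] for _ in range(n_groups)]
    # Min-heap of (total length so far, group index).
    heap = [(0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    for iv in sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(iv)
        heapq.heappush(heap, (total + iv[2] - iv[1], idx))
    # Drop empty groups when there are fewer intervals than groups.
    return [g for g in groups if g]


# Hypothetical toy intervals, not real genome coordinates.
toy = [
    ("chr1", 0, 1000),
    ("chr2", 0, 800),
    ("chr3", 0, 600),
    ("chr4", 0, 400),
    ("chr5", 0, 200),
]
groups = group_intervals(toy, 2)
sizes = [sum(end - start for _, start, end in g) for g in groups]
print(sizes)
```

Fewer groups mean fewer parallel jobs and fewer intermediate files per sample, while more groups mean shorter jobs. The right balance depends on the infrastructure, which is why the default was chosen empirically.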

Evaluation of FastP usage and different interval group sizes for BQSR

Cost savings

Overall, we reduced computational costs on AWS (measured last summer, using spot instances) by 70%, to about $20 for a run from FASTQs to annotated VCFs using Strelka, Manta, and VEP.
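For a sense of scale, the two figures above can be turned into an approximate earlier cost. This is back-of-the-envelope arithmetic on rounded numbers, not a measured value, and the helper name `cost_before_reduction` is made up for the example.

```python
def cost_before_reduction(cost_after, reduction_fraction):
    """Return the original cost implied by a relative reduction.

    cost_after = cost_before * (1 - reduction_fraction)
    """
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return cost_after / (1 - reduction_fraction)


# About $20 after a 70% reduction (both values are rounded in the text).
previous = cost_before_reduction(20.0, 0.70)
print(f"approximate earlier cost: ${previous:.0f}")
```

In other words, a 70% saving that lands at about $20 implies a starting point somewhere in the region of $65 to $70.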

Little gecko saying: 'you could save'

Benchmarking, a.k.a. is it any good?

We benchmarked the germline track with Illumina, MGI, and BGI GIAB (Genome in a Bottle) samples and the somatic track with SEQC2 samples. We recently joined the NCBench effort to continuously validate the pipeline on release.

Team work makes dream work

This was a gigantic team effort with Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Matthias Seybold, Oskar Wacker, Nick Smith, Gisela Gabernet, Sven Nahnsen, and many, many others from the nf-core community.

Also, a shout-out to all the amazing people who started Sarek way back in 2016: Szilveszter Juhos, Malin Larsson, Pall I. Olason, Marcel Martin, Jesper Eisfeldt, Sebastian DiLorenzo, Johanna Sandgren, Teresita Díaz De Ståhl, Phil Ewels, Valtteri Wirta, Monica Nistér, Max Käller, and Björn Nystedt.

Join the fun

If you want to join us, visit https://nf-co.re/join/. We're on the #sarek channel on Slack, and you're welcome to join #sarek_dev if you really want to get involved.

There is more

If you want to know more, here are some recent talks detailing the changes and development journey:


Published on
24 April 2024