Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data through the steps described in the sections below.

Directory Structure

The default directory structure is as follows:

{outdir}
├── csv
├── multiqc
├── pipeline_info
├── preprocessing
│   ├── markduplicates
│   │   └── <sample>
│   ├── recal_table
│   │   └── <sample>
│   └── recalibrated
│       └── <sample>
├── reference
└── reports
    ├── <tool1>
    └── <tool2>
work/
.nextflow.log

Preprocessing

Sarek pre-processes raw FastQ files or unmapped BAM files, based on GATK best practices.

Preparation of input files (FastQ or (u)BAM)

FastP is a tool designed to provide all-in-one preprocessing for FastQ files and as such is used for trimming and splitting. By default, these files are not published. If publishing is enabled, be aware that the files are only published once: if both trimming and splitting are enabled, the published files are sharded FastQ files containing trimmed reads; if only one of the two is enabled, the files contain either trimmed or split reads, respectively.

Clip and filter read length

FastP enables efficient clipping of reads from either the 5’ end (--clip_r1, --clip_r2) or the 3’ end (--three_prime_clip_r1, --three_prime_clip_r2). Additionally, FastP allows filtering of reads based on read length by specifying a minimum required length with the --length_required parameter (default: 15bp). It is recommended to optimize these parameters according to the specific characteristics of your data.
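
As a hedged illustration (the pipeline invocation, samplesheet and output paths are placeholders, and the clipping and length values are examples only), these parameters could be set on the command line like this:

    # Hedged sketch: clip 5 bp from both ends of R1/R2 and discard reads
    # shorter than 36 bp after trimming (values are illustrative only)
    nextflow run nf-core/sarek \
        --input samplesheet.csv \
        --outdir results \
        --clip_r1 5 --clip_r2 5 \
        --three_prime_clip_r1 5 --three_prime_clip_r2 5 \
        --length_required 36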

Trim adapters

FastP supports global trimming, which means it trims all reads at the front or the tail. This function is useful when you want to drop some cycles of a sequencing run. In the current implementation in Sarek, --detect_adapter_for_pe is set by default, which enables auto-detection of adapter sequences. For more information on how to fine-tune adapter trimming, take a look at the parameter docs.

The resulting files are intermediate and by default not kept in the final files delivered to users. Set --save_trimmed to enable publishing of the files in:

Output files for all samples

Output directory: {outdir}/preprocessing/fastp/<sample>

  • <sample>_<lane>_{1,2}.fastp.fastq.gz
    • Bgzipped FastQ file

Split FastQ files

FastP supports splitting of one FastQ file into multiple files, allowing parallel alignment of sharded FastQ files. To enable splitting, the number of reads per output file can be specified. For more information, take a look at the --split_fastq parameter in the parameter docs.
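
As a hedged sketch (the read count and file paths are example values, not recommendations), splitting could be enabled like this:

    # Hedged sketch: shard each FastQ into chunks of 50 million reads so the
    # chunks can be aligned in parallel
    nextflow run nf-core/sarek \
        --input samplesheet.csv \
        --outdir results \
        --split_fastq 50000000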

These files are intermediate and by default not kept in the final files delivered to users. Set --save_split to enable publishing of these files to:

Output files for all samples

Output directory: {outdir}/preprocessing/fastp/<sample>/

  • <sample>_<lane>_{1,2}.fastp.fastq.gz
    • Bgzipped FastQ file

UMI consensus

Sarek can process UMI-reads, using fgbio tools.
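
The sketch below shows a typical fgbio consensus-calling flow; it is not the pipeline's exact command sequence, and the file names and option values are illustrative assumptions.

    # Hedged sketch: group reads by UMI, then call consensus reads
    fgbio GroupReadsByUmi \
        --input=mapped.bam \
        --output=grouped.bam \
        --strategy=adjacency          # group reads sharing the same UMI
    fgbio CallMolecularConsensusReads \
        --input=grouped.bam \
        --output=sample1_lane1.umi-consensus.bam \
        --min-reads=1                 # minimum reads required per consensus read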

These files are intermediate and by default not kept in the final files delivered to users. Set --save_split to enable publishing of these files to:

Output files for all samples

Output directory: {outdir}/preprocessing/umi/<sample>/

  • <sample_lane_{1,2}.umi-consensus.bam>

Output directory: {outdir}/reports/umi/

  • <sample_lane_{1,2}_umi_histogram.txt>

Map to Reference

BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.
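
A hedged sketch of the usual BWA-MEM alignment and coordinate-sorting step is shown below; the thread count, read group, reference and file names are placeholders, not the pipeline's exact invocation.

    # Hedged sketch: align paired-end reads and coordinate-sort the output
    bwa mem -t 8 -R '@RG\tID:lane1\tSM:sample1\tPL:ILLUMINA' \
        genome.fasta sample1_R1.fastq.gz sample1_R2.fastq.gz \
      | samtools sort -@ 8 -o sample1.sorted.bam -
    samtools index sample1.sorted.bam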

BWA-mem2

BWA-mem2 is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.

DragMap

DragMap is an open-source software implementation of the DRAGEN mapper, which the Illumina team created to provide an open-source way of reproducing the results of their proprietary DRAGEN hardware. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.

These files are intermediate and by default not kept in the final files delivered to users. Set --save_mapped to enable publishing; in addition, set --save_output_as_bam to publish in BAM format.

Sentieon BWA mem

Sentieon bwa mem is a subroutine for mapping low-divergent sequences against a large reference genome. It is part of the proprietary software package DNAseq from Sentieon.

The aligned reads are coordinate-sorted with Sentieon.

Output files for all mappers and samples

The alignment files (BAM or CRAM) produced by the chosen aligner are not published by default. CRAM output files will not be saved in the output-folder (outdir), unless the flag --save_mapped is used. BAM output can be selected by setting the flag --save_output_as_bam.

Output directory: {outdir}/preprocessing/mapped/<sample>/

  • if --save_mapped: <sample>.sorted.cram and <sample>.sorted.cram.crai

    • CRAM file and index
  • if --save_mapped --save_output_as_bam: <sample>.sorted.bam and <sample>.sorted.bam.bai

    • BAM file and index

Mark Duplicates

During duplicate marking, read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artificial processes are identified. These are considered to be non-independent observations, so all but a single read pair within each set of duplicates are marked, causing the marked pairs to be ignored by default during the variant discovery process.

For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.

GATK MarkDuplicates (Spark)

By default, Sarek will use GATK MarkDuplicates.

To use the corresponding Spark implementation, GATK MarkDuplicatesSpark, please specify --use_gatk_spark markduplicates. The resulting files are converted to CRAM either with samtools (when GATK MarkDuplicates is used) or implicitly by GATK MarkDuplicatesSpark.
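
A hedged sketch of duplicate marking followed by CRAM conversion is shown below; file names and the reference are placeholders, and the pipeline's exact invocation may differ.

    # Hedged sketch: mark duplicates, then convert the result to CRAM
    gatk MarkDuplicates \
        --INPUT sample1.sorted.bam \
        --OUTPUT sample1.md.bam \
        --METRICS_FILE sample1.md.cram.metrics
    samtools view -C -T genome.fasta -o sample1.md.cram sample1.md.bam
    samtools index sample1.md.cram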

The resulting CRAM files are delivered to the users.

Output files for all samples

Output directory: {outdir}/preprocessing/markduplicates/<sample>/

  • <sample>.md.cram and <sample>.md.cram.crai
    • CRAM file and index
  • if --save_output_as_bam:
    • <sample>.md.bam and <sample>.md.bam.bai

Sentieon LocusCollector and Dedup

The subroutines LocusCollector and Dedup are part of the Sentieon DNAseq package, which provides sped-up versions of the standard GATK tools; together, these two subroutines correspond to GATK’s MarkDuplicates.

The subroutine LocusCollector collects read information that will be used for removing or tagging duplicate reads; its output is the score file indicating which reads are likely duplicates.

The subroutine Dedup marks or removes duplicate reads based on the score file supplied by LocusCollector, and produces a BAM or CRAM file.

Output files for all samples

Output directory: {outdir}/preprocessing/sentieon_dedup/<sample>/

  • <sample>.dedup.cram and <sample>.dedup.cram.crai
    • CRAM file and index
  • if --save_output_as_bam:
    • <sample>.dedup.bam and <sample>.dedup.bam.bai

Base Quality Score Recalibration

During Base Quality Score Recalibration, machine learning is applied to detect and correct systematic errors in the base quality scores. This is important for correctly evaluating variant calls during the variant discovery process. However, this step is not needed for all combinations of tools in Sarek. Notably, it should be turned off for UMI-tagged reads or when DragMap (see here) is used as the mapper.

For further reading and documentation see the technical documentation by GATK.

GATK BaseRecalibrator (Spark)

GATK BaseRecalibrator generates a recalibration table based on various co-variates.

To use the corresponding Spark implementation, GATK BaseRecalibratorSpark, please specify --use_gatk_spark baserecalibrator.
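
A hedged sketch of generating the recalibration table (known-sites resources and file names are placeholders):

    # Hedged sketch: build a recalibration table from known variant sites
    gatk BaseRecalibrator \
        --input sample1.md.cram \
        --reference genome.fasta \
        --known-sites dbsnp.vcf.gz \
        --known-sites known_indels.vcf.gz \
        --output sample1.recal.table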

Output files for all samples

Output directory: {outdir}/preprocessing/recal_table/<sample>/

  • <sample>.recal.table
    • Recalibration table associated with the duplicate-marked CRAM file.

GATK ApplyBQSR (Spark)

GATK ApplyBQSR recalibrates the base qualities of the input reads based on the recalibration table produced by the GATK BaseRecalibrator tool.

To use the corresponding Spark implementation, GATK ApplyBQSRSpark, please specify --use_gatk_spark baserecalibrator.
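
A hedged sketch of applying the recalibration table (file names are placeholders):

    # Hedged sketch: recalibrate base qualities using the table from BaseRecalibrator
    gatk ApplyBQSR \
        --input sample1.md.cram \
        --reference genome.fasta \
        --bqsr-recal-file sample1.recal.table \
        --output sample1.recal.cram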

The resulting recalibrated CRAM files are delivered to the user. Recalibrated CRAM files are usually 2-3 times larger than the duplicate-marked CRAM files.

Output files for all samples

Output directory: {outdir}/preprocessing/recalibrated/<sample>/

  • <sample>.recal.cram and <sample>.recal.cram.crai
    • CRAM file and index
  • if --save_output_as_bam:
    • <sample>.recal.bam and <sample>.recal.bam.bai
      • BAM file and index

CSV files

The CSV files are auto-generated and can be used by Sarek for further processing and/or variant calling.

See the input section in the usage documentation for further reading and documentation on how to make the most of them.
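
As a hedged example (the step and tool names are assumptions based on typical usage; check the usage documentation for the exact values supported by your version), a generated CSV can be supplied as --input to resume from a later step:

    # Hedged sketch: restart at variant calling from the recalibrated CRAM files
    # listed in the auto-generated CSV
    nextflow run nf-core/sarek \
        --input results/preprocessing/csv/recalibrated.csv \
        --step variant_calling \
        --tools strelka \
        --outdir results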

Output files:

Output directory: {outdir}/preprocessing/csv

  • mapped.csv
    • if --save_mapped
    • CSV containing an entry for each sample with the columns patient,sample,sex,status,bam,bai
  • markduplicates_no_table.csv
    • CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai
  • markduplicates.csv
    • CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai,table
  • recalibrated.csv
    • CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai
  • variantcalled.csv
    • CSV containing an entry for each sample with the columns patient,sample,vcf

Variant Calling

The results regarding variant calling are collected in {outdir}/variantcalling/. If some results from a variant caller do not appear here, please check out the --tools section in the parameter documentation.

(Recalibrated) CRAM files can be used as an input to start the variant calling.

SNVs and small indels

For single nucleotide variants (SNVs) and small indels, multiple tools are available for normal (germline), tumor-only, and tumor-normal (somatic) paired data. For a list of the appropriate tool(s) for the data and sequencing type at hand, please check here.

bcftools

bcftools mpileup generates a pileup of a CRAM file, which is then passed to bcftools call and filtered with -i 'count(GT=="RR")==0'. For further reading and documentation see the bcftools manual.
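
A hedged sketch of this calling chain (reference and file names are placeholders; the pipeline's exact options may differ):

    # Hedged sketch: pile up, call, then keep only sites with at least one
    # non-homozygous-reference genotype
    bcftools mpileup -f genome.fasta sample1.recal.cram \
      | bcftools call -mv -Oz -o sample1.bcftools.vcf.gz
    bcftools view -i 'count(GT=="RR")==0' \
        -Oz -o sample1.bcftools.filtered.vcf.gz sample1.bcftools.vcf.gz
    tabix -p vcf sample1.bcftools.filtered.vcf.gz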

Output files for all samples

Output directory: {outdir}/variantcalling/bcftools/<sample>/

  • <sample>.bcftools.vcf.gz and <sample>.bcftools.vcf.gz.tbi
    • VCF with tabix index

DeepVariant

DeepVariant is a deep learning-based variant caller that takes aligned reads, produces pileup image tensors from them, classifies each tensor using a convolutional neural network and finally reports the results in a standard VCF or gVCF file. For further documentation take a look here.

Output files for normal samples

Output directory: {outdir}/variantcalling/deepvariant/<sample>/

  • <sample>.deepvariant.vcf.gz and <sample>.deepvariant.vcf.gz.tbi
    • VCF with tabix index
  • <sample>.deepvariant.g.vcf.gz and <sample>.deepvariant.g.vcf.gz.tbi
    • gVCF with tabix index

FreeBayes

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. For further reading and documentation see the FreeBayes manual.

Output files for all samples

Output directory: {outdir}/variantcalling/freebayes/{sample,normalsample_vs_tumorsample}/

  • <sample>.freebayes.vcf.gz and <sample>.freebayes.vcf.gz.tbi
    • VCF with tabix index

GATK HaplotypeCaller

GATK HaplotypeCaller calls germline SNPs and indels via local re-assembly of haplotypes.

Output files for normal samples

Output directory: {outdir}/variantcalling/haplotypecaller/<sample>/

  • <sample>.haplotypecaller.vcf.gz and <sample>.haplotypecaller.vcf.gz.tbi
    • VCF with tabix index
GATK Germline Single Sample Variant Calling

GATK Single Sample Variant Calling uses HaplotypeCaller in its default single-sample mode to call variants. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so the variants are filtered by first running the CNNScoreVariants tool, which annotates each variant with a score indicating the model’s prediction of its quality. To apply filters based on those scores, the FilterVariantTranches tool is run with SNP and INDEL sensitivity tranches appropriate for the task.

If the haplotype-called VCF files are not filtered, then Sarek should be run with at least one of the options --dbsnp or --known_indels.
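
A hedged sketch of this filtering flow (resource files and tranche values are illustrative placeholders):

    # Hedged sketch: score variants with the CNN model, then filter by tranche
    gatk CNNScoreVariants \
        --variant sample1.haplotypecaller.vcf.gz \
        --reference genome.fasta \
        --output sample1.haplotypecaller.cnn.vcf.gz
    gatk FilterVariantTranches \
        --variant sample1.haplotypecaller.cnn.vcf.gz \
        --resource hapmap.vcf.gz \
        --resource mills_and_1000G_indels.vcf.gz \
        --info-key CNN_1D \
        --snp-tranche 99.9 --indel-tranche 99.5 \
        --output sample1.haplotypecaller.filtered.vcf.gz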

Output files for normal samples

Output directory: {outdir}/variantcalling/haplotypecaller/<sample>/

  • <sample>.haplotypecaller.filtered.vcf.gz and <sample>.haplotypecaller.filtered.vcf.gz.tbi
    • VCF with tabix index
GATK Joint Germline Variant Calling

GATK Joint Germline Variant Calling uses HaplotypeCaller per sample in GVCF mode. Next, the gVCFs from multiple samples are consolidated into a GenomicsDB datastore. After joint genotyping, VQSR is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.
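
A hedged sketch of this joint-genotyping flow (sample names, intervals and workspace paths are placeholders; the VQSR commands are omitted for brevity):

    # Hedged sketch: per-sample gVCFs, consolidation, then joint genotyping
    gatk HaplotypeCaller -R genome.fasta -I sample1.recal.cram \
        -O sample1.g.vcf.gz -ERC GVCF
    gatk GenomicsDBImport --genomicsdb-workspace-path genomicsdb \
        -V sample1.g.vcf.gz -V sample2.g.vcf.gz -L intervals.bed
    gatk GenotypeGVCFs -R genome.fasta -V gendb://genomicsdb \
        -O joint_germline.vcf.gz
    # VQSR (VariantRecalibrator + ApplyVQSR) then yields
    # joint_germline_recalibrated.vcf.gz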

Output files from joint germline variant calling

Output directory: {outdir}/variantcalling/haplotypecaller/<sample>/

  • <sample>.haplotypecaller.g.vcf.gz and <sample>.haplotypecaller.g.vcf.gz.tbi
    • gVCF with tabix index

Output directory: {outdir}/variantcalling/haplotypecaller/joint_variant_calling/

  • joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
    • VCF with tabix index
  • joint_germline_recalibrated.vcf.gz and joint_germline_recalibrated.vcf.gz.tbi
    • variant recalibrated VCF with tabix index (if VQSR is applied)

GATK Mutect2

GATK Mutect2 calls somatic SNVs and indels via local assembly of haplotypes. When --joint_mutect2 is used, the Mutect2 subworkflow outputs are saved in a subfolder named with the patient ID, and the {patient}.mutect2.vcf.gz file contains variant calls from all of the normal and tumor samples of the patient. For further reading and documentation see the Mutect2 manual. It is not required, but recommended, to have a panel of normals (PON) built from at least 40 normal samples to get filtered somatic calls. When using --genome GATK.GRCh38, a panel-of-normals file is available. However, it is highly recommended to create one matching your tumor samples. Creating your own panel-of-normals is currently not natively supported by the pipeline. See here for how to create one manually.
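
A hedged sketch of manual panel-of-normals creation with GATK (intervals, file names and the number of normals are placeholders; see the linked GATK documentation for the authoritative procedure):

    # Hedged sketch: call each normal in PON mode, consolidate, build the PON
    gatk Mutect2 -R genome.fasta -I normal1.cram --max-mnp-distance 0 -O normal1.vcf.gz
    gatk GenomicsDBImport -R genome.fasta -L intervals.bed \
        --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz
    gatk CreateSomaticPanelOfNormals -R genome.fasta -V gendb://pon_db -O pon.vcf.gz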

Output files for tumor-only and tumor/normal paired samples

Output directory: {outdir}/variantcalling/mutect2/{sample,tumorsample_vs_normalsample,patient}/

Files created:

  • {sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz and {sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz.tbi
    • unfiltered (raw) Mutect2 calls VCF with tabix index
  • {sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz.stats
    • a stats file generated during calling of raw variants (needed for filtering)
  • {sample,tumorsample_vs_normalsample}.mutect2.contamination.table
    • table calculating the fraction of reads coming from cross-sample contamination
  • {sample,tumorsample_vs_normalsample}.mutect2.segmentation.table
    • table containing segmentation of the tumor by minor allele fraction
  • {sample,tumorsample_vs_normalsample,patient}.mutect2.artifactprior.tar.gz
    • prior probabilities for read orientation artifacts
  • {sample,tumorsample,normalsample}.mutect2.pileups.table
    • tabulates pileup metrics for inferring contamination
  • {sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz and {sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz.tbi
    • filtered Mutect2 calls VCF with tabix index based on the probability that a variant is somatic
  • {sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz.filteringStats.tsv
    • a stats file generated during the filtering of Mutect2 called variants

Sentieon DNAscope

Sentieon DNAscope is a variant-caller which aims at outperforming GATK’s Haplotypecaller in terms of both speed and accuracy. DNAscope allows you to use a machine learning model to perform variant calling with higher accuracy by improving the candidate detection and filtering.

Unfiltered VCF-files for normal samples

Output directory: {outdir}/variantcalling/sentieon_dnascope/<sample>/

  • <sample>.dnascope.unfiltered.vcf.gz and <sample>.dnascope.unfiltered.vcf.gz.tbi
    • VCF with tabix index

The output from Sentieon’s DNAscope can be controlled through the option --sentieon_dnascope_emit_mode for Sarek, see Basic usage of Sentieon functions.

Unless dnascope_filter is listed under --skip_tools in the nextflow command, Sentieon’s DNAModelApply is applied to the unfiltered VCF-files in order to obtain filtered VCF-files.

Filtered VCF-files for normal samples

Output directory: {outdir}/variantcalling/sentieon_dnascope/<sample>/

  • <sample>.dnascope.filtered.vcf.gz and <sample>.dnascope.filtered.vcf.gz.tbi
    • VCF with tabix index
Sentieon DNAscope joint germline variant calling

In Sentieon’s package DNAscope, joint germline variant calling is done by first running Sentieon’s DNAscope in emit-mode gvcf for each sample and then running Sentieon’s GVCFtyper on the set of gVCF-files. See Basic usage of Sentieon functions for information on how joint germline variant calling can be done in Sarek using Sentieon’s DNAscope.

Output files from joint germline variant calling

Output directory: {outdir}/variantcalling/sentieon_dnascope/<sample>/

  • <sample>.dnascope.g.vcf.gz and <sample>.dnascope.g.vcf.gz.tbi
    • gVCF with tabix index

Output directory: {outdir}/variantcalling/sentieon_dnascope/joint_variant_calling/

  • joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
    • VCF with tabix index

Sentieon Haplotyper

Sentieon Haplotyper is Sentieon’s sped-up version of GATK’s HaplotypeCaller (see above).

Unfiltered VCF-files for normal samples

Output directory: {outdir}/variantcalling/sentieon_haplotyper/<sample>/

  • <sample>.haplotyper.unfiltered.vcf.gz and <sample>.haplotyper.unfiltered.vcf.gz.tbi
    • VCF with tabix index

The output from Sentieon’s Haplotyper can be controlled through the option --sentieon_haplotyper_emit_mode for Sarek, see Basic usage of Sentieon functions.

Unless haplotyper_filter is listed under --skip_tools in the nextflow command, GATK’s CNNScoreVariants and FilterVariantTranches (see above) are applied to the unfiltered VCF-files in order to obtain filtered VCF-files.

Filtered VCF-files for normal samples

Output directory: {outdir}/variantcalling/sentieon_haplotyper/<sample>/

  • <sample>.haplotyper.filtered.vcf.gz and <sample>.haplotyper.filtered.vcf.gz.tbi
    • VCF with tabix index
Sentieon Haplotyper joint germline variant calling

In Sentieon’s package DNAseq, joint germline variant calling is done by first running Sentieon’s Haplotyper in emit-mode gvcf for each sample and then running Sentieon’s GVCFtyper on the set of gVCF-files. See Basic usage of Sentieon functions for information on how joint germline variant calling can be done in Sarek using Sentieon’s DNAseq. After joint genotyping, Sentieon’s version of VQSR (VarCal and ApplyVarCal) is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Output files from joint germline variant calling

Output directory: {outdir}/variantcalling/sentieon_haplotyper/<sample>/

  • <sample>.haplotyper.g.vcf.gz and <sample>.haplotyper.g.vcf.gz.tbi
    • gVCF with tabix index

Output directory: {outdir}/variantcalling/sentieon_haplotyper/joint_variant_calling/

  • joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
    • VCF with tabix index
  • joint_germline_recalibrated.vcf.gz and joint_germline_recalibrated.vcf.gz.tbi
    • variant recalibrated VCF with tabix index (if VarCal is applied)

Strelka

Strelka is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. For further reading and documentation see the Strelka user guide. If Strelka is used for somatic variant calling and Manta is also specified in tools, the output candidate indels from Manta are used according to Strelka Best Practices. For further downstream analysis, take a look here.
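
A hedged sketch of configuring the somatic Strelka workflow with Manta's candidate indels (file names and thread count are placeholders):

    # Hedged sketch: configure and run the Strelka somatic workflow,
    # passing Manta's candidate small indels
    configureStrelkaSomaticWorkflow.py \
        --normalBam normal.cram \
        --tumorBam tumor.cram \
        --referenceFasta genome.fasta \
        --indelCandidates candidateSmallIndels.vcf.gz \
        --runDir strelka_somatic
    strelka_somatic/runWorkflow.py -m local -j 8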

Output files for all single samples (normal or tumor-only)

Output directory: {outdir}/variantcalling/strelka/<sample>/

  • <sample>.strelka.genome.vcf.gz and <sample>.strelka.genome.vcf.gz.tbi
    • genome VCF with tabix index
  • <sample>.strelka.variants.vcf.gz and <sample>.strelka.variants.vcf.gz.tbi
    • VCF with tabix index with all potential variant loci across the sample. Note this file includes non-variant loci if they have a non-trivial level of variant evidence or contain one or more alleles for which genotyping has been forced.
Output files for tumor/normal paired samples

Output directory: {outdir}/variantcalling/strelka/<tumorsample_vs_normalsample>/

  • <tumorsample_vs_normalsample>.strelka.somatic_indels.vcf.gz and <tumorsample_vs_normalsample>.strelka.somatic_indels.vcf.gz.tbi
    • VCF with tabix index with all somatic indels inferred in the tumor sample.
  • <tumorsample_vs_normalsample>.strelka.somatic_snvs.vcf.gz and <tumorsample_vs_normalsample>.strelka.somatic_snvs.vcf.gz.tbi
    • VCF with tabix index with all somatic SNVs inferred in the tumor sample.

Lofreq

Lofreq is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing, which are usually ignored by other methods or only used for filtering. For further reading and documentation see the Lofreq user guide.

Output files for tumor-only samples

Output directory: {outdir}/variantcalling/lofreq/<sample>/

  • <tumorsample>.vcf.gz
    • VCF which provides a detailed description of the detected genetic variants.

Structural Variants

Manta

Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta provides a candidate list for small indels that can be fed to Strelka following Strelka Best Practices. For further reading and documentation see the Manta user guide.

Output files for normal samples

Output directory: {outdir}/variantcalling/manta/<sample>/

  • <sample>.manta.diploid_sv.vcf.gz and <sample>.manta.diploid_sv.vcf.gz.tbi
    • VCF with tabix index containing SVs and indels scored and genotyped under a diploid model for the sample.
Output files for tumor-only samples

Output directory: {outdir}/variantcalling/manta/<sample>/

  • <sample>.manta.tumor_sv.vcf.gz and <sample>.manta.tumor_sv.vcf.gz.tbi
    • VCF with tabix index containing a subset of the candidateSV.vcf.gz file after removing redundant candidates and small indels less than the minimum scored variant size (50 by default). The SVs are not scored, but include additional details: (1) paired and split read supporting evidence counts for each allele (2) a subset of the filters from the scored tumor-normal model are applied to the single tumor case to improve precision.
Output files for tumor/normal paired samples

Output directory: {outdir}/variantcalling/manta/<tumorsample_vs_normalsample>/

  • <tumorsample_vs_normalsample>.manta.diploid_sv.vcf.gz and <tumorsample_vs_normalsample>.manta.diploid_sv.vcf.gz.tbi
    • VCF with tabix index containing SVs and indels scored and genotyped under a diploid model for the sample. In the case of a tumor/normal subtraction, the scores in this file do not reflect any information from the tumor sample.
  • <tumorsample_vs_normalsample>.manta.somatic_sv.vcf.gz and <tumorsample_vs_normalsample>.manta.somatic_sv.vcf.gz.tbi
    • VCF with tabix index containing SVs and indels scored under a somatic variant model.

TIDDIT

TIDDIT identifies intra- and inter-chromosomal translocations, deletions, tandem duplications and inversions. For further reading and documentation see the TIDDIT manual.

Output files for normal and tumor-only samples

Output directory: {outdir}/variantcalling/tiddit/<sample>/

  • <sample>.tiddit.vcf.gz and <sample>.tiddit.vcf.gz.tbi
    • VCF with tabix index containing SV calls
  • <sample>.tiddit.ploidies.tab
    • tab file describing the estimated ploidy and coverage across each contig
Output files for tumor/normal paired samples

Output directory: {outdir}/variantcalling/tiddit/<tumorsample_vs_normalsample>/

  • <tumorsample_vs_normalsample>.tiddit.normal.vcf.gz and <tumorsample_vs_normalsample>.tiddit.normal.vcf.gz.tbi
    • VCF with tabix index containing SV calls
  • <tumorsample_vs_normalsample>.tiddit.tumor.vcf.gz and <tumorsample_vs_normalsample>.tiddit.tumor.vcf.gz.tbi
    • VCF with tabix index containing SV calls
  • <tumorsample_vs_normalsample>_sv_merge.tiddit.vcf.gz and <tumorsample_vs_normalsample>_sv_merge.tiddit.vcf.gz.tbi
    • merged tumor/normal VCF with tabix index
  • <tumorsample_vs_normalsample>.tiddit.ploidies.tab
    • tab file describing the estimated ploidy and coverage across each contig

Sample heterogeneity, ploidy and CNVs

ASCAT

ASCAT is a software tool for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination). It infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles. The ASCAT process gives several images as output, described in detail in this book chapter. Running ASCAT on NGS data requires that the BAM files are converted into BAF and LogR values. This is done internally using the software AlleleCount. For further reading and documentation see the ASCAT manual.

Output files for tumor/normal paired samples

Output directory: {outdir}/variantcalling/ascat/<tumorsample_vs_normalsample>/

  • <tumorsample_vs_normalsample>.tumour.ASCATprofile.png
    • image with information about allele-specific copy number profile
  • <tumorsample_vs_normalsample>.tumour.ASPCF.png
    • image with information about allele-specific copy number segmentation
  • <tumorsample_vs_normalsample>.before_correction_Tumour.<tumorsample_vs_normalsample>.tumour.png
    • image with information about raw profile of tumor sample of logR and BAF values before GC correction
  • <tumorsample_vs_normalsample>.before_correction_Tumour.<tumorsample_vs_normalsample>.germline.png
    • image with information about raw profile of normal sample of logR and BAF values before GC correction
  • <tumorsample_vs_normalsample>.after_correction_GC_Tumour.<tumorsample_vs_normalsample>.tumour.png
    • image with information about GC and RT corrected logR and BAF values of tumor sample after GC correction
  • <tumorsample_vs_normalsample>.after_correction_GC_Tumour.<tumorsample_vs_normalsample>.germline.png
    • image with information about GC and RT corrected logR and BAF values of normal sample after GC correction
  • <tumorsample_vs_normalsample>.tumour.sunrise.png
    • image visualising the range of ploidy and tumor percentage values
  • <tumorsample_vs_normalsample>.metrics.txt
    • file with information about different metrics from ASCAT profiles
  • <tumorsample_vs_normalsample>.cnvs.txt
    • file with information about CNVs
  • <tumorsample_vs_normalsample>.purityploidy.txt
    • file with information about purity and ploidy
  • <tumorsample_vs_normalsample>.segments.txt
    • file with information about copy number segments
  • <tumorsample_vs_normalsample>.tumour_tumourBAF.txt and <tumorsample_vs_normalsample>.tumour_normalBAF.txt
    • file with beta allele frequencies
  • <tumorsample_vs_normalsample>.tumour_tumourLogR.txt and <tumorsample_vs_normalsample>.tumour_normalLogR.txt
    • file with total copy number on a logarithmic scale

The text file <tumorsample_vs_normalsample>.cnvs.txt contains predictions about copy number state for all the segments. The output is a tab delimited text file with the following columns:

  • chr: chromosome number
  • startpos: start position of the segment
  • endpos: end position of the segment
  • nMajor: number of copies of one of the alleles (for example, the chromosome inherited from one parent)
  • nMinor: number of copies of the other allele (for example, the chromosome inherited from the other parent)

The file <tumorsample_vs_normalsample>.cnvs.txt contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor =1) and those corresponding to copy number aberrations.

CNVKit

CNVKit is a toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina. For further reading and documentation, see the CNVKit Documentation

Output files for normal and tumor-only samples

Output directory: {outdir}/variantcalling/cnvkit/<sample>/

  • <sample>.antitargetcoverage.cnn
    • file containing coverage information
  • <sample>.targetcoverage.cnn
    • file containing coverage information
  • <sample>-diagram.pdf
    • file with plot of copy numbers or segments on chromosomes
  • <sample>-scatter.png
    • file with plot of bin-level log2 coverages and segmentation calls
  • <sample>.bintest.cns
    • file containing copy number segment information
  • <sample>.cnr
    • file containing copy number ratio information
  • <sample>.cns
    • file containing copy number segment information
  • <sample>.call.cns
    • file containing copy number segment information
  • <sample>.genemetrics.tsv
    • file containing per gene copy number information (if input files are annotated)
Output files for tumor/normal samples

Output directory: {outdir}/variantcalling/cnvkit/<tumorsample_vs_normalsample>/

  • <normalsample>.antitargetcoverage.cnn
    • file containing coverage information
  • <normalsample>.targetcoverage.cnn
    • file containing coverage information
  • <tumorsample>.antitargetcoverage.cnn
    • file containing coverage information
  • <tumorsample>.targetcoverage.cnn
    • file containing coverage information
  • <tumorsample>.bintest.cns
    • file containing copy number segment information
  • <tumorsample>-scatter.png
    • file with plot of bin-level log2 coverages and segmentation calls
  • <tumorsample>-diagram.pdf
    • file with plot of copy numbers or segments on chromosomes
  • <tumorsample>.cnr
    • file containing copy number ratio information
  • <tumorsample>.cns
    • file containing copy number segment information
  • <tumorsample>.call.cns
    • file containing copy number segment information
  • <tumorsample>.genemetrics.tsv
    • file containing per gene copy number information (if input files are annotated)

Control-FREEC

Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including loss of heterozygosity (LOH)) using deep-sequencing data. Control-FREEC automatically computes, normalizes and segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH. It also detects subclonal gains and losses and evaluates the most likely average ploidy of the sample. For further reading and documentation see the Control-FREEC Documentation.

Output files for tumor-only and tumor/normal paired samples

Output directory: {outdir}/variantcalling/controlfreec/{tumorsample,tumorsample_vs_normalsample}/

  • config.txt
    • Configuration file used to run Control-FREEC
  • <tumorsample>_BAF.png and <tumorsample_vs_normalsample>_BAF.png
    • image of BAF plot
  • <tumorsample>_ratio.log2.png and <tumorsample_vs_normalsample>_ratio.log2.png
    • image of ratio log2 plot
  • <tumorsample>_ratio.png and <tumorsample_vs_normalsample>_ratio.png
    • image of ratio plot
  • <tumorsample>.bed and <tumorsample_vs_normalsample>.bed
    • translated output to a .BED file (so to view it in the UCSC Genome Browser)
  • <tumorsample>.circos.txt and <tumorsample_vs_normalsample>.circos.txt
    • translated output to the Circos format
  • <tumorsample>.p.value.txt and <tumorsample_vs_normalsample>.p.value.txt
    • CNV file containing p_values for each call
  • <tumorsample>_BAF.txt and <tumorsample_vs_normalsample>.mpileup.gz_BAF.txt
    • file with beta allele frequencies for each possibly heterozygous SNP position
  • <tumorsample_vs_normalsample>.tumor.mpileup.gz_CNVs
    • file with coordinates of predicted copy number alterations
  • <tumorsample>_info.txt and <tumorsample_vs_normalsample>.tumor.mpileup.gz_info.txt
    • parsable file with information about FREEC run
  • <tumorsample>_ratio.BedGraph and <tumorsample_vs_normalsample>.tumor.mpileup.gz_ratio.BedGraph
    • file with ratios in BedGraph format for visualization in the UCSC genome browser. The file contains tracks for normal copy number, gains and losses, and copy neutral LOH (*).
  • <tumorsample>_ratio.txt and <tumorsample_vs_normalsample>.tumor.mpileup.gz_ratio.txt
    • file with ratios and predicted copy number alterations for each window
  • <tumorsample>_sample.cpn and <tumorsample_vs_normalsample>.tumor.mpileup.gz_sample.cpn
    • files with raw copy number profiles for the tumor sample
  • <tumorsample_vs_normalsample>.normal.mpileup.gz_control.cpn
    • files with raw copy number profiles for the control sample
  • GC_profile.<tumorsample>.cpn
    • file with GC-content profile

Microsatellite instability (MSI)

Microsatellite instability is a genetic condition associated with deficiencies in the mismatch repair (MMR) system which causes a tendency to accumulate a high number of mutations (SNVs and indels). An altered distribution of microsatellite length is associated with a missed replication slippage which would be corrected under normal MMR conditions.

MSIsensorPro

MSIsensorPro is a tool to detect the MSI status of a tumor by scanning the length of the microsatellite regions. It requires a normal sample for each tumor to differentiate the somatic and germline cases. For further reading see the MSIsensor paper.

Output files for tumor/normal paired samples

Output directory: {outdir}/variantcalling/msisensor/<tumorsample_vs_normalsample>/

  • <tumorsample_vs_normalsample>
    • MSI score output, contains information about the number of somatic sites.
  • <tumorsample_vs_normalsample>_dis
    • The normal and tumor length distribution for each microsatellite position.
  • <tumorsample_vs_normalsample>_germline
    • Germline sites detected.
  • <tumorsample_vs_normalsample>_somatic
    • Somatic sites detected.

Concatenation

Germline VCFs from DeepVariant, FreeBayes, HaplotypeCaller, Haplotyper, Manta, bcftools mpileup, Strelka, or Tiddit are concatenated with bcftools concat. The field SOURCE is added to the VCF header to report the variant caller.
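
A hedged sketch of per-sample concatenation (the caller VCFs shown are placeholders; inputs are assumed to be bgzipped and indexed):

    # Hedged sketch: concatenate two callers' germline VCFs for one sample,
    # then sort, compress and index the result
    bcftools concat -a \
        sample1.deepvariant.vcf.gz sample1.strelka.variants.vcf.gz \
      | bcftools sort -Oz -o sample1.germline.vcf.gz
    tabix -p vcf sample1.germline.vcf.gz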

Concatenated VCF-files for normal samples

Output directory: {outdir}/variantcalling/concat/<sample>/

  • <sample>.germline.vcf.gz and <sample>.germline.vcf.gz.tbi
    • VCF with tabix index

Variant annotation

This directory contains results from the final annotation steps: two tools are used for annotation, snpEff and VEP. Both results can also be combined by setting --tools merge. All variants present in the called VCF files are annotated. For some variant callers this can mean that the variants are already filtered by PASS, for some this needs to be done during post-processing.

snpEff

snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. The generated VCF header contains the software version and the used command line. For further reading and documentation see the snpEff manual.
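
A hedged sketch of a standalone snpEff run (the database name and file names are placeholders; the pipeline's exact invocation may differ):

    # Hedged sketch: annotate a VCF against a snpEff database and write
    # the CSV stats used for reporting
    snpEff -csvStats sample1.strelka_snpEff.csv GRCh38.105 \
        sample1.strelka.vcf.gz > sample1.strelka_snpEff.ann.vcf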

Output files for all samples

Output directory: {outdir}/annotation/{sample,tumorsample_vs_normalsample}

  • {sample,tumorsample_vs_normalsample}.<variantcaller>_snpEff.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_snpEff.ann.vcf.gz.tbi
    • VCF with tabix index

VEP

VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. The generated VCF header contains the software version, as well as the version numbers of additional databases such as ClinVar or dbSNP used in the VEP line. The format of the consequence annotations is also described in the VCF header line for the INFO field. For further reading and documentation see the VEP manual.

Currently, it contains:

  • Consequence: impact of the variation, if there is any
  • Codons: the codon change, e.g. cGt/cAt
  • Amino_acids: change in amino acids, e.g. R/H if there is any
  • Gene: ENSEMBL gene name
  • SYMBOL: gene symbol
  • Feature: actual transcript name
  • EXON: affected exon
  • PolyPhen: prediction based on PolyPhen
  • SIFT: prediction by SIFT
  • Protein_position: Relative position of amino acid in protein
  • BIOTYPE: Biotype of transcript or regulatory feature

plus any additional field selected via the plugins: dbNSFP, LOFTEE, SpliceAI, SpliceRegion.
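
A hedged sketch of a standalone VEP run producing a bgzipped, annotated VCF (cache location, species and file names are placeholders; plugin options are omitted):

    # Hedged sketch: offline VEP annotation with a local cache
    vep --input_file sample1.strelka.vcf.gz \
        --output_file sample1.strelka_VEP.ann.vcf.gz \
        --vcf --compress_output bgzip \
        --offline --cache --dir_cache vep_cache \
        --species homo_sapiens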

Output files for all samples

Output directory: {outdir}/annotation/{sample,tumorsample_vs_normalsample}

  • {sample,tumorsample_vs_normalsample}.<variantcaller>_VEP.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_VEP.ann.vcf.gz.tbi
    • VCF with tabix index

BCFtools annotate

BCFtools annotate is used to add annotations to VCF files. The annotations are added to the INFO column of the VCF file, and the VCF header is updated with the corresponding annotation lines. For further reading and documentation see the BCFtools annotate manual.
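
A hedged sketch (the annotation source and tag name are placeholders):

    # Hedged sketch: copy an INFO tag from an external annotation VCF into
    # the sample VCF, then index the result
    bcftools annotate \
        -a annotations.vcf.gz \
        -c INFO/TAG \
        -Oz -o sample1.strelka_bcf.ann.vcf.gz \
        sample1.strelka.vcf.gz
    tabix -p vcf sample1.strelka_bcf.ann.vcf.gz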

Output files for all samples
  • {sample,tumorsample_vs_normalsample}.<variantcaller>_bcf.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_bcf.ann.vcf.gz.tbi
    • VCF with tabix index

Quality control and reporting

Quality control

FastQC

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

The plots display:

  • Sequence counts for each sample.
  • Sequence Quality Histograms: The mean quality value across each base position in the read.
  • Per Sequence Quality Scores: The number of reads with average quality scores. Shows if a subset of reads has poor quality.
  • Per Base Sequence Content: The proportion of each base position for which each of the four normal DNA bases has been called.
  • Per Sequence GC Content: The average GC content of reads. A normal random library typically has a roughly normal distribution of GC content.
  • Per Base N Content: The percentage of base calls at each position for which an N was called.
  • Sequence Length Distribution.
  • Sequence Duplication Levels: The relative level of duplication found for each sequence.
  • Overrepresented sequences: The total amount of overrepresented sequences found in each library.
  • Adapter Content: The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.
Output files for all samples

Output directory: {outdir}/reports/fastqc/<sample-lane>

  • <sample-lane_1>_fastqc.html and <sample-lane_2>_fastqc.html
    • FastQC report containing quality metrics for your untrimmed raw FastQ files
  • <sample-lane_1>_fastqc.zip and <sample-lane_2>_fastqc.zip
    • Zip archive containing the FastQC report, tab-delimited data file and plot images
Note

The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality.

FastP

FastP is a tool designed to provide all-in-one preprocessing for FastQ files and is used for trimming and splitting. The tool then determines QC metrics for the processed reads.

Output files for all samples

Output directory: {outdir}/reports/fastp/<sample>

  • <sample-lane>_fastp.html
    • report in HTML format
  • <sample-lane>_fastp.json
    • report in JSON format
  • <sample-lane>_fastp.log
    • FastQ log file

Mosdepth

Mosdepth reports information for the evaluation of the quality of the provided alignment data. In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced. For further reading and documentation see the Mosdepth documentation.

Plots will show:

  • cumulative coverage distribution
  • absolute coverage distribution
  • average coverage per contig/chromosome
Output files for all samples

Output directory: {outdir}/reports/mosdepth/<sample>

  • <sample>.{sorted,md,recal}.mosdepth.global.dist.txt
    • file used by MultiQC, if .region file does not exist
  • <sample>.{sorted,md,recal}.mosdepth.region.dist.txt
  • <sample>.{sorted,md,recal}.mosdepth.summary.txt
    • A summary of mean depths per chromosome and within specified regions per chromosome.
  • <sample>.{sorted,md,recal}.{per-base,regions}.bed.gz
    • per-base depth for targeted data, per-window (500bp) depth of WGS
  • <sample>.{sorted,md,recal}.regions.bed.gz.csi
    • CSI index for per-base depth for targeted data, per-window (500bp) depth of WGS

NGSCheckMate

NGSCheckMate is a tool for determining whether samples come from the same genetic individual, using a set of commonly heterozygous SNPs. This enables the detection of sample mislabelling events. The output includes a text file indicating whether samples have matched or not according to the algorithm, as well as a dendrogram visualising these results.

Output files for all samples

Output directory: {outdir}/reports/ngscheckmate/

  • ngscheckmate_all.txt
    • Tab delimited text file listing all the comparisons made, whether they were considered as a match, with the correlation and a normalised depth.
  • ngscheckmate_matched.txt
    • Tab delimited text file listing only the comparisons that were considered to match, with the correlation and a normalised depth.
  • ngscheckmate_output_corr_matrix.txt
    • Tab delimited text file containing a matrix of all correlations for all comparisons made.
  • vcfs/<sample>.vcf.gz
    • Set of vcf files for each sample. Contains calls for the set of SNP positions used to calculate sample relatedness.

GATK MarkDuplicates reports

More information in the GATK MarkDuplicates section

Duplicates can arise during sample preparation, e.g. during library construction using PCR. Duplicate reads can also result from a single amplification cluster incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument; these duplication artifacts are referred to as optical duplicates. If GATK MarkDuplicates is used, the metrics file generated by the tool is used; if GATK MarkDuplicatesSpark is used, the report is generated by GATK4 EstimateLibraryComplexity on the mapped BAM files. For further reading and documentation see the MarkDuplicates manual.

The plot will show:

  • duplication statistics
Output files for all samples

Output directory: {outdir}/reports/markduplicates/<sample>

  • <sample>.md.cram.metrics

Sentieon Dedup reports

Sentieon’s DNAseq subroutine Dedup produces a metrics report much like the one produced by GATK’s MarkDuplicates. The Dedup metrics are imported into MultiQC as custom content and displayed in a table.

Output files for all samples

Output directory: {outdir}/reports/sentieon_dedup/<sample>

  • <sample>.dedup.cram.metrics

samtools stats

samtools stats collects statistics from CRAM files and outputs in a text format. For further reading and documentation see the samtools manual.

The plots will show:

  • Alignment metrics.
Output files for all samples

Output directory: {outdir}/reports/samtools/<sample>

  • <sample>.{sorted,md,recal}.samtools.stats.out
    • Raw statistics used by MultiQC

bcftools stats

bcftools stats produces a statistics text file which is suitable for machine processing and can be plotted using plot-vcfstats. For further reading and documentation see the bcftools stats manual.

Plots will show:

  • Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc.
  • Note: When using Strelka, there will be no depth distribution plot, as Strelka does not report the INFO/DP field
Output files for all samples

Output directory: {outdir}/reports/bcftools/

  • <sample>.<variantcaller>.bcftools_stats.txt
    • Raw statistics used by MultiQC

VCFtools

VCFtools is a program package designed for working with VCF files. For further reading and documentation see the VCFtools manual.

Plots will show:

  • the summary counts of each type of transition to transversion ratio for each FILTER category.
  • the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs).
  • the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs).
Output files for all samples

Output directory: {outdir}/reports/vcftools/

  • <sample>.<variantcaller>.FILTER.summary
    • Raw statistics used by MultiQC with a summary of the number of SNPs and Ts/Tv ratio for each FILTER category
  • <sample>.<variantcaller>.TsTv.count
    • Raw statistics used by MultiQC with the Transition / Transversion ratio as a function of alternative allele count. Only uses bi-allelic SNPs.
  • <sample>.<variantcaller>.TsTv.qual
    • Raw statistics used by MultiQC with Transition / Transversion ratio as a function of SNP quality threshold. Only uses bi-allelic SNPs.

snpEff reports

snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. For further reading and documentation see the snpEff manual.

The plots will show:

  • locations of detected variants in the genome and the number of variants for each location.
  • the putative impact of detected variants and the number of variants for each impact.
  • the effect of variants at protein level and the number of variants for each effect type.
  • the quantity as function of the variant quality score.
Output files for all samples

Output directory: {outdir}/reports/SnpEff/{sample,tumorsample_vs_normalsample}/<variantcaller>/

  • <sample>.<variantcaller>_snpEff.csv
  • <sample>.<variantcaller>_snpEff.html
    • Statistics to be visualised with a web browser
  • <sample>.<variantcaller>_snpEff.genes.txt
    • TXT (tab separated) summary counts for variants affecting each transcript and gene

VEP reports

VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. For further reading and documentation see the VEP manual.

Output files for all samples

Output directory: {outdir}/reports/EnsemblVEP/{sample,tumorsample_vs_normalsample}/<variantcaller>/

  • <sample>.<variantcaller>_VEP.summary.html
    • Summary of the VEP run to be visualised with a web browser

Reporting

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collect pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report_<timestamp>.html, execution_timeline_<timestamp>.html, execution_trace_<timestamp>.txt, pipeline_dag_<timestamp>.dot/pipeline_dag_<timestamp>.svg and manifest_<timestamp>.bco.json.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Parameters used by the pipeline run: params_<timestamp>.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Reference files

Contains reference folders generated by the pipeline. These files are only published if --save_reference is set.

Output files
  • bwa/
    • Index corresponding to the BWA aligner
  • bwamem2/
    • Index corresponding to the BWA-mem2 aligner
  • cnvkit/
    • Reference files generated by CNVKit
  • dragmap/
    • Index corresponding to the DragMap aligner
  • dbsnp/
    • Tabix index generated by Tabix from the given dbsnp file
  • dict/
  • fai/
  • germline_resource/
    • Tabix index generated by Tabix from the given germline resource file
  • intervals/
    • Bed files in various stages: .bed, .bed.gz, .bed per chromosome, .bed.gz per chromosome
  • known_indels/
    • Tabix index generated by Tabix from the given known indels file
  • msi/
    • MSIsensorPro scan of the reference genome to get microsatellite information
  • pon/
    • Tabix index generated by Tabix from the given panel-of-normals file