nf-core/sarek
Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing

Launch version 3.9.0 https://github.com/nf-core/sarek

Output

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Directory Structure
Preprocessing
Variant Calling
Post variant calling
Variant annotation
Quality control and reporting
Reference files

Directory Structure

The default directory structure is as follows:

{outdir}
├── csv
├── multiqc
├── pipeline_info
├── preprocessing
│   ├── markduplicates
│   │   └── <sample>
│   ├── recal_table
│   │   └── <sample>
│   └── recalibrated
│       └── <sample>
├── reference
└── reports
    ├── <tool1>
    └── <tool2>
work/
.nextflow.log
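
As a rough illustration of the layout above, the following sketch builds the per-sample preprocessing directories for one sample. The function name and the values `results` and `sampleA` are hypothetical and only serve the example; they are not part of the pipeline.

```python
from pathlib import PurePosixPath

# Sub-directories of preprocessing/ shown in the tree above.
PREPROCESSING_STEPS = ("markduplicates", "recal_table", "recalibrated")


def preprocessing_dirs(outdir: str, sample: str) -> list[str]:
    """Return the per-sample preprocessing directories, relative to outdir."""
    base = PurePosixPath(outdir) / "preprocessing"
    return [str(base / step / sample) for step in PREPROCESSING_STEPS]


# "results" and "sampleA" are hypothetical values.
for path in preprocessing_dirs("results", "sampleA"):
    print(path)
```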

Preprocessing

Sarek pre-processes raw FastQ files or unmapped BAM files, based on GATK best practices.

Preparation of input files (FastQ or (u)BAM)

FastP is a tool designed to provide all-in-one preprocessing for FastQ files and as such is used for trimming and splitting. By default, these files are not published. However, if publishing is enabled, please be aware that these files are only published once, meaning if trimming and splitting is enabled, then the resulting files will be sharded FastQ files with trimmed reads. If only one of them is enabled then the files contain either trimmed or split reads, respectively.

Clip and filter read length

FastP enables efficient clipping of reads from either the 5’ end (--clip_r1, --clip_r2) or the 3’ end (--three_prime_clip_r1, --three_prime_clip_r2). Additionally, FastP allows the filtering of reads based on insert size by specifying a minimum required length with the --length_required parameter (default: 15bp). It is recommended to optimize these parameters according to the specific characteristics of your data.
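
To make the effect of clipping and length filtering concrete, here is a minimal Python sketch of the idea. It is not FastP's implementation: the function name and arguments are hypothetical, and only the default minimum length of 15 is taken from the text above.

```python
from typing import Optional


def clip_and_filter(
    seq: str, clip_5p: int = 0, clip_3p: int = 0, length_required: int = 15
) -> Optional[str]:
    """Clip bases from both ends, then drop the read if it is too short.

    Returns the clipped sequence, or None when the read is filtered out.
    """
    end = len(seq) - clip_3p
    # Guard against clipping more bases than the read contains.
    clipped = seq[clip_5p:end] if end > clip_5p else ""
    return clipped if len(clipped) >= length_required else None


read = "ACGT" * 5  # a 20 bp toy read
kept = clip_and_filter(read, clip_5p=2, clip_3p=3)     # 15 bp remain, read is kept
dropped = clip_and_filter(read, clip_5p=2, clip_3p=4)  # 14 bp remain, read is dropped
```

The sketch shows why aggressive clipping and the length threshold interact: each clipped base brings short reads closer to being discarded.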

Trim adapters

FastP supports global trimming, which means it trims all reads in the front or the tail. This function is useful since sometimes you want to drop some cycles of a sequencing run. In the current implementation in Sarek --detect_adapter_for_pe is set by default which enables auto-detection of adapter sequences. For more information on how to fine-tune adapter trimming, take a look into the parameter docs.

The resulting files are intermediate and by default not kept in the final files delivered to users. Set --save_trimmed to enable publishing of the files in:

Output files for all samples

Output directory: {outdir}/preprocessing/fastp/<sample>

<sample>_<lane>_{1,2}.fastp.fastq.gz
- Bgzipped FastQ file

Split FastQ files

FastP supports splitting one FastQ file into multiple files, allowing parallel alignment of the sharded FastQ files. To enable splitting, the number of reads per output file can be specified. For more information, take a look at the parameter --split_fastq in the parameter docs.

These files are intermediate and, by default, are not kept in the output folder delivered to users. Set --save_split to enable publishing of these files to:
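
The sharding idea can be sketched in a few lines of Python. This is an illustration only, not how FastP splits files internally; `shard_reads` and its arguments are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def shard_reads(reads: Iterable, reads_per_shard: int) -> Iterator[List]:
    """Yield consecutive lists holding at most reads_per_shard reads each."""
    if reads_per_shard <= 0:
        raise ValueError("reads_per_shard must be positive")
    iterator = iter(reads)
    while True:
        shard = list(islice(iterator, reads_per_shard))
        if not shard:
            return
        yield shard


# Five toy reads split into shards of two reads; the last shard is smaller.
shards = list(shard_reads(["r1", "r2", "r3", "r4", "r5"], 2))
```

Each shard can then be aligned independently, which is what makes the parallel alignment described above possible.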

Output files for all samples

Output directory: {outdir}/preprocessing/fastp/<sample>/

<sample_lane_{1,2}.fastp.fastq.gz>
- Bgzipped FastQ file

UMI consensus

Sarek can create consensus reads when Unique Molecular Identifiers (UMIs) exist, using fgbio tools. Please note that if your UMIs are part of additional index fastq files then you can use nf-core/fastquorum to process them.

These files are intermediate and, by default, are not kept in the output folder delivered to users. Set --save_split to enable publishing of these files to:

Output files for all samples

Output directory: {outdir}/preprocessing/umi/<sample>/

<sample_lane_{1,2}.umi-consensus.bam>

Output directory: {outdir}/reports/umi/

<sample_lane_{1,2}_umi_histogram.txt>

BBSplit contamination removal

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special “ambiguous” file for each of them.

This functionality would be especially useful, for example, if you have mouse PDX samples that contain a mixture of human and mouse genomic DNA/RNA and you would like to filter out any mouse derived reads.

The BBSplit index will have to be built at least once with this pipeline by providing --bbsplit_fasta_list which has to be a file containing 2 columns: short name and full path to reference genome(s):

mm10,/path/to/mm10.fa
ecoli,/path/to/ecoli.fa
sarscov2,/path/to/sarscov2.fa
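
A file in this two-column format could be generated programmatically, for example with Python's csv module. This is a minimal sketch; the function name is hypothetical and the paths are the placeholder paths from the example above.

```python
import csv
import io


def bbsplit_fasta_list(references: dict[str, str]) -> str:
    """Render 'short name,full path' lines, one per reference genome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for short_name, fasta_path in references.items():
        writer.writerow([short_name, fasta_path])
    return buffer.getvalue()


text = bbsplit_fasta_list(
    {"mm10": "/path/to/mm10.fa", "ecoli": "/path/to/ecoli.fa"}
)
# Write `text` to a file and pass that file to --bbsplit_fasta_list.
```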

You can save the index by using the --save_reference parameter and then provide it via --bbsplit_index for future runs. To enable the tool, add --tools bbsplit to the run parameters. As described under Output files below, the FastQ files relative to the main reference genome will always be called *primary*.fastq.gz.

By default, the following parameters are used for BBSplit: ambiguous2=best maxindel=150000. To overwrite these parameters, use a custom config, as described here.

Output files

preprocessing/bbsplit/
- *.fastq.gz: If --save_bbsplit_reads is specified FastQ files split by reference will be saved to the results directory. Reads from the main reference genome will be named “primary.fastq.gz”. Reads from contaminating genomes will be named “<SHORT_NAME>.fastq.gz” where <SHORT_NAME> is the first column in --bbsplit_fasta_list that needs to be provided to initially build the index.
- *.txt: File containing statistics on how many reads were assigned to each reference.

Map to Reference

BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.

BWA-mem2

BWA-mem2 is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.

DragMap

DragMap is an open-source software implementation of the DRAGEN mapper, which the Illumina team created so that we would have an open-source way to produce the same results as their proprietary DRAGEN hardware. The aligned reads are then coordinate-sorted (or name-sorted if GATK MarkDuplicatesSpark is used for duplicate marking) with samtools.

These files are intermediate and, by default, are not kept in the output folder delivered to users. Set --save_mapped to enable publishing; additionally, add the flag --save_output_as_bam to publish in BAM format.

Sentieon BWA mem

Sentieon bwa mem is a subroutine for mapping low-divergent sequences against a large reference genome. It is part of the proprietary software package DNAseq from Sentieon.

The aligned reads are coordinate-sorted with Sentieon.

Output files for all mappers and samples

The alignment files (BAM or CRAM) produced by the chosen aligner are not published by default. CRAM output files will not be saved in the output-folder (outdir), unless the flag --save_mapped is used. BAM output can be selected by setting the flag --save_output_as_bam.

Output directory: {outdir}/preprocessing/mapped/<sample>/

if --save_mapped: <sample>.sorted.cram and <sample>.sorted.cram.crai
- CRAM file and index
if --save_mapped --save_output_as_bam: <sample>.sorted.bam and <sample>.sorted.bam.bai
- BAM file and index
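
The publishing rules above can be summarised as a small decision function. This is a sketch of the documented behaviour rather than pipeline code; the function and argument names are hypothetical and simply mirror the two flags.

```python
def mapped_outputs(
    sample: str, save_mapped: bool = False, save_output_as_bam: bool = False
) -> list[str]:
    """List the alignment files published under preprocessing/mapped/<sample>/."""
    if not save_mapped:
        # Alignment files are intermediate and not published by default.
        return []
    if save_output_as_bam:
        return [f"{sample}.sorted.bam", f"{sample}.sorted.bam.bai"]
    return [f"{sample}.sorted.cram", f"{sample}.sorted.cram.crai"]


# "sampleA" is a hypothetical sample name.
print(mapped_outputs("sampleA", save_mapped=True))
```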

Mark Duplicates

During duplicate marking, read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artificial processes are identified. These are considered to be non-independent observations, so all but a single read pair within each set of duplicates are marked, causing the marked pairs to be ignored by default during the variant discovery process.
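
The principle can be illustrated with a heavily simplified Python sketch. It is not the algorithm used by GATK or Sentieon, which consider more information (for example read orientation and clipping). The sketch assumes that read pairs with identical chromosome, start and end form one duplicate set, and that a hypothetical `qual_sum` score decides which pair is left unmarked.

```python
from collections import defaultdict


def mark_duplicates(pairs: list[dict]) -> set[str]:
    """Return the names of read pairs that would be marked as duplicates.

    Each pair is a dict with the keys: name, chrom, start, end, qual_sum.
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["chrom"], pair["start"], pair["end"])].append(pair)

    marked = set()
    for group in groups.values():
        # Keep the single best-scoring pair, mark all others in the set.
        best = max(group, key=lambda pair: pair["qual_sum"])
        marked.update(pair["name"] for pair in group if pair is not best)
    return marked


pairs = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 400, "qual_sum": 900},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 400, "qual_sum": 950},
    {"name": "c", "chrom": "chr1", "start": 250, "end": 600, "qual_sum": 800},
]
duplicates = mark_duplicates(pairs)  # only "a" is marked
```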

For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.

GATK MarkDuplicates (Spark)

By default, Sarek will use GATK MarkDuplicates.

To use the corresponding Spark implementation GATK MarkDuplicatesSpark, please specify --use_gatk_spark markduplicates. The resulting files are converted to CRAM with either samtools, when GATK MarkDuplicates is used, or, implicitly, by GATK MarkDuplicatesSpark.

The resulting CRAM files are delivered to the users.

Output files for all samples

Output directory: {outdir}/preprocessing/markduplicates/<sample>/

<sample>.md.cram and <sample>.md.cram.crai
- CRAM file and index
if --save_output_as_bam:
- <sample>.md.bam and <sample>.md.bam.bai

Sentieon LocusCollector and Dedup

The subroutines LocusCollector and Dedup are part of Sentieon DNAseq packages with speedup versions of the standard GATK tools, and together those two subroutines correspond to GATK’s MarkDuplicates.

The subroutine LocusCollector collects read information that will be used for removing or tagging duplicate reads; its output is the score file indicating which reads are likely duplicates.

The subroutine Dedup marks or removes duplicate reads based on the score file supplied by LocusCollector, and produces a BAM or CRAM file.

Output files for all samples

Output directory: {outdir}/preprocessing/sentieon_dedup/<sample>/

<sample>.dedup.cram and <sample>.dedup.cram.crai
- CRAM file and index
if --save_output_as_bam:
- <sample>.dedup.bam and <sample>.dedup.bam.bai

Base Quality Score Recalibration

During Base Quality Score Recalibration, systematic errors in the base quality scores are corrected by applying machine learning to detect and correct for them. This is important for evaluating the correct call of a variant during the variant discovery process. However, this is not needed for all combinations of tools in Sarek. Notably, this should be turned off when having UMI tagged reads or using DragMap (see here) as mapper.

For further reading and documentation see the technical documentation by GATK.

GATK BaseRecalibrator (Spark)

GATK BaseRecalibrator generates a recalibration table based on various co-variates.

To use the corresponding Spark implementation GATK BaseRecalibratorSpark, please specify --use_gatk_spark baserecalibrator.

Output files for all samples

Output directory: {outdir}/preprocessing/recal_table/<sample>/

<sample>.recal.table
- Recalibration table associated to the duplicates-marked CRAM file.

GATK ApplyBQSR (Spark)

GATK ApplyBQSR recalibrates the base qualities of the input reads based on the recalibration table produced by the GATK BaseRecalibrator tool.

Specify --use_gatk_spark baserecalibrator to use GATK ApplyBQSRSpark, the corresponding Spark implementation, instead.

The resulting recalibrated CRAM files are delivered to the user. Recalibrated CRAM files are usually 2-3 times larger than the duplicate-marked CRAM files.

Output files for all samples

Output directory: {outdir}/preprocessing/recalibrated/<sample>/

<sample>.recal.cram and <sample>.recal.cram.crai
- CRAM file and index
if --save_output_as_bam:
- <sample>.recal.bam and <sample>.recal.bam.bai - BAM file and index

Parabricks FQ2BAM

Note

This is an experimental addition to the pipeline which is not at feature parity with the GATK implementation.

Parabricks FQ2BAM runs as an alternative to GATK preprocessing and is enabled by --aligner parabricks --profile <docker/singularity>,gpu.

The resulting recalibrated BAM (if --save_output_as_bam) or CRAM files are delivered to the user (if --save_reference).

Output files for all samples

Output directory: {outdir}/preprocessing/parabricks/<sample>/

<sample>.{bam,cram} and <sample>.{bam.bai,cram.crai}
- BAM or CRAM file and index

CSV files

The CSV files are auto-generated and can be used by Sarek for further processing and/or variant calling.

See the input section in the usage documentation for further reading and documentation on how to make the most of them.

Output files:

Output directory: {outdir}/preprocessing/csv

mapped.csv
- if --save_mapped
- CSV containing an entry for each sample with the columns patient,sample,sex,status,bam,bai
markduplicates_no_table.csv
- CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai
markduplicates.csv
- CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai,table
recalibrated.csv
- CSV containing an entry for each sample with the columns patient,sample,sex,status,cram,crai
variantcalled.csv
- CSV containing an entry for each sample with the columns patient,sample,vcf
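
As an illustration of how these files could be consumed by custom scripts, the sketch below reads a file with the recalibrated.csv columns and checks that all of them are present. It assumes the first row is a header line with the column names; the function name and the example row are hypothetical.

```python
import csv
import io

# Columns documented above for recalibrated.csv.
RECALIBRATED_COLUMNS = ["patient", "sample", "sex", "status", "cram", "crai"]


def read_recalibrated_csv(handle) -> list[dict]:
    """Parse the CSV and fail early if a documented column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in RECALIBRATED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical file content; a real file would be opened with open(...).
example = io.StringIO(
    "patient,sample,sex,status,cram,crai\n"
    "patient1,sampleA,XX,0,sampleA.recal.cram,sampleA.recal.cram.crai\n"
)
rows = read_recalibrated_csv(example)
```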

Variant Calling

The results regarding variant calling are collected in {outdir}/variant_calling/. If some results from a variant caller do not appear here, please check out the --tools section in the parameter documentation.

(Recalibrated) CRAM files can be used as an input to start the variant calling.

SNVs and small indels

For single nucleotide variants (SNVs) and small indels, multiple tools are available for normal (germline), tumor-only, and tumor-normal (somatic) paired data. For a list of the appropriate tool(s) for the data and sequencing type at hand, please check here.

bcftools

bcftools mpileup generates a pileup of a CRAM file, followed by bcftools call; the calls are then filtered with -i 'count(GT=="RR")==0'. For further reading and documentation see the bcftools manual.
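
The filter expression keeps only sites at which no sample has a homozygous-reference genotype ("RR"). The following Python sketch mimics that logic on plain genotype strings. It is a simplified stand-in for the bcftools expression engine: it assumes diploid genotypes written like 0/1 or 0|0, and the function names are hypothetical.

```python
def is_hom_ref(genotype: str) -> bool:
    """True for diploid genotypes in which both alleles are the reference (0)."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and all(allele == "0" for allele in alleles)


def keep_site(genotypes: list[str]) -> bool:
    """Mirror of count(GT=="RR")==0: keep sites without hom-ref genotypes."""
    return sum(is_hom_ref(genotype) for genotype in genotypes) == 0


keep_site(["0/1"])         # heterozygous call: the site is kept
keep_site(["0/0"])         # hom-ref call: the site is removed
keep_site(["0/1", "0|0"])  # one hom-ref sample is enough to remove the site
```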

Output files for all samples

Output directory: {outdir}/variant_calling/bcftools/<sample>/

<sample>.bcftools.vcf.gz and <sample>.bcftools.vcf.gz.tbi
- VCF with tabix index

DeepVariant

DeepVariant is a deep learning-based variant caller that takes aligned reads, produces pileup image tensors from them, classifies each tensor using a convolutional neural network and finally reports the results in a standard VCF or gVCF file. For further documentation take a look here.

Output files for normal samples

Output directory: {outdir}/variant_calling/deepvariant/<sample>/

<sample>.deepvariant.vcf.gz and <sample>.deepvariant.vcf.gz.tbi
- VCF with tabix index
<sample>.deepvariant.g.vcf.gz and <sample>.deepvariant.g.vcf.gz.tbi
- gVCF with tabix index

FreeBayes

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. For further reading and documentation see the FreeBayes manual.

Output files for all samples

Output directory: {outdir}/variant_calling/freebayes/{sample,normalsample_vs_tumorsample}/

<sample>.freebayes.vcf.gz and <sample>.freebayes.vcf.gz.tbi
- VCF with tabix index

GATK HaplotypeCaller

GATK HaplotypeCaller calls germline SNPs and indels via local re-assembly of haplotypes.

Output files for normal samples

Output directory: {outdir}/variant_calling/haplotypecaller/<sample>/

<sample>.haplotypecaller.vcf.gz and <sample>.haplotypecaller.vcf.gz.tbi
- VCF with tabix index

GATK Germline Single Sample Variant Calling

GATK Single Sample Variant Calling uses HaplotypeCaller in its default single-sample mode to call variants. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so the calls are filtered by first running the CNNScoreVariants tool. This tool annotates each variant with a score indicating the model’s prediction of the quality of each variant. To apply filters based on those scores, run the FilterVariantTranches tool with SNP and INDEL sensitivity tranches appropriate for your task.

If the haplotype-called VCF files are not filtered, then Sarek should be run with at least one of the options --dbsnp or --known_indels.

Output files for normal samples

Output directory: {outdir}/variant_calling/haplotypecaller/<sample>/

<sample>.haplotypecaller.filtered.vcf.gz and <sample>.haplotypecaller.filtered.vcf.gz.tbi
- VCF with tabix index

GATK Joint Germline Variant Calling

GATK Joint Germline Variant Calling runs HaplotypeCaller per sample in gVCF mode. Next, the gVCFs from multiple samples are consolidated into a GenomicsDB datastore. After joint genotyping, VQSR is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Output files from joint germline variant calling

Output directory: {outdir}/variant_calling/haplotypecaller/<sample>/

<sample>.haplotypecaller.g.vcf.gz and <sample>.haplotypecaller.g.vcf.gz.tbi
- gVCF with tabix index

Output directory: {outdir}/variant_calling/haplotypecaller/joint_variant_calling/

joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
- VCF with tabix index
joint_germline_recalibrated.vcf.gz and joint_germline_recalibrated.vcf.gz.tbi
- variant recalibrated VCF with tabix index (if VQSR is applied)

GATK Mutect2

GATK Mutect2 calls somatic SNVs and indels via local assembly of haplotypes. When --joint_mutect2 is used, Mutect2 subworkflow outputs will be saved in a subfolder named with the patient ID and {patient}.mutect2.vcf.gz file will contain variant calls from all of the normal and tumor samples of the patient. For further reading and documentation see the Mutect2 manual. It is not required, but recommended to have a panel of normals (PON) using at least 40 normal samples to get filtered somatic calls. When using --genome GATK.GRCh38, a panel-of-normals file is available. However, it is highly recommended to create one matching your tumor samples. Creating your own panel-of-normals is currently not natively supported by the pipeline. See here for how to create one manually.

Output files for tumor-only and tumor/normal paired samples

Output directory: {outdir}/variant_calling/mutect2/{sample,tumorsample_vs_normalsample,patient}/

Files created:

{sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz and {sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz.tbi
- unfiltered (raw) Mutect2 calls VCF with tabix index
{sample,tumorsample_vs_normalsample,patient}.mutect2.vcf.gz.stats
- a stats file generated during calling of raw variants (needed for filtering)
{sample,tumorsample_vs_normalsample}.mutect2.contamination.table
- table calculating the fraction of reads coming from cross-sample contamination
{sample,tumorsample_vs_normalsample}.mutect2.segmentation.table
- table containing segmentation of the tumor by minor allele fraction
{sample,tumorsample_vs_normalsample,patient}.mutect2.artifactprior.tar.gz
- prior probabilities for read orientation artifacts
{sample,tumorsample,normalsample}.mutect2.pileups.table
- tabulates pileup metrics for inferring contamination
{sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz and {sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz.tbi
- filtered Mutect2 calls VCF with tabix index based on the probability that a variant is somatic
{sample,tumorsample_vs_normalsample,patient}.mutect2.filtered.vcf.gz.filteringStats.tsv
- a stats file generated during the filtering of Mutect2 called variants

Lofreq

Lofreq is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing, which are usually ignored by other methods or only used for filtering. For further reading and documentation see the Lofreq user guide.

Output files for tumor-only samples

Output directory: {outdir}/variant_calling/lofreq/<sample>/

<tumorsample>.vcf.gz
- VCF which provides a detailed description of the detected genetic variants.

MuSE

MuSE is an accurate and ultra-fast somatic mutation calling tool for whole-genome sequencing (WGS) and whole-exome sequencing (WES) data from heterogeneous tumor samples. This tool is unique in accounting for tumor heterogeneity using a sample-specific error model that improves sensitivity and specificity in mutation calling from sequencing data. For further reading see the recently published paper.

Output files for tumor-normal samples

Output directory: {outdir}/variant_calling/muse/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>.MuSE.txt
- TXT containing position-specific summary statistics.
<tumorsample_vs_normalsample>.muse.vcf.gz
- VCF with called variants. Fields are named TUMOR and NORMAL.

Sentieon DNAscope

Sentieon DNAscope is a variant caller that aims to outperform GATK’s HaplotypeCaller in terms of both speed and accuracy. DNAscope allows you to use a machine learning model to perform variant calling with higher accuracy by improving the candidate detection and filtering.

Unfiltered VCF-files for normal samples

Output directory: {outdir}/variant_calling/sentieon_dnascope/<sample>/

<sample>.dnascope.unfiltered.vcf.gz and <sample>.dnascope.unfiltered.vcf.gz.tbi
- VCF with tabix index

The output from Sentieon’s DNAscope can be controlled through the option --sentieon_dnascope_emit_mode for Sarek, see Basic usage of Sentieon functions.

Unless dnascope_filter is listed under --skip_tools in the nextflow command, Sentieon’s DNAModelApply is applied to the unfiltered VCF-files in order to obtain filtered VCF-files.

Filtered VCF-files for normal samples

Output directory: {outdir}/variant_calling/sentieon_dnascope/<sample>/

<sample>.dnascope.filtered.vcf.gz and <sample>.dnascope.filtered.vcf.gz.tbi
- VCF with tabix index

Sentieon DNAscope joint germline variant calling

In Sentieon’s package DNAscope, joint germline variant calling is done by first running Sentieon’s DNAscope in emit-mode gvcf for each sample and then running Sentieon’s GVCFtyper on the set of gVCF-files. See Basic usage of Sentieon functions for information on how joint germline variant calling can be done in Sarek using Sentieon’s DNAscope.

Output files from joint germline variant calling

Output directory: {outdir}/variant_calling/sentieon_dnascope/<sample>/

<sample>.dnascope.g.vcf.gz and <sample>.dnascope.g.vcf.gz.tbi
- gVCF with tabix index

Output directory: {outdir}/variant_calling/sentieon_dnascope/joint_variant_calling/

joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
- VCF with tabix index

Sentieon Haplotyper

Sentieon Haplotyper is Sentieon’s speedup version of GATK’s HaplotypeCaller (see above).

Unfiltered VCF-files for normal samples

Output directory: {outdir}/variant_calling/sentieon_haplotyper/<sample>/

<sample>.haplotyper.unfiltered.vcf.gz and <sample>.haplotyper.unfiltered.vcf.gz.tbi
- VCF with tabix index

The output from Sentieon’s Haplotyper can be controlled through the option --sentieon_haplotyper_emit_mode for Sarek, see Basic usage of Sentieon functions.

Unless haplotyper_filter is listed under --skip_tools in the nextflow command, GATK’s CNNScoreVariants and FilterVariantTranches (see above) are applied to the unfiltered VCF-files in order to obtain filtered VCF-files.

Filtered VCF-files for normal samples

Output directory: {outdir}/variant_calling/sentieon_haplotyper/<sample>/

<sample>.haplotyper.filtered.vcf.gz and <sample>.haplotyper.filtered.vcf.gz.tbi
- VCF with tabix index

Sentieon Haplotyper joint germline variant calling

In Sentieon’s package DNAseq, joint germline variant calling is done by first running Sentieon’s Haplotyper in emit-mode gvcf for each sample and then running Sentieon’s GVCFtyper on the set of gVCF-files. See Basic usage of Sentieon functions for information on how joint germline variant calling can be done in Sarek using Sentieon’s DNAseq. After joint genotyping, Sentieon’s version of VQSR (VarCal and ApplyVarCal) is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Output files from joint germline variant calling

Output directory: {outdir}/variant_calling/sentieon_haplotyper/<sample>/

<sample>.haplotyper.g.vcf.gz and <sample>.haplotyper.g.vcf.gz.tbi
- gVCF with tabix index

Output directory: {outdir}/variant_calling/sentieon_haplotyper/joint_variant_calling/

joint_germline.vcf.gz and joint_germline.vcf.gz.tbi
- VCF with tabix index
joint_germline_recalibrated.vcf.gz and joint_germline_recalibrated.vcf.gz.tbi
- variant recalibrated VCF with tabix index (if VarCal is applied)

Sentieon TNscope

Sentieon TNscope is Sentieon’s proprietary somatic variant and structural variant caller.

VCF-files for tumor-only and tumor/normal samples

Output directory: {outdir}/variant_calling/sentieon_tnscope/<sample>/

<sample>.tnscope.vcf.gz and <sample>.tnscope.vcf.gz.tbi
- VCF with tabix index

Strelka

Strelka is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. For further reading and documentation see the Strelka user guide. If Strelka is used for somatic variant calling and Manta is also specified in tools, the output candidate indels from Manta are used according to Strelka Best Practices. For further downstream analysis, take a look here.

Output files for single samples (normal)

Output directory: {outdir}/variant_calling/strelka/<sample>/

<sample>.strelka.genome.vcf.gz and <sample>.strelka.genome.vcf.gz.tbi
- genome VCF with tabix index
<sample>.strelka.variants.vcf.gz and <sample>.strelka.variants.vcf.gz.tbi
- VCF with tabix index with all potential variant loci across the sample. Note this file includes non-variant loci if they have a non-trivial level of variant evidence or contain one or more alleles for which genotyping has been forced.

Output files for tumor/normal paired samples

Output directory: {outdir}/variant_calling/strelka/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>.strelka.somatic_indels.vcf.gz and <tumorsample_vs_normalsample>.strelka.somatic_indels.vcf.gz.tbi
- VCF with tabix index with all somatic indels inferred in the tumor sample.
<tumorsample_vs_normalsample>.strelka.somatic_snvs.vcf.gz and <tumorsample_vs_normalsample>.strelka.somatic_snvs.vcf.gz.tbi
- VCF with tabix index with all somatic SNVs inferred in the tumor sample.

Structural Variants

indexcov

indexcov quickly estimates coverage from a whole-genome BAM or CRAM index. A BAM index has 16KB resolution, which is used as a coverage estimate. The output is scaled to around 1, so a long stretch with values of 1.5 would be a heterozygous duplication. This is useful as a quick QC to get coverage values across the genome.
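
The arithmetic behind this interpretation can be sketched as follows, under the assumption of a diploid region in which a scaled coverage of 1 corresponds to two copies. The function name is hypothetical and not part of indexcov.

```python
def estimated_copies(scaled_coverage: float, ploidy: int = 2) -> float:
    """Convert scaled coverage (about 1 for a normal region) into a copy count."""
    return scaled_coverage * ploidy


estimated_copies(1.0)  # 2.0 copies: normal diploid region
estimated_copies(1.5)  # 3.0 copies: one extra copy, i.e. a heterozygous duplication
estimated_copies(0.5)  # 1.0 copy: by the same reasoning, a heterozygous deletion
```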

Output directory: {outdir}/variant_calling/indexcov/

In addition to the interactive HTML files, indexcov outputs a number of text files:

<sample>-indexcov.ped: a .ped/.fam file with the inferred sex in the appropriate column if the sex chromosomes were found. It also contains the following columns:
- CNX and CNY: floating-point estimates of copy number for the sex chromosomes.
- bins.out: how many bins had a coverage value outside of (0.85, 1.15). High values can indicate high-bias samples.
- bins.lo: number of bins with value < 0.15. High values indicate missing data.
- bins.hi: number of bins with value > 1.15.
- bins.in: number of bins with value inside of (0.85, 1.15).
- p.out: bins.out/bins.in.
- PC1...PC5: PCA projections calculated with depth of autosomes.
<sample>-indexcov.roc: tab-delimited columns of chrom, scaled coverage cutoff, and $n_samples columns where each indicates the proportion of 16KB blocks at or above that scaled coverage value.
<sample>-indexcov.bed.gz: a bed file with columns of chrom, start, end, and a column per sample where the values indicate the scaled coverage for that sample in that 16KB chunk.
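
The bins.* statistics reported in the .ped file can be reproduced from per-bin scaled coverage values such as those in the bed file. The sketch below uses the thresholds stated above. How indexcov treats values lying exactly on a threshold is not stated here, so the strict comparisons are an assumption, as are the function name and the toy values.

```python
def bin_stats(values: list[float]) -> dict:
    """Summarise scaled coverage bins using the thresholds 0.15, 0.85 and 1.15."""
    bins_in = sum(0.85 < value < 1.15 for value in values)
    bins_out = len(values) - bins_in
    return {
        "bins.in": bins_in,
        "bins.out": bins_out,
        "bins.lo": sum(value < 0.15 for value in values),
        "bins.hi": sum(value > 1.15 for value in values),
        # Ratio of outlier bins to in-range bins; undefined without in-range bins.
        "p.out": bins_out / bins_in if bins_in else float("inf"),
    }


# Toy values: two normal bins, one near-empty bin, one high bin, one low bin.
stats = bin_stats([1.0, 1.0, 0.1, 1.5, 0.5])
```

With these toy values, three of five bins fall outside the expected range, giving a p.out of 1.5, which would point to a noisy or biased sample.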

Manta

Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta provides a candidate list for small indels that can be fed to Strelka following Strelka Best Practices. For further reading and documentation see the Manta user guide.

Output files for normal samples

Output directory: {outdir}/variant_calling/manta/<sample>/

<sample>.manta.diploid_sv.vcf.gz and <sample>.manta.diploid_sv.vcf.gz.tbi
- VCF with tabix index containing SVs and indels scored and genotyped under a diploid model for the sample.

Output files for tumor-only samples

Output directory: {outdir}/variant_calling/manta/<sample>/

<sample>.manta.tumor_sv.vcf.gz and <sample>.manta.tumor_sv.vcf.gz.tbi
- VCF with tabix index containing a subset of the candidateSV.vcf.gz file after removing redundant candidates and small indels less than the minimum scored variant size (50 by default). The SVs are not scored, but include additional details: (1) paired and split read supporting evidence counts for each allele (2) a subset of the filters from the scored tumor-normal model are applied to the single tumor case to improve precision.

Output files for tumor/normal paired samples

Output directory: {outdir}/variant_calling/manta/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>.manta.diploid_sv.vcf.gz and <tumorsample_vs_normalsample>.manta.diploid_sv.vcf.gz.tbi
- VCF with tabix index containing SVs and indels scored and genotyped under a diploid model for the sample. In the case of a tumor/normal subtraction, the scores in this file do not reflect any information from the tumor sample.
<tumorsample_vs_normalsample>.manta.somatic_sv.vcf.gz and <tumorsample_vs_normalsample>.manta.somatic_sv.vcf.gz.tbi
- VCF with tabix index containing SVs and indels scored under a somatic variant model.

TIDDIT

TIDDIT identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions. For further reading and documentation see the TIDDIT manual.

Output files for normal and tumor-only samples

Output directory: {outdir}/variant_calling/tiddit/<sample>/

<sample>.tiddit.vcf.gz and <sample>.tiddit.vcf.gz.tbi
- VCF with tabix index containing SV calls
<sample>.tiddit.ploidies.tab
- tab file describing the estimated ploidy and coverage across each contig

Output files for tumor/normal paired samples

Output directory: {outdir}/variant_calling/tiddit/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>.tiddit.normal.vcf.gz and <tumorsample_vs_normalsample>.tiddit.normal.vcf.gz.tbi
- VCF with tabix index containing SV calls
<tumorsample_vs_normalsample>.tiddit.tumor.vcf.gz and <tumorsample_vs_normalsample>.tiddit.tumor.vcf.gz.tbi
- VCF with tabix index containing SV calls
<tumorsample_vs_normalsample>_sv_merge.tiddit.vcf.gz and <tumorsample_vs_normalsample>_sv_merge.tiddit.vcf.gz.tbi
- merged tumor/normal VCF with tabix index
<tumorsample_vs_normalsample>.tiddit.ploidies.tab
- tab file describing the estimated ploidy and coverage across each contig

Sample heterogeneity, ploidy and CNVs

ASCAT

ASCAT is a software for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination). It infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles. The ASCAT process gives several images as output, described in detail in this book chapter. Running ASCAT on NGS data requires that the BAM files are converted into BAF and LogR values. This is done internally using the software AlleleCount. For further reading and documentation see the ASCAT manual.

Output files for tumor/normal paired samples

Output directory: {outdir}/variant_calling/ascat/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>.tumour.ASCATprofile.png
- image with information about allele-specific copy number profile
<tumorsample_vs_normalsample>.tumour.ASPCF.png
- image with information about allele-specific copy number segmentation
<tumorsample_vs_normalsample>.before_correction_Tumour.<tumorsample_vs_normalsample>.tumour.png
- image with information about raw profile of tumor sample of logR and BAF values before GC correction
<tumorsample_vs_normalsample>.before_correction_Tumour.<tumorsample_vs_normalsample>.germline.png
- image with information about raw profile of normal sample of logR and BAF values before GC correction
<tumorsample_vs_normalsample>.after_correction_GC_Tumour.<tumorsample_vs_normalsample>.tumour.png
- image with information about GC and RT corrected logR and BAF values of tumor sample after GC correction
<tumorsample_vs_normalsample>.after_correction_GC_Tumour.<tumorsample_vs_normalsample>.germline.png
- image with information about GC and RT corrected logR and BAF values of normal sample after GC correction
<tumorsample_vs_normalsample>.tumour.sunrise.png
- image visualising the range of ploidy and tumor percentage values
<tumorsample_vs_normalsample>.metrics.txt
- file with information about different metrics from ASCAT profiles
<tumorsample_vs_normalsample>.cnvs.txt
- file with information about CNVs
<tumorsample_vs_normalsample>.purityploidy.txt
- file with information about purity and ploidy
<tumorsample_vs_normalsample>.segments.txt
- file with information about copy number segments
<tumorsample_vs_normalsample>.tumour_tumourBAF.txt and <tumorsample_vs_normalsample>.tumour_normalBAF.txt
- file with beta allele frequencies
<tumorsample_vs_normalsample>.tumour_tumourLogR.txt and <tumorsample_vs_normalsample>.tumour_normalLogR.txt
- file with total copy number on a logarithmic scale

The text file <tumorsample_vs_normalsample>.cnvs.txt contains predictions about copy number state for all the segments. The output is a tab delimited text file with the following columns:

chr: chromosome number
startpos: start position of the segment
endpos: end position of the segment
nMajor: number of copies of one of the alleles (for example the chromosome inherited from one parent)
nMinor: number of copies of the other allele (for example the chromosome inherited from the other parent)

The file <tumorsample_vs_normalsample>.cnvs.txt contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor = 1) and those corresponding to copy number aberrations.
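As an illustration of how the columns above can be read, the following Python sketch splits the segments of a cnvs.txt file into those with normal copy number and those with aberrations. It assumes the file has a header row with exactly the column names listed above. The function name `classify_segments` and the "normal"/"aberrant" labels are hypothetical and not part of ASCAT or the pipeline.

```python
import csv
import io


def classify_segments(handle):
    """Return one dict per segment with total copy number and a simple label.

    Assumes a tab-delimited file with a header row containing
    chr, startpos, endpos, nMajor and nMinor (an assumption about the layout).
    """
    segments = []
    for row in csv.DictReader(handle, delimiter="\t"):
        n_major = int(row["nMajor"])
        n_minor = int(row["nMinor"])
        segments.append(
            {
                "chr": row["chr"],
                "startpos": int(row["startpos"]),
                "endpos": int(row["endpos"]),
                "total_cn": n_major + n_minor,
                # nMajor = 1 and nMinor = 1 is the normal state described above.
                "label": "normal" if (n_major, n_minor) == (1, 1) else "aberrant",
            }
        )
    return segments


# Made-up example content, not real ASCAT output.
example = io.StringIO(
    "chr\tstartpos\tendpos\tnMajor\tnMinor\n"
    "1\t1000\t50000\t1\t1\n"
    "1\t50001\t90000\t2\t1\n"
    "2\t1\t40000\t2\t0\n"
)
for segment in classify_segments(example):
    print(segment["chr"], segment["startpos"], segment["endpos"],
          segment["total_cn"], segment["label"])
```

In the made-up data, the last segment has a total copy number of 2 but is still labelled aberrant, because one allele is lost (nMinor = 0). This is why the sketch checks both columns rather than only their sum.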

CNVKit

CNVKit is a toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina. For further reading and documentation, see the CNVKit Documentation.

Output files for normal and tumor-only samples

Output directory: {outdir}/variant_calling/cnvkit/<sample>/

<sample>.antitargetcoverage.cnn
- file containing coverage information
<sample>.targetcoverage.cnn
- file containing coverage information
<sample>-diagram.pdf
- file with plot of copy numbers or segments on chromosomes
<sample>-scatter.png
- file with plot of bin-level log2 coverages and segmentation calls
<sample>.bintest.cns
- file containing copy number segment information
<sample>.cnr
- file containing copy number ratio information
<sample>.cns
- file containing copy number segment information
<sample>.call.cns
- file containing copy number segment information
<sample>.genemetrics.tsv
- file containing per gene copy number information (if input files are annotated)

Output files for tumor/normal samples

Output directory: {outdir}/variant_calling/cnvkit/<tumorsample_vs_normalsample>/

<normalsample>.antitargetcoverage.cnn
- file containing coverage information
<normalsample>.targetcoverage.cnn
- file containing coverage information
<tumorsample>.antitargetcoverage.cnn
- file containing coverage information
<tumorsample>.targetcoverage.cnn
- file containing coverage information
<tumorsample>.bintest.cns
- file containing copy number segment information
<tumorsample>-scatter.png
- file with plot of bin-level log2 coverages and segmentation calls
<tumorsample>-diagram.pdf
- file with plot of copy numbers or segments on chromosomes
<tumorsample>.cnr
- file containing copy number ratio information
<tumorsample>.cns
- file containing copy number segment information
<tumorsample>.call.cns
- file containing copy number segment information
<tumorsample>.genemetrics.tsv
- file containing per gene copy number information (if input files are annotated)
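To give a feel for the values stored in the segment files, the sketch below reads a .cns-style table and converts each log2 copy ratio into a rough copy number. It assumes a tab-delimited file with a header containing `chromosome`, `start`, `end` and `log2` columns, and it uses the simple relation copy number = ploidy × 2^log2, which ignores tumor purity and subclonality. The helper names are hypothetical. This is not how CNVKit derives the values in <sample>.call.cns.

```python
import csv
import io


def naive_copy_number(log2_ratio, ploidy=2):
    """Rough copy number for a log2 ratio, assuming a pure sample (simplification)."""
    return ploidy * 2 ** log2_ratio


def read_segments(handle, ploidy=2):
    """Return (chromosome, start, end, rough copy number) for each segment row.

    The column names are an assumption about the file layout.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        copy_number = naive_copy_number(float(row["log2"]), ploidy)
        rows.append((row["chromosome"], int(row["start"]), int(row["end"]),
                     round(copy_number, 2)))
    return rows


# Made-up example content, not real CNVKit output.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chr1\t100\t5000\tGENE_A\t0.0\n"
    "chr1\t5001\t9000\tGENE_B\t1.0\n"
    "chr2\t100\t7000\tGENE_C\t-1.0\n"
)
for segment in read_segments(example):
    print(segment)
```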

Control-FREEC

Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including loss of heterozygosity (LOH)) using deep-sequencing data. Control-FREEC automatically computes, normalizes and segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH. It also detects subclonal gains and losses and evaluates the most likely average ploidy of the sample. For further reading and documentation see the Control-FREEC Documentation.

Output files for tumor-only and tumor/normal paired samples

Output directory: {outdir}/variant_calling/controlfreec/{tumorsample,tumorsample_vs_normalsample}/

config.txt
- Configuration file used to run Control-FREEC
<tumorsample>_BAF.png and <tumorsample_vs_normalsample>_BAF.png
- image of BAF plot
<tumorsample>_ratio.log2.png and <tumorsample_vs_normalsample>_ratio.log2.png
- image of ratio log2 plot
<tumorsample>_ratio.png and <tumorsample_vs_normalsample>_ratio.png
- image of ratio plot
<tumorsample>.bed and <tumorsample_vs_normalsample>.bed
- output translated to a .BED file (for viewing in the UCSC Genome Browser)
<tumorsample>.circos.txt and <tumorsample_vs_normalsample>.circos.txt
- translated output to the Circos format
<tumorsample>.p.value.txt and <tumorsample_vs_normalsample>.p.value.txt
- CNV file containing p_values for each call
<tumorsample>_BAF.txt and <tumorsample_vs_normalsample>.mpileup.gz_BAF.txt
- file with beta allele frequencies for each possibly heterozygous SNP position
<tumorsample_vs_normalsample>.tumor.mpileup.gz_CNVs
- file with coordinates of predicted copy number alterations
<tumorsample>_info.txt and <tumorsample_vs_normalsample>.tumor.mpileup.gz_info.txt
- parsable file with information about FREEC run
<tumorsample>_ratio.BedGraph and <tumorsample_vs_normalsample>.tumor.mpileup.gz_ratio.BedGraph
- file with ratios in BedGraph format for visualization in the UCSC genome browser. The file contains tracks for normal copy number, gains and losses, and copy neutral LOH (*).
<tumorsample>_ratio.txt and <tumorsample_vs_normalsample>.tumor.mpileup.gz_ratio.txt
- file with ratios and predicted copy number alterations for each window
<tumorsample>_sample.cpn and <tumorsample_vs_normalsample>.tumor.mpileup.gz_sample.cpn
- files with raw copy number profiles for the tumor sample
<tumorsample_vs_normalsample>.normal.mpileup.gz_control.cpn
- files with raw copy number profiles for the control sample
GC_profile.<tumorsample>.cpn
- file with GC-content profile

Microsatellite instability (MSI)

Microsatellite instability is a genetic condition associated with deficiencies in the mismatch repair (MMR) system which causes a tendency to accumulate a high number of mutations (SNVs and indels). An altered distribution of microsatellite length is associated with a missed replication slippage which would be corrected under normal MMR conditions.

MSIsensor2

MSIsensor2 is a tool to detect the MSI status for tumor-only sequencing data, including Cell-Free DNA (cfDNA), Formalin-Fixed Paraffin-Embedded (FFPE) and other sample types.

Output files for tumor only samples

Output directory: {outdir}/variant_calling/msisensor2/<tumorsample>/

<tumorsample>
- MSI score output, contains information about the number of somatic sites.
<tumorsample>_dis
- The normal and tumor length distribution for each microsatellite position.
<tumorsample>_somatic
- Somatic sites detected.

MSIsensorPro

MSIsensorPro is a tool to detect the MSI status of a tumor by scanning the length of the microsatellite regions. It requires a normal sample for each tumor to differentiate the somatic and germline cases. For further reading see the MSIsensor paper.

Output files for tumor/normal paired samples

Output directory: {outdir}/variant_calling/msisensor/<tumorsample_vs_normalsample>/

<tumorsample_vs_normalsample>
- MSI score output, contains information about the number of somatic sites.
<tumorsample_vs_normalsample>_dis
- The normal and tumor length distribution for each microsatellite position.
<tumorsample_vs_normalsample>_germline
- Germline sites detected.
<tumorsample_vs_normalsample>_somatic
- Somatic sites detected.

Post Variant Calling

Optional steps to further filter or fine-tune variant calling results. There are two branches: Varlociraptor or bcftools (filtering, normalisation, and concatenation).

Varlociraptor

Because Varlociraptor requires a set of candidate variants to consider, it can be run in combination with any variant caller.

Output files for germline samples

Output directory: {outdir}/variant_calling/varlociraptor/{sample}

<sample>.<variantcaller>.germline.varlociraptor.vcf.gz and <sample>.<variantcaller>.germline.varlociraptor.vcf.gz.tbi
- Final VCF with tabix index
<sample>/<sample>.scenario.varlociraptor.yaml
- YAML file containing scenario for varlociraptor calling
<sample>/<sample>.alignment-properties.json
- JSON file containing alignment properties for normal sample cram

Postprocessed VCF files for tumor-normal calling

Output directory: {outdir}/variant_calling/varlociraptor/{tumorsample_vs_normalsample}

<normal_id>_vs_.<tumor_id>.<variantcaller>.somatic.varlociraptor.vcf.gz and <normal_id>_vs_.<tumor_id>.<variantcaller>.somatic.varlociraptor.vcf.gz.tbi
- Final VCF with tabix index
<normal_id>_vs_.<tumor_id>/<normal_id>_vs_.<tumor_id>.scenario.varlociraptor.yaml
- YAML file containing scenario for varlociraptor calling (somatic calling)
<normal_id>_vs_.<tumor_id>/<normal_id>.alignment-properties.json
- JSON file containing alignment properties for normal sample cram
<normal_id>_vs_.<tumor_id>/<tumor_id>.tumor.alignment-properties.json
- JSON file containing alignment properties for tumor sample cram
<sample>.<variantcaller>.merged.vcf.gz
- VCF containing both somatic and germline variants

Output files for tumor only samples

Output directory: {outdir}/variant_calling/varlociraptor/{sample}

<sample>.<variantcaller>.tumor_only.varlociraptor.vcf.gz and <sample>.<variantcaller>.tumor_only.varlociraptor.vcf.gz.tbi
- Final VCF with tabix index
<sample>/<sample>.scenario.varlociraptor.yaml
- YAML file containing scenario for varlociraptor calling
<sample>/<sample>.alignment-properties.json
- JSON file containing alignment properties for tumor_only sample cram

Filtering

VCFs from all variant callers can be filtered using bcftools view. Filtering is enabled by setting the --filter_vcfs parameter. By default, variants are filtered to include only those with PASS in the FILTER field. Custom filtering criteria can be specified using the --bcftools_filter_criteria parameter (see bcftools view documentation for filter syntax).

Filtered VCF-files for normal and tumor samples

Output directory: {outdir}/variant_calling/filtered/<sample>/

<sample>.<variantcaller>.bcftools_filtered.vcf.gz and <sample>.<variantcaller>.bcftools_filtered.vcf.gz.tbi
- VCF with tabix index containing filtered variants
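The default behaviour described above (keep only records with PASS in the FILTER field) can be illustrated with a few lines of Python. This is a minimal sketch of the idea on uncompressed VCF text, not the command the pipeline runs. The function name `keep_pass_records` is hypothetical, and records with an unset FILTER value (`.`) are dropped here, which is an assumption of the sketch.

```python
def keep_pass_records(lines):
    """Yield header lines unchanged and only those records whose FILTER is PASS.

    FILTER is the seventh tab-separated column of a VCF record.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Made-up example records.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\n",
    "chr1\t200\t.\tC\tT\t10\tLowQual\tDP=3\n",
]
for kept in keep_pass_records(example):
    print(kept, end="")
```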

Normalization

All VCFs are normalized with bcftools norm. The field SOURCE is added to the VCF header to report the variant caller.

Normalized VCF-files for normal and tumor samples

Output directory: {outdir}/variant_calling/normalized/<sample>/

<sample>.<variantcaller>.norm.sorted.vcf.gz and <sample>.<variantcaller>.norm.sorted.vcf.gz.tbi
- VCF with tabix index containing normalized variants

Consensus calling

When --snv_consensus_calling is enabled, consensus VCFs are generated from a set of multiple VCF files by using bcftools isec to identify variants that are called by multiple tools.

Strelka somatic calling produces separate VCF files for SNPs and indels, which are concatenated before consensus calling. The workflow then groups VCF files by sample and performs consensus calling across all specified variant callers.

By default, bcftools isec identifies variants present in at least a minimum number of input VCF files. This can be customized with --consensus_min_count. When annotation is enabled, both the consensus VCF and the individual caller VCFs are annotated.

Consensus called VCF files for all samples

Output directory: {outdir}/variant_calling/consensus/<sample>/

<sample>.consensus.vcf.gz and <sample>.consensus.vcf.gz.tbi
- VCF with tabix index containing variants present in the consensus set of input variant callers. Built from the sites.txt file generated by bcftools isec. Each variant includes CALLERS (which callers found this variant) and NCALLERS (number of callers) INFO fields.
<sample>_consensus/
- Directory containing intermediate bcftools isec output files:
  - 0000.vcf.gz, 0001.vcf.gz, etc. - VCFs with variants unique to or shared between specific caller combinations
  - README.txt - describes which numbered files correspond to which variant callers
  - sites.txt - lists genomic positions and their presence/absence across all input VCF files
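The following Python sketch shows how a minimum-caller threshold, like the one set with --consensus_min_count, can be applied to a sites.txt file. It assumes each line holds five tab-separated fields (chromosome, position, REF, ALT, and a string of 0/1 flags with one character per input VCF), which is how bcftools isec is documented to write the file. The function name `consensus_sites` is hypothetical, and the sketch is not the pipeline's own implementation.

```python
def consensus_sites(lines, min_count):
    """Yield (chrom, pos, ref, alt, n_callers) for sites seen in >= min_count files.

    The last column is assumed to be a presence/absence string such as "101",
    where each character corresponds to one input VCF.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, pos, ref, alt, flags = line.split("\t")
        n_callers = flags.count("1")
        if n_callers >= min_count:
            yield (chrom, int(pos), ref, alt, n_callers)


# Made-up example with three input VCFs.
example = [
    "chr1\t100\tA\tG\t111\n",
    "chr1\t250\tC\tT\t010\n",
    "chr2\t900\tGT\tG\t101\n",
]
for site in consensus_sites(example, min_count=2):
    print(site)
```

The README.txt file in the same directory records which position in the flag string corresponds to which variant caller.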

Concatenation

Germline VCFs from DeepVariant, FreeBayes, HaplotypeCaller, Haplotyper, Manta, bcftools mpileup, Strelka, or TIDDIT are concatenated with bcftools concat. The field SOURCE is added to the VCF header to report the variant caller.

Concatenated VCF-files for normal samples

Output directory: {outdir}/variant_calling/concat/<sample>/

<sample>.germline.vcf.gz and <sample>.germline.vcf.gz.tbi
- VCF with tabix index containing concatenated germline variants

Variant annotation

This directory contains results from the final annotation steps: two tools are used for annotation, snpEff and VEP. Both results can also be combined by setting --tools merge. All variants present in the called VCF files are annotated. For some variant callers this can mean that the variants are already filtered by PASS, for some this needs to be done during post-processing.

snpEff

snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. The generated VCF header contains the software version and the command line used. For further reading and documentation see the snpEff manual.

Output files for all samples

Output directory: {outdir}/annotation/{sample,tumorsample_vs_normalsample}

{sample,tumorsample_vs_normalsample}.<variantcaller>_snpEff.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_snpEff.ann.vcf.gz.tbi
- VCF with tabix index

VEP

VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. The generated VCF header contains the software version as well as the version numbers of additional databases such as ClinVar or dbSNP used in the VEP line. The format of the consequence annotations is also given in the VCF header, in the description of the INFO field. For further reading and documentation see the VEP manual.

Currently, it contains:

Consequence: impact of the variation, if there is any
Codons: the codon change, e.g. cGt/cAt
Amino_acids: change in amino acids, e.g. R/H, if there is any
Gene: ENSEMBL gene name
SYMBOL: gene symbol
Feature: actual transcript name
EXON: affected exon
PolyPhen: prediction based on PolyPhen
SIFT: prediction by SIFT
Protein_position: Relative position of amino acid in protein
BIOTYPE: Biotype of transcript or regulatory feature

plus any additional fields selected via the plugins: Condel, dbNSFP, LOFTEE, Mastermind, Phenotypes, SpliceAI, SpliceRegion.
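Because the field order is declared in the VCF header, the annotations can be unpacked generically. The Python sketch below assumes VEP's usual conventions: an INFO key named `CSQ`, a header description containing `Format: ` followed by `|`-separated field names, and one comma-separated entry per annotated transcript. The helper names are hypothetical, and the example values are made up.

```python
import re


def csq_field_names(header_lines):
    """Extract the '|'-separated field names from the CSQ INFO header line."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                return match.group(1).strip().split("|")
    return []


def parse_csq(info, field_names):
    """Return one dict per annotation found in the CSQ entry of an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(field_names, annotation.split("|")))
                    for annotation in entry[len("CSQ="):].split(",")]
    return []


# Made-up header and INFO column for illustration.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Amino_acids">',
]
info = "DP=30;CSQ=A|missense_variant|GENE1|R/H,A|upstream_gene_variant|GENE2|"

names = csq_field_names(header)
for annotation in parse_csq(info, names):
    print(annotation["SYMBOL"], annotation["Consequence"], annotation["Amino_acids"])
```

Reading the names from the header rather than hard-coding them keeps the sketch working when plugins add extra fields.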

Output files for all samples

Output directory: {outdir}/annotation/{sample,tumorsample_vs_normalsample}

{sample,tumorsample_vs_normalsample}.<variantcaller>_VEP.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_VEP.ann.vcf.gz.tbi
- VCF with tabix index

BCFtools annotate

BCFtools annotate is used to add annotations to VCF files. The annotations are added to the INFO column of the VCF file, and the VCF header is updated to describe the new annotations. For further reading and documentation see the BCFtools annotate manual.

Output files for all samples

{sample,tumorsample_vs_normalsample}.<variantcaller>_bcf.ann.vcf.gz and {sample,tumorsample_vs_normalsample}.<variantcaller>_bcf.ann.vcf.gz.tbi
- VCF with tabix index

Quality control and reporting

Quality control

FastQC

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

The plots display:

Sequence counts for each sample.
Sequence Quality Histograms: The mean quality value across each base position in the read.
Per Sequence Quality Scores: The number of reads with average quality scores. Shows if a subset of reads has poor quality.
Per Base Sequence Content: The proportion of each base position for which each of the four normal DNA bases has been called.
Per Sequence GC Content: The average GC content of reads. A normal random library typically has a roughly normal distribution of GC content.
Per Base N Content: The percentage of base calls at each position for which an N was called.
Sequence Length Distribution.
Sequence Duplication Levels: The relative level of duplication found for each sequence.
Overrepresented sequences: The total amount of overrepresented sequences found in each library.
Adapter Content: The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.

Output files for all samples

Output directory: {outdir}/reports/fastqc/<sample-lane>

<sample-lane_1>_fastqc.html and <sample-lane_2>_fastqc.html
- FastQC report containing quality metrics for your untrimmed raw FastQ files
<sample-lane_1>_fastqc.zip and <sample-lane_2>_fastqc.zip
- Zip archive containing the FastQC report, tab-delimited data file and plot images

Note

The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequences and regions of low quality.

FastP

FastP is a tool designed to provide all-in-one preprocessing for FastQ files and is used for trimming and splitting. The tool then determines QC metrics for the processed reads.

Output files for all samples

Output directory: {outdir}/reports/fastp/<sample>

<sample-lane>_fastp.html
- report in HTML format
<sample-lane>_fastp.json
- report in JSON format
<sample-lane>_fastp.log
- FastP log file

Mosdepth

Mosdepth reports information for the evaluation of the quality of the provided alignment data. In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced. For further reading and documentation see the Mosdepth documentation.

Plots will show:

cumulative coverage distribution
absolute coverage distribution
average coverage per contig/chromosome

Output files for all samples

Output directory: {outdir}/reports/mosdepth/<sample>

<sample>.{sorted,md,recal}.mosdepth.global.dist.txt
- file used by MultiQC, if .region file does not exist
<sample>.{sorted,md,recal}.mosdepth.region.dist.txt
- file used by MultiQC
<sample>.{sorted,md,recal}.mosdepth.summary.txt
- A summary of mean depths per chromosome and within specified regions per chromosome.
<sample>.{sorted,md,recal}.{per-base,regions}.bed.gz
- per-base depth for targeted data, per-window (500bp) depth of WGS
<sample>.{sorted,md,recal}.regions.bed.gz.csi
- CSI index for per-base depth for targeted data, per-window (500bp) depth of WGS
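For a quick look at the summary file outside of MultiQC, the mean depth per chromosome can be read with a few lines of Python. The sketch assumes the tab-delimited layout described in the Mosdepth documentation, with a header row of `chrom`, `length`, `bases`, `mean`, `min` and `max`. The function name `mean_depths` is hypothetical, and the example numbers are made up.

```python
import csv
import io


def mean_depths(handle):
    """Map each row name (chromosome, region row or total) to its mean depth.

    Column names are an assumption about the summary file layout.
    """
    return {row["chrom"]: float(row["mean"])
            for row in csv.DictReader(handle, delimiter="\t")}


# Made-up example content.
example = io.StringIO(
    "chrom\tlength\tbases\tmean\tmin\tmax\n"
    "chr1\t1000\t30000\t30.00\t0\t80\n"
    "chr2\t500\t10000\t20.00\t0\t55\n"
    "total\t1500\t40000\t26.67\t0\t80\n"
)
depths = mean_depths(example)
low = [name for name, depth in depths.items() if depth < 25]
print(depths["total"], low)
```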

NGSCheckMate

NGSCheckMate is a tool for determining whether samples come from the same genetic individual, using a set of commonly heterozygous SNPs. This enables the detection of sample mislabelling events. The output includes a text file indicating whether samples have matched or not according to the algorithm, as well as a dendrogram visualising these results.

Output files for all samples

Output directory: {outdir}/reports/ngscheckmate/

ngscheckmate_all.txt
- Tab delimited text file listing all the comparisons made, whether they were considered as a match, with the correlation and a normalised depth.
ngscheckmate_matched.txt
- Tab delimited text file listing only the comparisons that were considered to match, with the correlation and a normalised depth.
ngscheckmate_output_corr_matrix.txt
- Tab delimited text file containing a matrix of all correlations for all comparisons made.
vcfs/<sample>.vcf.gz
- Set of vcf files for each sample. Contains calls for the set of SNP positions used to calculate sample relatedness.

GATK MarkDuplicates reports

More information in the GATK MarkDuplicates section

Duplicates can arise during sample preparation, e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates. If GATK MarkDuplicates is used, the report is the metrics file generated by the tool itself. If GATK MarkDuplicatesSpark is used, the report is generated by GATK4 EstimateLibraryComplexity on the mapped BAM files. For further reading and documentation see the MarkDuplicates manual.

The plot will show:

duplication statistics

Output files for all samples

Output directory: {outdir}/reports/markduplicates/<sample>

<sample>.md.cram.metrics
- file used by MultiQC

Sentieon Dedup reports

Sentieon’s DNAseq subroutine Dedup produces a metrics report much like the one produced by GATK’s MarkDuplicates. The Dedup metrics are imported into MultiQC as custom content and displayed in a table.

Output files for all samples

Output directory: {outdir}/reports/sentieon_dedup/<sample>

<sample>.dedup.cram.metrics
- file used by MultiQC.

samtools stats

samtools stats collects statistics from CRAM files and outputs in a text format. For further reading and documentation see the samtools manual.

The plots will show:

Alignment metrics.

Output files for all samples

Output directory: {outdir}/reports/samtools/<sample>

<sample>.{sorted,md,recal}.samtools.stats.out
- Raw statistics used by MultiQC

bcftools stats

bcftools stats produces a statistics text file which is suitable for machine processing and can be plotted using plot-vcfstats. For further reading and documentation see the bcftools stats manual.

Plots will show:

Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc.
Note: When using Strelka, there will be no depth distribution plot, as Strelka does not report the INFO/DP field

Output files for all samples

Output directory: {outdir}/reports/bcftools/

<sample>.<variantcaller>.bcftools_stats.txt
- Raw statistics used by MultiQC
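Since the statistics file is meant for machine processing, its summary numbers are easy to extract. The sketch below assumes the usual bcftools stats layout, in which summary lines start with `SN`, followed by a set id, a key ending in a colon, and an integer value, all tab-separated. The function name `summary_numbers` is hypothetical, and only set id 0 (a single input VCF) is read.

```python
def summary_numbers(lines, set_id="0"):
    """Collect the 'SN' (summary numbers) section into a dict of key -> int."""
    numbers = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skips comments and the other sections of the file
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[1] == set_id:
            numbers[fields[2].rstrip(":")] = int(fields[3])
    return numbers


# Made-up example content.
example = [
    "# SN\t[2]id\t[3]key\t[4]value\n",
    "SN\t0\tnumber of samples:\t1\n",
    "SN\t0\tnumber of records:\t120\n",
    "SN\t0\tnumber of SNPs:\t100\n",
    "SN\t0\tnumber of indels:\t20\n",
]
stats = summary_numbers(example)
print(stats["number of SNPs"], stats["number of indels"])
```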

VCFtools

VCFtools is a program package designed for working with VCF files. For further reading and documentation see the VCFtools manual.

Plots will show:

the summary counts of each type of transition to transversion ratio for each FILTER category.
the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs).
the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs).

Output files for all samples

Output directory: {outdir}/reports/vcftools/

<sample>.<variantcaller>.FILTER.summary
- Raw statistics used by MultiQC with a summary of the number of SNPs and Ts/Tv ratio for each FILTER category
<sample>.<variantcaller>.TsTv.count
- Raw statistics used by MultiQC with the Transition / Transversion ratio as a function of alternative allele count. Only uses bi-allelic SNPs.
<sample>.<variantcaller>.TsTv.qual
- Raw statistics used by MultiQC with Transition / Transversion ratio as a function of SNP quality threshold. Only uses bi-allelic SNPs.
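For readers unfamiliar with the metric, the transition to transversion (Ts/Tv) ratio can be computed from REF/ALT pairs as in the sketch below. Transitions are A<->G and C<->T substitutions, and all other single-base substitutions are transversions. The function name `ts_tv_ratio` is hypothetical, and this is an illustration of the definition rather than the VCFtools implementation.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ts_tv_ratio(ref_alt_pairs):
    """Return transitions / transversions over bi-allelic SNPs.

    Pairs that are not single-base substitutions (e.g. indels) are ignored.
    Returns NaN when no transversion is present.
    """
    transitions = transversions = 0
    for ref, alt in ref_alt_pairs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("nan")


# Made-up example: three transitions, one transversion, one indel (ignored).
example = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C"), ("AT", "A")]
print(ts_tv_ratio(example))
```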

snpEff reports

snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. For further reading and documentation see the snpEff manual.

The plots will show:

locations of detected variants in the genome and the number of variants for each location.
the putative impact of detected variants and the number of variants for each impact.
the effect of variants at protein level and the number of variants for each effect type.
the quantity as function of the variant quality score.

Output files for all samples

Output directory: {outdir}/reports/SnpEff/{sample,tumorsample_vs_normalsample}/<variantcaller>/

<sample>.<variantcaller>_snpEff.csv
- Raw statistics used by MultiQC
<sample>.<variantcaller>_snpEff.html
- Statistics to be visualised with a web browser
<sample>.<variantcaller>_snpEff.genes.txt
- TXT (tab separated) summary counts for variants affecting each transcript and gene

VEP reports

VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. For further reading and documentation see the VEP manual.

Output files for all samples

Output directory: {outdir}/reports/EnsemblVEP/{sample,tumorsample_vs_normalsample}/<variantcaller>/

<sample>.<variantcaller>_VEP.summary.html
- Summary of the VEP run to be visualised with a web browser

Reporting

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collect pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report_<timestamp>.html, execution_timeline_<timestamp>.html, execution_trace_<timestamp>.txt, pipeline_dag_<timestamp>.dot/pipeline_dag_<timestamp>.svg and manifest_<timestamp>.bco.json.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Parameters used by the pipeline run: params_<timestamp>.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Reference files

Contains reference folders generated by the pipeline. These files are only published if --save_reference is set.

Output files

bwa/
- Index corresponding to the BWA aligner
bwamem2/
- Index corresponding to the BWA-mem2 aligner
cnvkit/
- Reference files generated by CNVKit
dragmap/
- Index corresponding to the DragMap aligner
dbsnp/
- Tabix index generated by Tabix from the given dbsnp file
dict/
- Sequence dictionary generated by GATK4 CreateSequenceDictionary from the given fasta
fai/
- Fasta index generated with samtools faidx from the given fasta
germline_resource/
- Tabix index generated by Tabix from the given germline resource file
intervals/
- Bed files in various stages: .bed, .bed.gz, .bed per chromosome, .bed.gz per chromosome
known_indels/
- Tabix index generated by Tabix from the given known indels file
msi/
- MSIsensorPro scan of the reference genome to get microsatellites information
pon/
- Tabix index generated by Tabix from the given panel-of-normals file
