nf-core/sarek
Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
Version 2.5.1 (the latest stable release is 3.5.0).
Output
This document describes the output produced by the pipeline.
Pipeline overview
The pipeline processes data using the following steps:
- Preprocessing (based on GATK best practices)
  - Map reads to Reference
    - BWA mem
  - Mark Duplicates
    - GATK MarkDuplicates
  - Base (Quality Score) Recalibration
    - GATK BaseRecalibrator
    - GATK GatherBQSRReports
    - GATK ApplyBQSR
- Variant calling
  - SNVs and small indels
  - Structural variants
  - Sample heterogeneity, ploidy and CNVs
    - alleleCounter
    - ConvertAlleleCounts
    - ASCAT
    - samtools mpileup
    - Control-FREEC
- Annotation
  - Variant annotation
- QC and Reporting
Preprocessing
Sarek preprocesses raw FastQ files or unmapped BAM files, based on GATK best practices.
BAM files with recalibration tables can also be used as input to start from the recalibration step; for more information, see the TSV files output information.
Duplicate Marked BAM file(s) with Recalibration Table(s)
This directory is the location for the BAM files delivered to users. Besides the duplicate marked BAM files, the recalibration tables (*.recal.table) are also stored, and can be used to create base recalibrated files.
For further reading and documentation see the data pre-processing workflow from the GATK best practices.
For all samples:
Output directory: results/Preprocessing/[SAMPLE]/DuplicateMarked
[SAMPLE].md.bam, [SAMPLE].md.bai and [SAMPLE].recal.table
- BAM file and index with Recalibration Table
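As a rough illustration of the layout above, the following Python sketch builds the expected paths for one sample. The function name `duplicate_marked_outputs` and the default `results` root are assumptions made for this example; they are not part of Sarek.

```python
from pathlib import Path


def duplicate_marked_outputs(sample, outdir="results"):
    """Return the expected DuplicateMarked paths for one sample.

    Hypothetical helper: it only mirrors the naming scheme documented
    above and does not check that the files exist.
    """
    base = Path(outdir) / "Preprocessing" / sample / "DuplicateMarked"
    return {
        "bam": base / f"{sample}.md.bam",
        "bai": base / f"{sample}.md.bai",
        "recal_table": base / f"{sample}.recal.table",
    }
```

Calling `duplicate_marked_outputs("S1")` returns a dictionary of `pathlib.Path` objects that can be checked with `.exists()` after a run.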
Recalibrated BAM file(s)
This directory is usually empty; it is the location for the final recalibrated BAM files.
Recalibrated BAM files are usually 2-3 times larger than the duplicate marked BAM files.
To regenerate a recalibrated BAM file, apply the recalibration table delivered to the DuplicateMarked directory, either within Sarek or by running the recalibration step yourself.
For further reading and documentation see the data pre-processing workflow from the GATK best practices.
For all samples:
Output directory: results/Preprocessing/[SAMPLE]/Recalibrated
[SAMPLE].recal.bam and [SAMPLE].recal.bai
- BAM file and index
TSV files
The TSV files are autogenerated and can be used by Sarek for further processing and/or variant calling.
For further reading and documentation see the input documentation.
For all samples:
Output directory: results/Preprocessing/TSV
duplicateMarked.tsv and recalibrated.tsv
- TSV files to start Sarek from recalibration or variantcalling steps.
duplicateMarked_[SAMPLE].tsv and recalibrated_[SAMPLE].tsv
- TSV files to start Sarek from recalibration or variantcalling steps for a specific sample.
Variant Calling
All the results regarding variant-calling are collected in this directory.
Recalibrated BAM files can also be used as input to start from variant calling; for more information, see the TSV files output information.
FreeBayes
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment.
For further reading and documentation see the FreeBayes manual.
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/FreeBayes
FreeBayes_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and FreeBayes_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi
- VCF with Tabix index
HaplotypeCaller
GATK HaplotypeCaller calls germline SNPs and indels via local re-assembly of haplotypes.
Germline calls are provided for all samples, to enable comparison of tumor and normal samples and detect possible sample mix-ups.
For further reading and documentation see the HaplotypeCaller manual.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/HaploTypeCaller
HaplotypeCaller_[SAMPLE].vcf.gz and HaplotypeCaller_[SAMPLE].vcf.gz.tbi
- VCF with Tabix index
GenotypeGVCFs
GATK GenotypeGVCFs performs joint genotyping on one or more samples pre-called with HaplotypeCaller.
Germline calls are provided for all samples, to enable comparison of tumor and normal samples and detect possible sample mix-ups.
For further reading and documentation see the GenotypeGVCFs manual.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/HaplotypeCallerGVCF
HaplotypeCaller_[SAMPLE].g.vcf.gz and HaplotypeCaller_[SAMPLE].g.vcf.gz.tbi
- VCF with Tabix index
Mutect2
GATK Mutect2 calls somatic SNVs and indels via local assembly of haplotypes.
For further reading and documentation see the Mutect2 manual. For this version of Mutect2, a panel of normals (PON) built from at least 40 normal samples is recommended; you can supply your PON file to get filtered somatic calls.
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Mutect2
Files created:
unfiltered_Mutect2_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and unfiltered_Mutect2_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi
- unfiltered (raw) Mutect2 calls VCF with Tabix index
filtered_Mutect2_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and filtered_Mutect2_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi
- filtered Mutect2 calls VCF with Tabix index: these entries have a PASS filter; you get these files when supplying a panel of normals using the --pon option
[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.stats
- a stats file generated while calling raw variants (needed for filtering)
[TUMORSAMPLE]_contamination.table
- a text file about sample contamination, exported when a panel of normals is provided
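To check how many records in one of these VCF files carry a PASS filter, a minimal Python sketch could look like the following. The function name `count_pass_records` is hypothetical; the only format facts relied upon are that header lines start with `#` and that FILTER is the seventh tab-separated column of a VCF record.

```python
import gzip


def count_pass_records(vcf_gz_path):
    """Return (passed, total) record counts for a gzip-compressed VCF.

    Hypothetical helper for illustration; it reads the file line by line
    and does not validate the VCF beyond counting columns.
    """
    total = 0
    passed = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # ignore empty or malformed lines
            total += 1
            if fields[6] == "PASS":  # FILTER is the 7th VCF column
                passed += 1
    return passed, total
```

The same function can be pointed at the unfiltered and the filtered file to compare their record counts.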
TIDDIT
TIDDIT identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions.
Germline calls are provided for all samples, to enable comparison of tumor and normal samples and detect possible sample mix-ups. Low quality calls are removed internally to simplify processing of variant calls, but Sarek still saves them in a separate file.
For further reading and documentation see the TIDDIT manual.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/TIDDIT
TIDDIT_[SAMPLE].vcf.gz and TIDDIT_[SAMPLE].vcf.gz.tbi
- VCF with Tabix index
TIDDIT_[SAMPLE].signals.tab
- tab file describing coverage across the genome, binned per 50 bp
TIDDIT_[SAMPLE].ploidy.tab
- tab file describing the estimated ploidy and coverage across each contig
TIDDIT_[SAMPLE].old.vcf
- VCF including the low quality calls
TIDDIT_[SAMPLE].wig
- wiggle file containing coverage across the genome, binned per 50 bp
TIDDIT_[SAMPLE].gc.wig
- wiggle file containing fraction of GC content, binned per 50 bp
Strelka2
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs.
For further reading and documentation see the Strelka2 user guide.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/Strelka
Strelka_Sample_genome.vcf.gz and Strelka_Sample_genome.vcf.gz.tbi
- VCF with Tabix index
Strelka_Sample_variants.vcf.gz and Strelka_Sample_variants.vcf.gz.tbi
- VCF with Tabix index
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka
Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz and Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi
- VCF with Tabix index
Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz and Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi
- VCF with Tabix index
Using Strelka Best Practices with the candidateSmallIndels from Manta:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka
StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz and StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi
- VCF with Tabix index
StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz and StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi
- VCF with Tabix index
Manta
Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads.
It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs.
Manta also provides a candidate list of small indels that can be fed to Strelka following Strelka Best Practices.
For further reading and documentation see the Manta user guide.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/Manta
Manta_[SAMPLE].candidateSmallIndels.vcf.gz and Manta_[SAMPLE].candidateSmallIndels.vcf.gz.tbi
- VCF with Tabix index
Manta_[SAMPLE].candidateSV.vcf.gz and Manta_[SAMPLE].candidateSV.vcf.gz.tbi
- VCF with Tabix index
For a Normal sample only:
Manta_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[NORMALSAMPLE].diploidSV.vcf.gz.tbi
- VCF with Tabix index
For a Tumor sample only:
Manta_[TUMORSAMPLE].tumorSV.vcf.gz and Manta_[TUMORSAMPLE].tumorSV.vcf.gz.tbi
- VCF with Tabix index
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Manta
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz.tbi
- VCF with Tabix index
ConvertAlleleCounts
ConvertAlleleCounts is an R script for converting output from AlleleCount to BAF and LogR values.
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT
[TUMORSAMPLE].BAF and [NORMALSAMPLE].BAF
- file with beta allele frequencies
[TUMORSAMPLE].LogR and [NORMALSAMPLE].LogR
- file with total copy number on a logarithmic scale
ASCAT
ASCAT is a method to derive copy number profiles of tumor cells, accounting for normal cell admixture and tumor aneuploidy. ASCAT infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles.
For further reading and documentation see the Sarek documentation about ASCAT or the ASCAT manual.
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT
[TUMORSAMPLE].aberrationreliability.png
- Image with information about aberration reliability
[TUMORSAMPLE].ASCATprofile.png
- Image with information about ASCAT profile
[TUMORSAMPLE].ASPCF.png
- Image with information about ASPCF
[TUMORSAMPLE].rawprofile.png
- Image with information about raw profile
[TUMORSAMPLE].sunrise.png
- Image with information about sunrise
[TUMORSAMPLE].tumour.png
- Image with information about tumor
[TUMORSAMPLE].cnvs.txt
- file with information about CNVs
[TUMORSAMPLE].LogR.PCFed.txt
- file with information about LogR
[TUMORSAMPLE].purityploidy.txt
- file with information about purity ploidy
mpileup
samtools mpileup generates a pileup for a BAM file.
For further reading and documentation see the samtools manual.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/mpileup
[SAMPLE].pileup.gz
- The pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. Alignment records are grouped by sample (SM) identifiers in @RG header lines.
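For orientation, here is a small Python sketch that parses one line of such a file. It assumes the six-column layout that samtools mpileup writes for a single BAM file (chromosome, 1-based position, reference base, depth, read bases, base qualities); the names `PileupRecord` and `parse_pileup_line` are hypothetical.

```python
from typing import NamedTuple


class PileupRecord(NamedTuple):
    chrom: str
    pos: int    # 1-based coordinate
    ref: str    # reference base
    depth: int  # number of reads covering the position
    bases: str  # read bases string
    quals: str  # base qualities, one character per read


def parse_pileup_line(line):
    """Parse one tab-separated pileup line (single-sample layout assumed)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth, bases, quals = fields[:6]
    return PileupRecord(chrom, int(pos), ref, int(depth), bases, quals)
```

Because the delivered file is gzip-compressed, lines would typically be read through `gzip.open(path, "rt")` before being passed to the parser.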
Control-FREEC
Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data. Control-FREEC automatically computes, normalizes, and segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH. It also detects subclonal gains and losses and evaluates the likeliest average ploidy of the sample.
For further reading and documentation see the Control-FREEC manual.
For a Tumor/Normal pair only:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ControlFREEC
[TUMORSAMPLE]_vs_[NORMALSAMPLE].config.txt
- Configuration file used to run Control-FREEC
[TUMORSAMPLE].pileup.gz_CNVs and [TUMORSAMPLE].pileup.gz_normal_CNVs
- file with coordinates of predicted copy number alterations
[TUMORSAMPLE].pileup.gz_ratio.txt and [TUMORSAMPLE].pileup.gz_normal_ratio.txt
- file with ratios and predicted copy number alterations for each window
[TUMORSAMPLE].pileup.gz_BAF.txt and [NORMALSAMPLE].pileup.gz_BAF.txt
- file with beta allele frequencies for each possibly heterozygous SNP position
Annotation
This directory contains results from the final annotation steps: two tools are used for annotation, snpEff and VEP. Only a subset of the VCF files is annotated, and only variants that have a PASS filter. FreeBayes results are not annotated yet, as we are lacking a decent somatic filter. For HaplotypeCaller, the germline variations are annotated for both the tumor and the normal sample.
snpEff
snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. The generated VCF header contains the software version and the command line used.
For further reading and documentation see the snpEff manual.
For all samples:
Output directory: results/Annotation/[SAMPLE]/snpEff
VariantCaller_Sample_snpEff.ann.vcf.gz and VariantCaller_Sample_snpEff.ann.vcf.gz.tbi
- VCF with Tabix index
VEP
VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. The generated VCF header contains the software version, as well as the version numbers of additional databases used, like Clinvar or dbSNP, in the “VEP” line. The format of the consequence annotations is also given in the VCF header line describing the INFO field. At the moment it contains:
- Consequence: impact of the variation, if there is any
- Codons: the codon change, e.g. cGt/cAt
- Amino_acids: change in amino acids, if there is any, e.g. R/H
- Gene: ENSEMBL gene name
- SYMBOL: gene symbol
- Feature: actual transcript name
- EXON: affected exon
- PolyPhen: prediction based on PolyPhen
- SIFT: prediction by SIFT
- Protein_position: Relative position of amino acid in protein
- BIOTYPE: Biotype of transcript or regulatory feature
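The fields listed above can be recovered programmatically from the header. The sketch below is illustrative only: it assumes the annotations are stored under VEP's default INFO key `CSQ` and that the header description contains a `Format: A|B|C` list, which is how VEP writes VCF output by default. The function names are hypothetical.

```python
def csq_field_names(header_line):
    """Extract the pipe-separated field names from the CSQ ##INFO line."""
    marker = "Format: "
    start = header_line.index(marker) + len(marker)
    end = header_line.index('"', start)  # closing quote of Description
    return header_line[start:end].split("|")


def parse_csq(info, field_names, key="CSQ"):
    """Return one dict per annotation found in a VCF INFO string.

    `key` defaults to CSQ, which is an assumption about how VEP was run.
    """
    prefix = key + "="
    for entry in info.split(";"):
        if entry.startswith(prefix):
            return [
                dict(zip(field_names, annotation.split("|")))
                for annotation in entry[len(prefix):].split(",")
            ]
    return []  # no annotation present for this record
```

Each returned dictionary can then be queried by name, for example `annotation["Consequence"]` or `annotation["SYMBOL"]`, provided those fields appear in the header of the file at hand.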
For further reading and documentation see the VEP manual.
For all samples:
Output directory: results/Annotation/[SAMPLE]/VEP
VariantCaller_Sample_VEP.ann.vcf.gz and VariantCaller_Sample_VEP.ann.vcf.gz.tbi
- VCF with Tabix index
QC and reporting
FastQC
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
For all samples:
Output directory: results/Reports/[SAMPLE]/fastqc
sample_R1_XXX_fastqc.html and sample_R2_XXX_fastqc.html
- FastQC report, containing quality metrics for each pair of the raw fastq files
sample_R1_XXX_fastqc.zip and sample_R2_XXX_fastqc.zip
- zip file containing the FastQC reports, tab-delimited data files and plot images
bamQC
Qualimap bamqc reports information for the evaluation of the quality of the provided alignment data. In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced.
Plots will show:
- Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc.
For all samples:
Output directory: results/Reports/[SAMPLE]/bamQC
VariantCaller_[SAMPLE].bcf.tools.stats.out
- RAW statistics used by MultiQC
For more information about how to use Qualimap bamqc reports, see the Qualimap bamqc manual.
MarkDuplicates reports
GATK MarkDuplicates locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.
For all samples:
Output directory: results/Reports/[SAMPLE]/MarkDuplicates
[SAMPLE].bam.metrics
- RAW statistics used by MultiQC
For further reading and documentation see the MarkDuplicates manual.
samtools stats
samtools stats collects statistics from BAM files and outputs in a text format. Plots will show:
- Alignment metrics.
For all samples:
Output directory: results/Reports/[SAMPLE]/SamToolsStats
[SAMPLE].bam.samtools.stats.out
- RAW statistics used by MultiQC
For further reading and documentation see the samtools manual.
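Although these files are primarily consumed by MultiQC, the summary section can also be read directly. The sketch below assumes the usual samtools stats layout, in which summary lines start with `SN`, followed by a tab, a key ending in a colon, a tab and the value; the function name `read_summary_numbers` is hypothetical.

```python
def read_summary_numbers(path):
    """Collect the SN (summary numbers) lines of a samtools stats file."""
    summary = {}
    with open(path) as handle:
        for line in handle:
            if not line.startswith("SN\t"):
                continue  # other sections (FFQ, GCF, IS, ...) are skipped
            parts = line.rstrip("\n").split("\t")
            key = parts[1].rstrip(":")
            value = parts[2]
            try:
                summary[key] = int(value)
            except ValueError:
                try:
                    summary[key] = float(value)
                except ValueError:
                    summary[key] = value  # keep unexpected values as text
    return summary
```

For a file produced by the pipeline, `read_summary_numbers(path)["raw total sequences"]` would give the number of reads that samtools counted, assuming that key is present in the file.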
bcftools stats
bcftools is a program for variant calling and manipulating files in the Variant Call Format. Plots will show:
- Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc.
For all samples:
Output directory: results/Reports/[SAMPLE]/BCFToolsStats
VariantCaller_[SAMPLE].bcf.tools.stats.out
- RAW statistics used by MultiQC
For further reading and documentation see the bcftools stats manual.
VCFtools
VCFtools is a program package designed for working with VCF files. Plots will show:
- the summary counts of each type of transition to transversion ratio for each FILTER category.
- the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs).
- the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs).
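As a reminder of what the ratio measures, the following Python sketch classifies bi-allelic SNPs and computes a transition to transversion ratio. It is a generic illustration, not the VCFtools implementation; the function names are hypothetical. Transitions are changes within purines (A, G) or within pyrimidines (C, T); every other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a bi-allelic SNP: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Compute Ts/Tv for an iterable of (ref, alt) pairs.

    Returns None when there are no transversions, to avoid dividing by zero.
    """
    ts = tv = 0
    for ref, alt in snps:
        if classify_snp(ref, alt) == "transition":
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None
```

For example, two transitions and one transversion, `[("A", "G"), ("C", "T"), ("A", "C")]`, give a ratio of 2.0.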
For all samples:
Output directory: results/Reports/[SAMPLE]/VCFTools
VariantCaller_[SAMPLE].FILTER.summary
- RAW statistics used by MultiQC
VariantCaller_[SAMPLE].TsTv.count
- RAW statistics used by MultiQC
VariantCaller_[SAMPLE].TsTv.qual
- RAW statistics used by MultiQC
For further reading and documentation see the VCFtools manual.
snpEff reports
snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations.
Plots will show:
- locations of detected variants in the genome and the number of variants for each location.
- the putative impact of detected variants and the number of variants for each impact.
- the effect of variants at protein level and the number of variants for each effect type.
- the number of variants as a function of the variant quality score.
For all samples:
Output directory: results/Reports/[SAMPLE]/snpEff
VariantCaller_Sample_snpEff.csv
- RAW statistics used by MultiQC
VariantCaller_Sample_snpEff.html
- Statistics to be visualised with a web browser
VariantCaller_Sample_snpEff.txt
- TXT (tab separated) summary counts for variants affecting each transcript and gene
For further reading and documentation see the snpEff manual.
VEP reports
VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs.
For all samples:
Output directory: results/Reports/[SAMPLE]/VEP
VariantCaller_Sample_VEP.summary.html
- Summary of the VEP run to be visualised with a web browser
For further reading and documentation see the VEP manual.
MultiQC
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
For the whole Sarek run:
Output directory: results/Reports/MultiQC
multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For further reading and documentation see the MultiQC website.