Introduction

This document describes the output produced by the pipeline based on a public dataset.

The directories listed below are created in the results/ directory after the pipeline has finished. All paths are relative to the top-level results directory.

Dataset

RNAseq data has been taken from GM12878 GIAB sample SRA Accession SRR5665260. The dataset has 38.6 million PE reads of read length 2x151 bp from NextSeq 500 sequencing platform.

Samplesheet

A sample sheet has been prepared in the following way to set the FASTQ files to run the analysis.

sample,fastq_1,fastq_2,strandedness
GM12878,/data/GM12878/SRR5665260_1.fastq.gz,/data/GM12878/SRR5665260_2.fastq.gz,reverse

Execution

The pipeline has been executed with the following command:

nextflow run nf-core/rnavar -profile <institutional_config>,docker --input samplesheet.csv --genome GRCh38 --annotate_tools merge --outdir results

The <institutional_config> used in this experiment can be found here. However, you can create your own institutional config and place it on nf-core/configs and then use the config name directly in the command instead of <institutional_config> to use your own data and parameters.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Preprocessing

cat

Output files
  • fastq/
    • *.merged.fastq.gz: If --save_merged_fastq is specified, concatenated FastQ files will be placed in this directory.

If multiple libraries/runs have been provided for the same sample in the input samplesheet (e.g. to increase sequencing depth) then these will be merged at the very beginning of the pipeline in order to have consistent sample naming throughout the pipeline. Please refer to the usage documentation to see how to specify these samples in the input samplesheet.

Alignment

STAR

STAR is a read aligner designed for splice aware mapping typical of RNA sequencing data. STAR stands for Spliced Transcripts Alignment to a Reference, and has been shown to have high accuracy and outperforms other aligners by more than a factor of 50 in mapping speed, but it is memory intensive.

MultiQC - STAR Alignment stats

Output files
  • preprocessing/[SAMPLE]/
    • [SAMPLE].aligned.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
    • [SAMPLE].aligned.bam.bai: This is the index of the above *.aligned.bam
  • preprocessing/[SAMPLE]/log
    • [SAMPLE].Log.final.out: STAR alignment report containing the mapping results summary.
    • [SAMPLE].Log.out and [SAMPLE].Log.progress.out: STAR log files containing detailed information about the run. Typically only useful for debugging purposes.
    • [SAMPLE].SJ.out.tab: File containing filtered splice junctions detected after mapping the reads.
  • preprocessing/[SAMPLE]/unmapped
    • [SAMPLE].unmapped_*.fastq.gz: If --save_unaligned is specified, FastQ files containing unmapped reads will be placed in this directory.
  • reports/stats/[SAMPLE]/
    • [SAMPLE].aligned.bam.flagstat: Samtools flagstat summary of the alignment
    • [SAMPLE].aligned.bam.stats: Samtools stat output

Alignment post-processing

MarkDuplicates

GATK MarkDuplicates locates and tags duplicate reads in a BAM file. The tool’s main output is a new BAM file, in which duplicates have been identified in the SAM flags field for each read.

MultiQC - MarkDuplicate stats

If desired, duplicates can be removed using the --remove_duplicates true option.

Output files
  • preprocessing/[SAMPLE]/
    • [SAMPLE].markdup.sorted.bam: Picard Markduplicate bam file.
    • [SAMPLE].markdup.sorted.bam.bai: This is the index of the above *.aligned.bam

SplitNCigarReads

GATK SplitNCigarReads is executed for post-processing RNA reads aligned against the reference. The tool’s main output is a new BAM file with reads split at N CIGAR elements and CIGAR strings updated.

Currently, the pipeline does not produce the new BAM file in the output directory.

Base (Quality Score) Recalibration

Base quality recalibration step runs by default and it produces the recalibrated BAM file for variant calling, as described below. However, you can turn off this step by using --skip_baserecalibration true option, and in that case, no recalibrated BAM file is produced and the pipeline uses the un-calibrated BAM file will be used for variant calling.

GATK BaseRecalibrator

GATK BaseRecalibrator generates a recalibration table based on various co-variates.

Currently, the pipeline does not produce the recalibration table file in the output directory.

GATK ApplyBQSR

GATK ApplyBQSR recalibrates the base qualities of the input reads based on the recalibration table produced by the GATK BaseRecalibrator tool.

Output files
  • preprocessing/[SAMPLE]/
    • [SAMPLE].recal.bam: Recalibrated bam file.
    • [SAMPLE].recal.bam.bai: This is the index of the above recalibrated bam.

Variant calling

GATK HaplotypeCaller is used to call SNVs and small indels in the sample. The Recalibrated BAM file is used as an input to this process and the output file is produced in VCF format.

Output files
  • results/variant_calling/[SAMPLE]/
    • [SAMPLE].haplotypecaller.vcf.gz: Variant calls in VCF format.
    • [SAMPLE].haplotypecaller.vcf.gz.tbi: This is the index of the above VCF file.

Variant filtering

GATK VariantFiltration is used for hard-filtering variant calls based on certain criteria. Records are hard-filtered by changing the value in the FILTER field to something other than PASS. Filtered records will be preserved in the output.

Output files
  • results/variant_calling/[SAMPLE]/
    • [SAMPLE].haplotypecaller.filtered.vcf.gz: Variant VCF with updated FILTER field.
    • [SAMPLE].haplotypecaller.filtered.vcf.gz.tbi: This is the index of the above VCF file.

Variant annotation

This directory contains results from the final annotation steps: two tools are used for annotation, snpEff and VEP.

snpEff

snpeff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. The generated VCF header contains the software version and the used command line.

To annotate variants using snpeff, you can use --annotate_tools snpeff or --annotate_tools merge. The annotated variant files in VCF format can be found in results/variant_annotation folder.

MultiQC - snpEff variant by region MultiQC - snpEff variant by impact MultiQC - snpEff variant by effect types MultiQC - snpEff variant by quality

Output files
  • results/variant_annotation/[SAMPLE]/
    • [SAMPLE]_snpEff.ann.vcf.gz: Annotated VCF from snpEff process.
    • [SAMPLE]_snpEff.ann.vcf.gz.tbi: This is the index of the above VCF file.

For further reading and documentation see the snpEff manual

VEP

VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of the variants. The generated VCF header contains the software version, also the version numbers for additional databases like Clinvar or dbSNP used in the VEP line. The format of the consequence annotations is also in the VCF header describing the INFO field.

Currently, it contains:

  • Consequence: impact of the variation, if there is any
  • Codons: the codon change, i.e. cGt/cAt
  • Amino_acids: change in amino acids, i.e. R/H if there is any
  • Gene: ENSEMBL gene name
  • SYMBOL: gene symbol
  • Feature: actual transcript name
  • EXON: affected exon
  • PolyPhen: prediction based on PolyPhen
  • SIFT: prediction by SIFT
  • Protein_position: Relative position of amino acid in protein
  • BIOTYPE: Biotype of transcript or regulatory feature

To annotate variants using vep, you can use --annotate_tools vep. The annotated variant files in VCF format can be found in results/variant_annotation folder.

MultiQC - VEP general statistics MultiQC - VEP SIFT summary MultiQC - VEP Polyphen summary

Output files
  • results/variant_annotation/[SAMPLE]/
    • [SAMPLE]_VEP.ann.vcf.gz: Annotated VCF from VEP process.
    • [SAMPLE]_VEP.ann.vcf.gz.tbi: This is the index of the above VCF file.

When --annotate_tools merge option is used, the annotation from both snpeff and vep are combined into a single VCF file which can be found with the following naming convention.

Output files
  • results/variant_annotation/[SAMPLE]/
    • [SAMPLE]_snpEff_VEP.ann.vcf.gz: Combined annotation from both snpEff and VEP.
    • [SAMPLE]_snpEff_VEP.ann.vcf.gz.tbi: This is the index of the above VCF file.

For further reading and documentation see the VEP manual

QC and Reporting

QC

FastQC

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

Plots will show:

  • Sequence counts for each sample.
  • Sequence Quality Histograms: The mean quality value across each base position in the read.
  • Per Sequence Quality Scores: The number of reads with average quality scores. Shows if a subset of reads has poor quality.
  • Per Base Sequence Content: The proportion of each base position for which each of the four normal DNA bases has been called.
  • Per Sequence GC Content: The average GC content of reads. Normal random library typically have a roughly normal distribution of GC content.
  • Per Base N Content: The percentage of base calls at each position for which an N was called.
  • Sequence Length Distribution.
  • Sequence Duplication Levels: The relative level of duplication found for every sequence.
  • Overrepresented sequences: The total amount of overrepresented sequences found in each library.
  • Adapter Content: The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.

MultiQC - FASTQC Sequence Counts MultiQC - FASTQC Mean Quality Distribution MultiQC - FASTQC GC content MultiQC - FASTQC Read Length Distribution MultiQC - FASTQC Overall summary

Note

The FastQC plots displayed in the MultiQC report shows untrimmed reads. They may contain adapter sequence and potentially regions with low quality.

GATK MarkDuplicates reports

More information in the GATK MarkDuplicates section

Duplicates can arise during sample preparation e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

Output files
  • reports/stats/[SAMPLE]/
    • [SAMPLE].markdup.sorted.metrics: Information about the number of duplicate reads in the sample.

For further reading and documentation see the MarkDuplicates manual.

samtools stats

samtools stats collects statistics from BAM files and outputs in a text format.

Plots will show:

  • Alignment metrics.
Output files
  • results/reports/stats/[SAMPLE]/
    • [SAMPLE].aligned.bam.flagstat: Samtools flagstat output on raw alignment BAM.
    • [SAMPLE].aligned.bam.stats: Samtools stats on raw alignment BAM.
    • [SAMPLE].markdup.sorted.bam.flagstat: Samtools flagstat output on markduplicated BAM.
    • [SAMPLE].markdup.sorted.bam.stats: Samtools flagstat output on markduplicated BAM.
    • [SAMPLE].recal.bam.stats: Samtools flagstat output on recalibrated BAM.

For further reading and documentation see the samtools manual

snpEff reports

snpeff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations.

Plots will shows :

  • locations of detected variants in the genome and the number of variants for each location.
  • the putative impact of detected variants and the number of variants for each impact.
  • the effect of variants at protein level and the number of variants for each effect type.
  • the quantity as function of the variant quality score.
Output files
  • results/reports/SnpEff/[SAMPLE]/
    • [SAMPLE].csv: Summary of variants by chromosome, region, effect, impact, functional class, type, etc.
    • [SAMPLE].genes.txt: TXT (tab separated) summary counts for variants affecting each transcript and gene.
    • snpEff_summary.html: Statistics with graphs to be viewed with a web browser

VEP reports

VEP (Variant Effect Predictor), based on Ensembl, is a tools to determine the effects of all sorts of variants, including SNPs, indels, structural variants, CNVs.

Output files
  • results/reports/EnsemblVEP/[SAMPLE]/
    • [SAMPLE].summary.html: Statistics with graphs to be viewed with a web browser

For further reading and documentation see the VEP manual

Reporting

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

MultiQC - Report View

The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability.

For more information about how to use MultiQC reports, see https://multiqc.info.

Output files
  • reports/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameter’s are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.