Introduction
This document describes the output produced by the pipeline. The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
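Because every path in this document is written relative to the results directory and uses placeholders such as <group_id>, a small helper can turn a documented path into a concrete one. The following is a minimal sketch only: the function name `resolve_output`, the results directory name, and the sample identifiers are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def resolve_output(results_dir, template, ids):
    """Replace <placeholder> tokens in a documented path and join it to results_dir."""
    relative = template
    for name, value in ids.items():
        relative = relative.replace(f"<{name}>", value)
    # Fail loudly if any placeholder was not supplied.
    if "<" in relative or ">" in relative:
        raise ValueError(f"unresolved placeholder in {relative!r}")
    return Path(results_dir) / relative


# Hypothetical identifiers, for illustration only.
ids = {"group_id": "GROUP1", "tumor_dna_id": "TUMOR1"}
path = resolve_output("results", "<group_id>/purple/<tumor_dna_id>.purple.qc", ids)
print(path.as_posix())  # results/GROUP1/purple/TUMOR1.purple.qc
```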
Pipeline overview
- Simple DNA/RNA alignment
- Alignment post-processing
  - MarkDups - General alignment processing
  - Picard MarkDuplicates - Duplicate read marking
- SNV, MNV, INDEL calling
- SV calling
- CNV calling
- SV event interpretation
  - LINX - SV event clustering and annotation
- Transcript analysis
  - Isofox - transcript counts, novel splicing and fusion calling
- Oncoviral detection
  - VIRUSBreakend - viral content and integration calling
  - Virus Interpreter - oncoviral calling post-processing
- HLA calling
  - LILAC - HLA calling
- HRD status prediction
  - CHORD - HRD status prediction
- Mutational signature fitting
  - Sigs - Mutational signature fitting
- Tissue of origin prediction
  - CUPPA - Tissue of origin prediction
- Neoepitope prediction
  - Neo - Neoepitope prediction
- Report generation
  - ORANGE - Key results summary
  - linxreport - Interactive LINX report
- Pipeline information - Workflow execution metrics
Simple DNA/RNA alignment
Alignment functionality in oncoanalyser is simple and rigid, and exists only to meet the exact requirements of the hmftools.
bwa-mem2
bwa-mem2 is a short-read mapping tool used to align reads to large reference sequences. In oncoanalyser, bwa-mem2 is used to align DNA reads to the human genome.
No outputs are published directly from bwa-mem2; see MarkDups for the fully processed alignment outputs.
STAR
STAR is a specialised mapping tool used to align RNA reads to a reference transcriptome.
No outputs are published directly from STAR; see Picard MarkDuplicates for the fully processed alignment outputs.
Alignment post-processing
MarkDups
Output files
<group_id>/alignments/dna/
- <normal_dna_id>.duplicate_freq.tsv: Normal DNA sample read duplicate frequencies.
- <normal_dna_id>.markdups.bam: Normal DNA sample output read alignments.
- <normal_dna_id>.markdups.bam.bai: Normal DNA sample output read alignments index.
- <tumor_dna_id>.duplicate_freq.tsv: Tumor DNA sample read duplicate frequencies.
- <tumor_dna_id>.markdups.bam: Tumor DNA sample output read alignments.
- <tumor_dna_id>.markdups.bam.bai: Tumor DNA sample output read alignments index.
MarkDups applies various alignment post-processing routines such as duplicate marking and unmapping of problematic regions. It can also handle UMIs when configured to do so.
MarkDups is only run on DNA alignments.
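Alignment and variant outputs in this document are published alongside an index file (.bam with .bam.bai, .vcf.gz with .vcf.gz.tbi). As a rough sanity check after a run, the sketch below walks a directory and reports data files whose index is absent. The function name `missing_indexes` and the suffix table are illustrative assumptions derived from the file names listed here, not something the pipeline provides.

```python
from pathlib import Path

# Data file suffix -> index suffix, following the naming used in this document.
INDEX_SUFFIXES = {".bam": ".bai", ".vcf.gz": ".tbi"}


def missing_indexes(directory):
    """Return data files under directory that lack their companion index file."""
    missing = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        for data_suffix, index_suffix in INDEX_SUFFIXES.items():
            if path.name.endswith(data_suffix):
                # The index sits next to the data file, e.g. x.bam -> x.bam.bai.
                index = path.with_name(path.name + index_suffix)
                if not index.exists():
                    missing.append(path)
    return missing
```

For example, `missing_indexes("results/GROUP1/alignments")` (a hypothetical path) would return an empty list when every BAM has its .bai file.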
Picard MarkDuplicates
Output files
<group_id>/alignments/rna/
- <tumor_rna_id>.md.bam: Tumor RNA sample read alignments.
- <tumor_rna_id>.md.bam.bai: Tumor RNA sample read alignments index.
- <tumor_rna_id>.md.metrics: Tumor RNA sample read duplicate marking metrics.
Picard MarkDuplicates is used to mark duplicate reads following alignment.
Picard MarkDuplicates is only run on RNA alignments.
SNV, MNV, INDEL calling
SAGE
Output files
- <group_id>/sage/append/
  - <tumor_dna_id>.sage.append.vcf.gz: Tumor DNA sample small variant VCF with RNA data appended.
  - <normal_dna_id>.sage.append.vcf.gz: Normal DNA sample small variant VCF with RNA data appended.
- <group_id>/sage/somatic/
  - <normal_dna_id>.sage.bqr.png: Normal DNA sample base quality recalibration metrics plot.
  - <normal_dna_id>.sage.bqr.tsv: Normal DNA sample base quality recalibration metrics.
  - <tumor_dna_id>.sage.bqr.png: Tumor DNA sample base quality recalibration metrics plot.
  - <tumor_dna_id>.sage.bqr.tsv: Tumor DNA sample base quality recalibration metrics.
  - <tumor_dna_id>.sage.exon.medians.tsv: Tumor DNA sample exon median depths.
  - <tumor_dna_id>.sage.gene.coverage.tsv: Tumor DNA sample gene coverages.
  - <tumor_dna_id>.sage.somatic.filtered.vcf.gz.tbi: Tumor DNA sample filtered small variant calls index.
  - <tumor_dna_id>.sage.somatic.filtered.vcf.gz: Tumor DNA sample filtered small variant calls.
  - <tumor_dna_id>.sage.somatic.vcf.gz.tbi: Tumor DNA sample small variant calls index.
  - <tumor_dna_id>.sage.somatic.vcf.gz: Tumor DNA sample small variant calls.
- <group_id>/sage/germline/
  - <normal_dna_id>.sage.bqr.png: Normal DNA sample base quality recalibration metrics plot.
  - <normal_dna_id>.sage.bqr.tsv: Normal DNA sample base quality recalibration metrics.
  - <normal_dna_id>.sage.exon.medians.tsv: Normal DNA sample exon median depths.
  - <normal_dna_id>.sage.gene.coverage.tsv: Normal DNA sample gene coverages.
  - <tumor_dna_id>.sage.bqr.png: Tumor DNA sample base quality recalibration metrics plot.
  - <tumor_dna_id>.sage.bqr.tsv: Tumor DNA sample base quality recalibration metrics.
  - <tumor_dna_id>.sage.germline.filtered.vcf.gz.tbi: Normal DNA sample filtered small variant calls index.
  - <tumor_dna_id>.sage.germline.filtered.vcf.gz: Normal DNA sample filtered small variant calls.
  - <tumor_dna_id>.sage.germline.vcf.gz.tbi: Normal DNA sample small variant calls index.
  - <tumor_dna_id>.sage.germline.vcf.gz: Normal DNA sample small variant calls.
SAGE is an SNV, MNV, and INDEL caller optimised for 100x tumor and 40x normal coverage.
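SAGE publishes both an unfiltered and a filtered VCF for each call set. One quick way to compare them is to tally the FILTER column (the seventh tab-separated field in the VCF format) of each file. The sketch below uses only the Python standard library and assumes the .vcf.gz files can be read as ordinary gzip streams; the function name `tally_filters` is hypothetical.

```python
import gzip
from collections import Counter


def tally_filters(vcf_gz_path):
    """Count VCF records by the value of their FILTER column."""
    counts = Counter()
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1  # FILTER is the 7th VCF column
    return counts
```

Running it on <tumor_dna_id>.sage.somatic.vcf.gz and on <tumor_dna_id>.sage.somatic.filtered.vcf.gz gives a side-by-side view of how many records each filter value accounts for.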
PAVE
Output files
<group_id>/pave/
- <tumor_dna_id>.sage.germline.filtered.pave.vcf.gz.tbi: Annotated SAGE germline small variants index.
- <tumor_dna_id>.sage.germline.filtered.pave.vcf.gz: Annotated SAGE germline small variants.
- <tumor_dna_id>.sage.somatic.filtered.pave.vcf.gz.tbi: Annotated SAGE somatic small variants index.
- <tumor_dna_id>.sage.somatic.filtered.pave.vcf.gz: Annotated SAGE somatic small variants.
PAVE annotates variants called by SAGE with impact information covering transcript and coding effects.
SV calling
SvPrep
SvPrep runs prior to SV calling to reduce runtime by rapidly identifying reads that are likely to be involved in an SV event.
No outputs are published directly from SvPrep; see GRIPSS for the fully processed SV calling outputs.
GRIDSS
Output files
<group_id>/gridss/
- <tumor_dna_id>.gridss.vcf.gz: GRIDSS structural variants.
- <tumor_dna_id>.gridss.vcf.gz.tbi: GRIDSS structural variants index.
GRIDSS is an SV caller that uses both read support and local breakend/breakpoint assemblies to call variants.
GRIPSS
Output files
- <group_id>/gripss/germline/
  - <tumor_dna_id>.gripss.filtered.germline.vcf.gz: Filtered GRIDSS germline structural variants.
  - <tumor_dna_id>.gripss.filtered.germline.vcf.gz.tbi: Filtered GRIDSS germline structural variants index.
  - <tumor_dna_id>.gripss.germline.vcf.gz: GRIDSS structural variants (GRIPSS filters set but not applied).
  - <tumor_dna_id>.gripss.germline.vcf.gz.tbi: GRIDSS structural variants index (GRIPSS filters set but not applied).
- <group_id>/gripss/somatic/
  - <tumor_dna_id>.gripss.filtered.somatic.vcf.gz: Filtered GRIDSS somatic structural variants.
  - <tumor_dna_id>.gripss.filtered.somatic.vcf.gz.tbi: Filtered GRIDSS somatic structural variants index.
  - <tumor_dna_id>.gripss.somatic.vcf.gz: GRIDSS structural variants (GRIPSS filters set but not applied).
  - <tumor_dna_id>.gripss.somatic.vcf.gz.tbi: GRIDSS structural variants index (GRIPSS filters set but not applied).
GRIPSS applies filtering and post-processing to SV calls.
CNV calling
AMBER
Output files
<group_id>/amber/
- amber.version: AMBER version file.
- <tumor_dna_id>.amber.baf.pcf: Tumor DNA sample piecewise constant fit.
- <tumor_dna_id>.amber.baf.tsv.gz: Tumor DNA sample B-allele frequencies.
- <tumor_dna_id>.amber.contamination.tsv: Tumor DNA sample contamination TSV.
- <tumor_dna_id>.amber.contamination.vcf.gz: Tumor DNA sample contamination sites.
- <tumor_dna_id>.amber.contamination.vcf.gz.tbi: Tumor DNA sample contamination sites index.
- <tumor_dna_id>.amber.qc: AMBER QC file.
- <normal_dna_id>.amber.homozygousregion.tsv: Normal DNA sample regions of homozygosity.
AMBER generates B-allele frequencies in tumor samples for CNV calling in PURPLE.
COBALT
Output files
<group_id>/cobalt/
- cobalt.version: COBALT version file.
- <tumor_dna_id>.cobalt.gc.median.tsv: Tumor DNA sample GC median read depths.
- <tumor_dna_id>.cobalt.ratio.pcf: Tumor DNA sample piecewise constant fit.
- <tumor_dna_id>.cobalt.ratio.tsv.gz: Tumor DNA sample read counts and ratios (with reference or supposed diploid regions).
- <normal_dna_id>.cobalt.gc.median.tsv: Normal DNA sample GC median read depths.
- <normal_dna_id>.cobalt.ratio.median.tsv: Normal DNA sample chromosome median ratios.
- <normal_dna_id>.cobalt.ratio.pcf: Normal DNA sample piecewise constant fit.
COBALT generates read depth ratios (or an estimation for tumor-only) for CNV calling in PURPLE.
PURPLE
Output files
<group_id>/purple/
- circos/: Circos plot data.
- <tumor_dna_id>.purple.cnv.gene.tsv: Somatic gene copy number.
- <tumor_dna_id>.purple.cnv.somatic.tsv: Copy number variant segments.
- <tumor_dna_id>.purple.driver.catalog.germline.tsv: Normal DNA sample driver catalogue.
- <tumor_dna_id>.purple.driver.catalog.somatic.tsv: Tumor DNA sample driver catalogue.
- <tumor_dna_id>.purple.germline.deletion.tsv: Normal DNA deletions.
- <tumor_dna_id>.purple.germline.vcf.gz: Normal DNA SAGE small variants with PURPLE annotations.
- <tumor_dna_id>.purple.germline.vcf.gz.tbi: Normal DNA SAGE small variants with PURPLE annotations index.
- <tumor_dna_id>.purple.purity.range.tsv: Purity/ploidy model fit scores across a range of purity values.
- <tumor_dna_id>.purple.purity.tsv: Purity/ploidy summary.
- <tumor_dna_id>.purple.qc: PURPLE QC file.
- <tumor_dna_id>.purple.segment.tsv: Genomic copy number segments.
- <tumor_dna_id>.purple.somatic.clonality.tsv: Clonality peak model data.
- <tumor_dna_id>.purple.somatic.hist.tsv: Somatic variants histogram data.
- <tumor_dna_id>.purple.somatic.vcf.gz: Tumor DNA sample small variants with PURPLE annotations.
- <tumor_dna_id>.purple.somatic.vcf.gz.tbi: Tumor DNA sample small variants with PURPLE annotations index.
- <tumor_dna_id>.purple.sv.germline.vcf.gz: Germline structural variants with PURPLE annotations.
- <tumor_dna_id>.purple.sv.germline.vcf.gz.tbi: Germline structural variants with PURPLE annotations index.
- <tumor_dna_id>.purple.sv.vcf.gz: Somatic structural variants with PURPLE annotations.
- <tumor_dna_id>.purple.sv.vcf.gz.tbi: Somatic structural variants with PURPLE annotations index.
- plot/: PURPLE plots.
- purple.version: PURPLE version file.
PURPLE is a CNV caller that also infers tumor purity/ploidy and annotates both small and structural variant calls with copy-number information.
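Many of the outputs listed in this document are plain tab-separated tables, for example <tumor_dna_id>.purple.purity.tsv. The following sketch loads such a file into a list of dictionaries without assuming any particular column names; it does assume that the first line is a header row, which should be confirmed against the file itself. The function name `read_tsv` is hypothetical.

```python
import csv


def read_tsv(path):
    """Load a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Inspecting `read_tsv(path)[0].keys()` is a simple way to discover which columns a given table provides before writing any downstream code. Compressed tables (.tsv.gz) would need to be opened with `gzip.open(path, "rt")` instead.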
SV event interpretation
LINX
Output files
- <group_id>/linx/germline_annotations/
  - linx.version: LINX version file.
  - <tumor_dna_id>.linx.germline.breakend.tsv: Normal DNA sample breakend data.
  - <tumor_dna_id>.linx.germline.clusters.tsv: Normal DNA sample clustered events.
  - <tumor_dna_id>.linx.germline.disruption.tsv: Normal DNA sample disruption data.
  - <tumor_dna_id>.linx.germline.driver.catalog.tsv: Normal DNA sample driver catalogue.
  - <tumor_dna_id>.linx.germline.links.tsv: Normal DNA sample cluster links.
  - <tumor_dna_id>.linx.germline.svs.tsv: Normal DNA sample structural variants.
- <group_id>/linx/somatic_annotations/
  - linx.version: LINX version file.
  - <tumor_dna_id>.linx.breakend.tsv: Tumor DNA sample breakend data.
  - <tumor_dna_id>.linx.clusters.tsv: Tumor DNA sample clustered events.
  - <tumor_dna_id>.linx.driver.catalog.tsv: Tumor DNA sample driver catalogue.
  - <tumor_dna_id>.linx.drivers.tsv: Tumor DNA sample LINX drivers.
  - <tumor_dna_id>.linx.fusion.tsv: Tumor DNA sample fusions.
  - <tumor_dna_id>.linx.links.tsv: Tumor DNA sample cluster links.
  - <tumor_dna_id>.linx.svs.tsv: Tumor DNA sample structural variants.
  - <tumor_dna_id>.linx.vis_*: Tumor DNA sample visualisation data.
- <group_id>/linx/somatic_plots/
  - all/*png: All available tumor DNA sample cluster plots.
  - reportable/*png: Driver-only tumor DNA sample cluster plots.
LINX clusters PURPLE-annotated SVs into higher-order events and classifies these events within a biological context. Following clustering and interpretation, events are visualised as LINX plots.
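The LINX plot directory separates all cluster plots from the driver-only (reportable) ones. A short sketch for collecting both sets with the same `*png` patterns listed above is given here; the function name `linx_plots` and its arguments are hypothetical, and the directory layout is taken from this document.

```python
from pathlib import Path


def linx_plots(results_dir, group_id):
    """Collect LINX cluster plots, keyed by the 'all' and 'reportable' subsets."""
    plot_dir = Path(results_dir) / group_id / "linx" / "somatic_plots"
    return {
        subset: sorted((plot_dir / subset).glob("*png"))
        for subset in ("all", "reportable")
    }
```

Comparing `len(plots["all"])` with `len(plots["reportable"])` shows at a glance how many clusters were plotted and how many of those involve drivers.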
Transcript analysis
Isofox
Output files
<group_id>/isofox/
- <tumor_rna_id>.isf.alt_splice_junc.csv: Tumor RNA sample alternative splice junctions.
- <tumor_rna_id>.isf.fusions.csv: Tumor RNA sample fusions, unfiltered.
- <tumor_rna_id>.isf.gene_collection.csv: Tumor RNA sample gene-collection fragment counts.
- <tumor_rna_id>.isf.gene_data.csv: Tumor RNA sample gene fragment counts.
- <tumor_rna_id>.isf.pass_fusions.csv: Tumor RNA sample fusions, filtered.
- <tumor_rna_id>.isf.retained_intron.csv: Tumor RNA sample retained introns.
- <tumor_rna_id>.isf.summary.csv: Tumor RNA sample analysis summary file.
- <tumor_rna_id>.isf.transcript_data.csv: Tumor RNA sample transcript fragment counts.
Isofox analyses RNA alignment data to quantify transcripts, identify novel splice junctions, and call fusions.
Oncoviral detection
VIRUSBreakend
Output files
<group_id>/virusbreakend/
- <tumor_dna_id>.virusbreakend.vcf: Tumor DNA sample viral integration sites.
- <tumor_dna_id>.virusbreakend.vcf.summary.tsv: Tumor DNA sample analysis summary file.
VIRUSBreakend detects the presence of oncoviruses and integration sites in tumor samples.
Virus Interpreter
Output files
<group_id>/virusinterpreter/
- <tumor_dna_id>.virus.annotated.tsv: Processed oncoviral call/annotation data.
Virus Interpreter post-processes VIRUSBreakend calls to provide a higher-level interpretation of the data.
HLA calling
LILAC
Output files
<group_id>/lilac/
- <tumor_dna_id>.lilac.candidates.coverage.tsv: Coverage of high scoring candidates.
- <tumor_dna_id>.lilac.qc.tsv: LILAC QC file.
- <tumor_dna_id>.lilac.tsv: Analysis summary.
LILAC calls HLA Class I alleles and characterises allelic status (copy-number alterations, somatic mutations) in the tumor sample. Analysis can also incorporate RNA data as an indirect measurement of allele expression.
HRD status prediction
CHORD
Output files
<group_id>/chord/
- <tumor_dna_id>_chord_prediction.txt: Tumor DNA sample analysis summary file.
- <tumor_dna_id>_chord_signatures.txt: Tumor DNA sample variant counts contributing to signatures.
CHORD predicts the HRD status of a tumor using statistical inference on the basis of relative somatic mutation counts.
Mutational signature fitting
Sigs
Output files
<group_id>/sigs/
- <tumor_dna_id>.sig.allocation.tsv: Tumor DNA sample signature allocations.
- <tumor_dna_id>.sig.snv_counts.csv: Tumor DNA sample variant counts contributing to signatures.
Sigs fits defined COSMIC trinucleotide mutational signatures to tumor sample data.
Tissue of origin prediction
CUPPA
Output files
<group_id>/cuppa/
- <tumor_dna_id>_cup_report.pdf: Combined figure of summary and feature plot.
- <tumor_dna_id>.cup.data.csv: Model feature scores.
- <tumor_dna_id>.cup.report.features.png: Feature plot.
- <tumor_dna_id>.cup.report.summary.png: Summary plot.
- <tumor_dna_id>.cuppa.chart.png: CUPPA chart plot.
- <tumor_dna_id>.cuppa.conclusion.txt: Prediction conclusion file.
CUPPA predicts tissue of origin for a given tumor sample using DNA and/or RNA features generated by upstream hmftools components.
Neoepitope prediction
Neo
Output files
<group_id>/neo/
- <tumor_dna_id>.neo.neo_data.tsv: Neoepitope candidates.
- <tumor_dna_id>.neo.neoepitope.tsv: LINX fusion neoepitopes.
- <tumor_dna_id>.neo.peptide_scores.tsv: Peptide binding likelihood and scoring.
Neo builds comprehensive neoepitope predictions from DNA data with additional annotations made using RNA data.
Report generation
ORANGE
Output files
<group_id>/orange/
- <tumor_dna_id>.orange.json: Aggregated report data.
- <tumor_dna_id>.orange.pdf: Static report PDF.
ORANGE summarises and integrates key results from hmftools components into a single static PDF report.
linxreport
Output files
<group_id>/linx/
- <tumor_dna_id>_linx.html: Interactive HTML report.
linxreport generates an interactive report containing LINX annotations and plots.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
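When troubleshooting, a quick first step is to summarise task outcomes from execution_trace.txt. The sketch below assumes the default Nextflow trace layout: a tab-separated file with a header row that includes a `status` column. If the trace fields were customised for a run, the column name may differ. The function name `tally_task_status` is hypothetical.

```python
import csv
from collections import Counter


def tally_task_status(trace_path):
    """Count tasks in a Nextflow trace file by the value of the status column."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

A call such as `tally_task_status("results/pipeline_info/execution_trace.txt")` (a hypothetical path) returns a mapping from each status value to the number of tasks that ended in that state.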