nf-core/tumourevo
Analysis pipeline to model tumour clonal evolution from WGS data (driver annotation, quality control of copy number calls, subclonal and mutational signature deconvolution)

cancer-evolution, cancer-genomics, copy-number-variation, mutational-signatures, population-genetics, quality-control, subclonal-deconvolution, variant-annotation

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/tumourevo

Introduction

This document describes the output produced by the pipeline. All plots generated in each step are summarised into the final report. The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Variant Annotation
Formatter
Lifter
Catalogue Driver Annotation
QC
Subclonal Deconvolution
Signature Deconvolution

Directory Structure

The default directory structure is as follows:

{outdir}
├── bcftools
│   └── filter
│       └── <sample>
├── variant_annotation
│   └── vep
│       └── <sample>
├── driver_annotation
│   └── annotate_driver
│       └── <sample>
├── pipeline_info
├── formatter
│   ├── cna2cnaqc
│   │   └── <sample>
│   ├── cnaqc2tsv
│   │   └── <patient>
│   └── vcf2cnaqc
│       └── <sample>
├── lifter
│   ├── mpileup
│   │   └── <patient>
│   └── positions
│       └── <sample>
├── qc
│   ├── tinc
│   │   └── <sample>
│   ├── cnaqc
│   │   └── <sample>
│   └── join_cnaqc
│       └── <patient>
├── signature_deconvolution
│   ├── sigprofiler
│   │   └── <dataset>
│   └── sparsesignatures
│       └── <dataset>
└── subclonal_deconvolution
    ├── mobster
    │   └── <sample>
    ├── viber
    │   └── <patient>
    ├── pyclonevi
    │   └── <patient>
    └── ctree
        └── <patient>,<sample>
work/
.nextflow.log
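
As a minimal sketch of how these locations are composed, the helper below builds an output directory following the {outdir}/<step>/<tool>/<dataset>/<patient>/<sample>/ pattern used in the per-tool sections of this document. The function name `output_dir` and the identifiers `D1`, `P1` and `S1` are hypothetical and serve only as an illustration; they are not part of the pipeline.

```python
from pathlib import Path
from typing import Optional


def output_dir(outdir: str, step: str, tool: str, dataset: str,
               patient: Optional[str] = None,
               sample: Optional[str] = None) -> Path:
    """Compose {outdir}/<step>/<tool>/<dataset>[/<patient>[/<sample>]]."""
    if sample is not None and patient is None:
        raise ValueError("a sample-level path also needs a patient")
    parts = [step, tool, dataset]
    if patient is not None:
        parts.append(patient)
    if sample is not None:
        parts.append(sample)
    return Path(outdir).joinpath(*parts)


# Hypothetical identifiers, for illustration only.
vep_dir = output_dir("results", "variant_annotation", "vep", "D1", "P1", "S1")
print(vep_dir.as_posix())
```

Omitting `sample` gives a patient-level directory, and omitting both `patient` and `sample` gives a dataset-level directory such as those used by the signature deconvolution tools.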

Variant Annotation

This directory contains results from the variant annotation subworkflow. At the level of individual samples, genomic variants present in the input VCF files are annotated using VEP.

VEP

VEP (Variant Effect Predictor) is an Ensembl tool that determines the effect of all variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence. This step starts from VCF files.

Output files for all samples

Output directory: {outdir}/variant_annotation/vep/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>.annotated.vep.vcf.gz and <dataset>_<patient>_<sample>.annotated.vep.vcf.gz.tbi
- VCF file and tabix index with called mutations
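
The sketch below shows one way to read the annotations from such a file. It assumes VEP's VCF output convention, in which the annotations are stored in a `CSQ` INFO key whose sub-field order is declared in the `##INFO=<ID=CSQ,...>` header line. The header, the INFO string and the three sub-fields used here are toy values; real files list many more sub-fields, and the function names are hypothetical.

```python
import re
from typing import Dict, List


def csq_fields(header_line: str) -> List[str]:
    """Extract CSQ sub-field names from the '##INFO=<ID=CSQ,...>' header."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' description found in CSQ header")
    return match.group(1).split("|")


def parse_csq(info: str, fields: List[str]) -> List[Dict[str, str]]:
    """Return one dict per annotation block from a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(fields, block.split("|")))
                    for block in entry[len("CSQ="):].split(",")]
    return []  # the record carries no CSQ annotation


# Toy header and INFO column, for illustration only.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL">')
fields = csq_fields(header)
records = parse_csq(
    "DP=50;CSQ=T|missense_variant|TP53,T|intron_variant|WRAP53", fields)
print(records[0]["SYMBOL"], records[0]["Consequence"])
```

For a real run, the header and record lines could be read from the compressed file with `gzip.open(path, "rt")` from the standard library.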

Formatter

The Formatter subworkflow converts files between formats and standardizes the output files produced by different mutation callers (Mutect2, Strelka, Platypus) and CNA callers (ASCAT, Sequenza, Battenberg).

vcf2cnaqc

This parser is designed to process VCF files generated by various variant calling tools and to convert them into a unified RDS file format.

Output files for all samples

Output directory: {outdir}/formatter/vcf2cnaqc/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_snv.rds
- RDS file containing parsed VCF in table format

cna2cnaqc

This parser is designed to standardize copy number calls and purity estimates from various callers into a unified format.

Output files for all samples

Output directory: {outdir}/formatter/cna2cnaqc/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_cna.rds
- RDS file containing parsed segments and purity estimate output in table format

cnaqc2tsv

This parser converts the mutation data of the joint CNAqc analysis from CNAqc format (RDS file) into a tabular format (TSV file). The step is required by the Python-based tools (e.g. PyClone-VI, SigProfiler), so it is always run when --tools contains either pyclone-vi or sigprofiler.

Output files for all patients

Output directory: {outdir}/formatter/cnaqc2tsv/<dataset>/<patient>/

<dataset>_<patient>_joint_table.tsv
- TSV file containing mutations mapped to corresponding copy number segments.
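
The following sketch illustrates the rule stated above and how the resulting table could be loaded. It assumes that `--tools` takes a comma-separated list, which is not stated in this document, and the column names in the toy table are hypothetical; the real columns of the joint table are not listed here.

```python
import csv
import io
from typing import List, Set

# Tools named in the text as requiring the joint TSV table.
TSV_CONSUMERS: Set[str] = {"pyclone-vi", "sigprofiler"}


def needs_joint_table(tools: str) -> bool:
    """True if a comma-separated --tools value includes a TSV consumer."""
    selected = {t.strip().lower() for t in tools.split(",") if t.strip()}
    return bool(selected & TSV_CONSUMERS)


def read_joint_table(handle) -> List[dict]:
    """Load a tab-separated table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Toy table with hypothetical column names.
toy = io.StringIO("chr\tfrom\tsample_id\nchr1\t100\tS1\nchr1\t250\tS2\n")
rows = read_joint_table(toy)
print(needs_joint_table("mobster,pyclone-vi"), len(rows))
```

With a real file, `read_joint_table` would be called on `open(path, newline="")` for the `<dataset>_<patient>_joint_table.tsv` file.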

Lifter

The Lifter subworkflow is optional. It runs when a patient has multiple samples but variant calling was performed separately on each sample, so that one VCF file is provided per sample instead of a single multi-sample VCF file with one column per sample. For patients with multiple samples, the user can therefore provide either a single joint VCF file containing the variant calls of all tumor samples (obtained from a joint mutation calling analysis) or individual sample-specific VCF files.

In the latter case, paths to the tumor BAM files must be provided so that all mutations can be collected across samples and the private mutations of each sample can be piled up in all the other samples. Two intermediate steps, get_positions and mpileup, identify the private mutations in all samples and retrieve their variant allele frequency. Once the private mutations are defined, they are merged back into the original VCF file during the join_positions step. The updated VCF file is then converted into an RDS object. The results of the Lifter subworkflow are collected in the {outdir}/lifter/ directory.
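
The idea of collecting mutations across samples and finding the positions that must be piled up in each sample can be sketched with set operations. This is an illustration of the logic described above, not the pipeline's implementation; the function name, the tuple layout and the toy calls are assumptions.

```python
from typing import Dict, Set, Tuple

Mutation = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def positions_to_pileup(calls: Dict[str, Set[Mutation]]) -> Dict[str, Set[Mutation]]:
    """For each sample, list mutations called in other samples but not in it."""
    all_mutations: Set[Mutation] = set().union(*calls.values())
    return {sample: all_mutations - own for sample, own in calls.items()}


# Toy per-sample call sets for one hypothetical patient.
calls = {
    "S1": {("chr1", 100, "A", "T"), ("chr2", 5, "G", "C")},
    "S2": {("chr1", 100, "A", "T"), ("chr3", 7, "C", "T")},
}
missing = positions_to_pileup(calls)
for sample in sorted(missing):
    print(sample, sorted(missing[sample]))
```

In this toy case the shared mutation on chr1 needs no pile-up, while each sample's private mutation is queried in the other sample to recover its read counts there.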

mpileup

At this stage, bcftools mpileup is run to retrieve frequency information of private mutations across all samples.

Output files for all samples

Output directory: {outdir}/lifter/mpileup/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>.bcftools_stats.txt
- TXT file with statistics on the called mutations
<dataset>_<patient>_<sample>.vcf.gz and <dataset>_<patient>_<sample>.vcf.gz.tbi
- VCF file and tabix index with called mutations

get_positions

This step retrieves private and shared mutations across samples originating from the same patient. The retrieved mutations are joined with the original mutations present in the input VCF, which is in turn converted into an RDS object using vcfR.

Output files for all samples

Output directory: {outdir}/lifter/positions/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>.positions_missing
- TXT file containing mutations to be retrieved for a given sample
<dataset>_<patient>_<sample>_pileup_VCF.rds
- RDS containing retrieved mutations from pileup

Output files for all patients

Output directory: {outdir}/lifter/positions/<dataset>/<patient>/

<dataset>_<patient>_all_positions.rds
- RDS containing shared and private mutations of all samples

Catalogue Driver Annotation

This directory contains results from the driver annotation subworkflow.

Tumour-type driver annotation

According to the specified tumour type, potential driver mutations are identified and annotated using the IntOGen database. The user can also provide a custom table of driver genes.

Output files for all samples

Output directory: {outdir}/driver_annotation/annotate_driver/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_driver.rds
- RDS with annotated mutations

QC

The QC subworkflow requires as input a segmentation file from an allele-specific copy number caller (Sequenza, ASCAT or Battenberg) and the joint VCF file. As a first step, the TINC step estimates the contamination of normal and tumour samples, which provides a measure of experimental quality. The CNAqc step then performs quality control on copy number and somatic mutation data for individual samples, and the join CNAqc step summarizes the validated information at patient level.

The QC subworkflow is a crucial step of the pipeline as it ensures high confidence in identifying clonal and subclonal events while accounting for variations in tumor purity.

The results of the QC subworkflow are collected in the {outdir}/qc/ directory.

TINC

TINC is a package to calculate the contamination of tumor DNA in a matched normal sample. TINC provides a method to determine, for every matched pair of normal and tumour biopsies, the proportion of cancer cells, or tumour read fraction, contaminating the normal sample (Tumour in Normal, TIN). Similarly, it determines the proportion of cancer cells in the tumour sample (Tumour in Tumour, TIT), also called tumour purity.

Output files for all samples

Output directory: {outdir}/qc/tinc/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_fit.rds
- TINC fit containing TIN and TIT estimates in RDS
<dataset>_<patient>_<sample>_plot.rds and <dataset>_<patient>_<sample>_plot.pdf
- TINC report containing TIN and TIT plots in PDF and RDS
<dataset>_<patient>_<sample>_qc.csv
- TINC summary report on normal contamination

CNAqc

CNAqc is a package that performs quality control of bulk cancer sequencing data for validating copy number segmentations against variant allele frequencies of somatic mutations.
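
To give an intuition for how a copy number segmentation can be checked against variant allele frequencies, the sketch below computes the VAF at which clonal mutations are expected to peak, given tumour purity and the allele-specific copy number of a segment. It uses a commonly used relation that assumes a diploid normal contaminant; this document does not state the exact model CNAqc applies, so the formula and the function name should be read as assumptions.

```python
def expected_vaf(purity: float, minor: int, major: int, multiplicity: int) -> float:
    """Expected VAF of a clonal mutation on a segment with major:minor copies.

    multiplicity is the number of tumour copies carrying the mutation.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if not 1 <= multiplicity <= major:
        raise ValueError("multiplicity must be between 1 and the major copy number")
    tumour_copies = minor + major
    # Mutant alleles come only from tumour cells; the denominator counts all
    # alleles at the locus, from normal (2 copies) and tumour cells.
    return multiplicity * purity / (2 * (1 - purity) + purity * tumour_copies)


# Heterozygous diploid segment (1:1) in a sample with 50% purity.
print(expected_vaf(purity=0.5, minor=1, major=1, multiplicity=1))
# Trisomy (2:1) at 80% purity: mutations on one or two copies give two peaks.
print(expected_vaf(0.8, minor=1, major=2, multiplicity=1),
      expected_vaf(0.8, minor=1, major=2, multiplicity=2))
```

If the observed VAF peaks of a segment lie far from these expected values, the purity estimate or the copy number call for that segment is suspect.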

Output files for all samples

Output directory: {outdir}/qc/cnaqc/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_data_plot.rds and <dataset>_<patient>_<sample>_data.pdf
- CNAqc report with genome wide mutation and allele specific copy number plots in RDS and PDF
<dataset>_<patient>_<sample>_qc_plot.rds and <dataset>_<patient>_<sample>_qc.pdf
- QC step report resulting from peak analysis in RDS and PDF
<dataset>_<patient>_<sample>_qc.rds
- CNAqc RDS object

join_CNAqc

This module creates a multi-CNAqc object for each patient by summarizing the quality checks performed at the single-sample level. For more information about the structure of the multi-CNAqc object, see the CNAqc documentation.

Output files for all patients

Output directory: {outdir}/qc/join_cnaqc/<dataset>/<patient>/

<dataset>_<patient>_multi_cnaqc.rds
- mCNAqc RDS object

Subclonal Deconvolution

The subclonal deconvolution subworkflow requires a joint mCNAqc object resulting from the join_CNAqc step as input. The subworkflow will perform multi-sample deconvolution if more than one sample for each patient is present.

The results of the subclonal deconvolution step are collected in the {outdir}/subclonal_deconvolution/ directory.

MOBSTER

MOBSTER is a package that models mutant allelic frequencies and copy-number status by integrating evolutionary theory and Bayesian probabilistic modelling to identify clusters of variants with similar cellular proportions. Furthermore, MOBSTER models the dynamics of passenger mutations via a Pareto distribution, giving rise to the so-called neutral tail.
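
To illustrate why a Pareto distribution produces a tail, the sketch below evaluates the density of a Pareto Type I distribution, which decays as a power law above a minimum value. The parameter names `scale` and `shape` and the toy values are assumptions; MOBSTER's own parameterisation is not described in this document.

```python
def pareto_pdf(x: float, scale: float, shape: float) -> float:
    """Density of a Pareto Type I distribution; zero below the scale."""
    if scale <= 0 or shape <= 0:
        raise ValueError("scale and shape must be positive")
    if x < scale:
        return 0.0
    return shape * scale ** shape / x ** (shape + 1)


# Toy values: the density falls off quickly as the frequency increases.
for vaf in (0.05, 0.10, 0.20, 0.40):
    print(vaf, pareto_pdf(vaf, scale=0.05, shape=1.0))
```

Low-frequency mutations are therefore much more abundant in the tail than high-frequency ones, which is the pattern the text refers to as the neutral tail.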

Output files for all samples

Output directory: {outdir}/subclonal_deconvolution/mobster/<dataset>/<patient>/<sample>/

<dataset>_<patient>_<sample>_mobster_fit.rds
- RDS object containing all fits of subclonal deconvolution
<dataset>_<patient>_<sample>_mobster_best_fit.rds
- RDS object containing the best fit of subclonal deconvolution
<dataset>_<patient>_<sample>_mobster_best_plots.rds
- summary plots of the best fit in RDS
<dataset>_<patient>_<sample>_report.{rds,png,pdf}
- report of MOBSTER deconvolution in RDS, PNG and PDF format

PyClone-VI

PyClone-VI is a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers, by considering allele fractions and coincident copy number variation using a variational inference approach. It works for patients with both single and multiple samples.

Output files for all patients

Output directory: {outdir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>

<dataset>_<patient>_pyclone_input.tsv
- TSV file with PyClone-VI input table
<dataset>_<patient>_pyclone_input_all_samples.tsv
- TSV file with PyClone-VI input table of all samples
<dataset>_<patient>_all_fits.h5
- HDF5 file for all possible fit and summary stats
<dataset>_<patient>_best_fit.txt
- TSV file for the best fit
<dataset>_<patient>_cluster_table.csv
- CSV file with clone assignment

VIBER

VIBER is an R package that implements a variational Bayesian model to fit multivariate Binomial mixtures. In the context of subclonal deconvolution, VIBER models read counts that are associated with the most represented karyotype. It works for patients with both single and multiple samples.

Output files for all patients

Output directory: {outdir}/subclonal_deconvolution/viber/<dataset>/<patient>

<dataset>_<patient>_viber_best_fit.rds
- RDS file for best standard fit
<dataset>_<patient>_viber_best_heuristic_fit.rds
- RDS file for best standard fit with applied heuristic
<dataset>_<patient>_viber_best_fit_plots.rds
- RDS file containing summary plots for best standard fit
<dataset>_<patient>_viber_best_heuristic_fit_plots.rds
- RDS file containing summary plots for best standard fit with applied heuristic
<dataset>_<patient>_viber_report.{rds,pdf,png}
- report of VIBER deconvolution in RDS, PDF and PNG

ctree

Subclonal deconvolution results are used to build clone trees from both single samples and multiple samples using ctree. ctree is an R-based package that implements basic functions to create, manipulate and visualize clone trees by modelling Cancer Cell Fraction (CCF) clusters. Annotated driver genes must be provided in the input data.

Note

When --tools pyclone-vi is used, the output of PyClone-VI subclonal deconvolution is preprocessed prior to clone tree inference. Since ctree requires labeling one of the clusters as “clonal,” the one with the highest CCF across all samples is chosen.

VIBER and MOBSTER fits are already compatible for ctree analysis.
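
The selection of the clonal cluster can be sketched as follows. The note does not specify how "highest CCF across all samples" is computed, so this example assumes the highest mean CCF over samples; the function name, cluster labels and CCF values are hypothetical.

```python
from statistics import mean
from typing import Dict


def pick_clonal_cluster(ccf: Dict[str, Dict[str, float]]) -> str:
    """Return the cluster whose mean CCF over samples is highest.

    ccf maps cluster label -> {sample: CCF of that cluster in the sample}.
    """
    if not ccf:
        raise ValueError("no clusters supplied")
    return max(ccf, key=lambda cluster: mean(ccf[cluster].values()))


# Toy two-sample patient with three clusters.
ccf = {
    "C0": {"S1": 0.98, "S2": 0.95},
    "C1": {"S1": 0.40, "S2": 0.05},
    "C2": {"S1": 0.00, "S2": 0.60},
}
print(pick_clonal_cluster(ccf))
```

Other aggregations, such as the highest minimum CCF over samples, would be equally plausible readings of the rule and can give a different answer on less clear-cut data.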

Output files for all samples

Output directory: {outdir}/subclonal_deconvolution/ctree/<dataset>/{<patient>,<patient>/<sample>/}

{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>.rds
- RDS file containing inferred clone tree
{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>_plots.rds
- RDS file for clone tree plot
{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>_report.{rds,png,pdf}
- ctree report as RDS, PNG and PDF

Signature Deconvolution

Mutational signatures are distinctive patterns of somatic mutations in cancer genomes that reveal the underlying mutational processes driving tumor evolution and progression. These signatures are identified by analyzing aggregated point-mutation counts from multiple samples.

Validated mutations from the join_CNAqc step are converted into a joint TSV table (see cnaqc2tsv) and then input into the signature deconvolution subworkflow, which performs de novo extraction, inference, interpretation, or deconvolution of mutational counts.
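
As an illustration of what "mutational counts" means for single-base substitutions, the sketch below labels each substitution by its trinucleotide context, reported on the strand where the reference base is a pyrimidine, and counts the labels. This is the widely used 96-channel convention; it is shown here as a generic example rather than as the code run by either tool, and the function names and toy mutations are assumptions.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string made of A, C, G and T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbs96_channel(context: str, alt: str) -> str:
    """Label a substitution by its pyrimidine-centred trinucleotide context."""
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or not set(context) <= set(COMPLEMENT):
        raise ValueError("context must be three bases from A, C, G, T")
    if alt not in COMPLEMENT or alt == context[1]:
        raise ValueError("alt must be a base different from the reference")
    if context[1] in "AG":  # purine reference: report the opposite strand
        context, alt = reverse_complement(context), COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


# Toy mutations as (reference trinucleotide, alternate base).
mutations = [("ACG", "T"), ("TGA", "A"), ("CGT", "A")]
counts = Counter(sbs96_channel(ctx, alt) for ctx, alt in mutations)
print(counts)
```

Counting these labels per sample yields one row of the count matrix from which signatures are extracted.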

Two tools can be selected with the --tools parameter: SparseSignatures and SigProfiler. The results of this step are collected in {outdir}/signature_deconvolution/.

SparseSignatures

SparseSignatures is an R-based computational framework that provides a set of functions to extract and visualize the mutational signatures that best explain the mutation counts of a large number of patients.

Output files for dataset

Output directory: {outdir}/signature_deconvolution/sparsesignatures/<dataset>/

<dataset>_mut_counts.rds
- RDS of trinucleotide mutation counts of original data
<dataset>_best_params_config.rds
- signatures best configuration object
<dataset>_cv_means_mse.rds
- cross validation output RDS
<dataset>_nmf_Lasso_out.rds
- NMF Lasso output RDS
<dataset>_plot_all.pdf and <dataset>_plot_all.rds
- exposure plot in PDF and RDS

SigProfiler

SigProfiler is a Python framework that allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerExtractor, seamlessly integrating with other SigProfiler tools.

Output files for all samples

Output directory: {outdir}/signature_deconvolution/sigprofiler/<dataset>/results

input/
- folder containing a copy of the user-provided input files for SigProfilerMatrixGenerator step
input_data.txt
- joint table of all mutations in the dataset in TXT
output/
- folder containing the DBS, SBS, INDEL nucleotide matrices resulting from the SigProfilerMatrixGenerator step
{SBS96,DBS78,ID83}/
- folder containing the results of the SigProfilerExtractor step in the SBS, DBS and ID mutational contexts. This directory will contain:
  - {SBS96,DBS78,ID83}/All_Solutions/
    - subdirectory containing the results from running extractions at each rank within the range of the input. For more details visit the official website
  - {SBS96,DBS78,ID83}/Suggested_Solution/
    - subdirectory containing the optimal solution. For more details visit the official website
  - JOB_METADATA.txt
    - TXT file containing all the metadata about the system and runtime of the job
  - Seeds.txt
    - TXT file containing the replicate IDs and preset seeds

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
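
As a small troubleshooting aid, the sketch below counts tasks by outcome in a Nextflow trace file. It assumes the trace is tab-separated and includes the `status` column, which is part of Nextflow's default trace fields; the toy rows and the function name are made up for illustration.

```python
import csv
import io
from collections import Counter


def task_status_counts(handle) -> Counter:
    """Count tasks per value of the 'status' column in a Nextflow trace file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["status"] for row in reader)


# Toy trace content with a subset of columns.
toy_trace = io.StringIO(
    "task_id\tname\tstatus\texit\n"
    "1\tSTEP_A (S1)\tCOMPLETED\t0\n"
    "2\tSTEP_B (S1)\tFAILED\t1\n"
    "3\tSTEP_A (S2)\tCOMPLETED\t0\n"
)
counts = task_status_counts(toy_trace)
print(dict(counts))
```

For a real run, the function would be called on `open(path, newline="")` for the trace file in pipeline_info/.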

Reference files

Several tools in the pipeline generate reference files. If reference files for VEP and SigProfiler are not provided, they are downloaded and stored in the tool-specific folder.

VEP

When the VEP cache is not specified, the required VEP cache is downloaded to {outdir}/references/VEP/vep_cache/homo_sapiens/{VEP_version}_{ref_genome}.

SigProfiler

Reference genome for SigProfiler is stored in the following folder:

Output files

Output directory: {outdir}/subclonal_deconvolution/signature_deconvolution/SigProfiler/genome/tsb/{ref_genome}

{chromosome}.txt
- genome assembly at chromosome level
{ref_genome}_proportions.txt
- genome assembly proportions
