nf-core/tumourevo
Analysis pipeline to model tumour clonal evolution from WGS data (driver annotation, quality control of copy number calls, subclonal and mutational signature deconvolution)
Introduction
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
This document describes the output produced by the pipeline. All plots generated in each step are summarised into the final report.
Pipeline overview
The pipeline is built using Nextflow and consists of five main subworkflows:
- Variant Annotation - annotation of variants and cohort summary visualization
- Driver Annotation - annotation of potential driver mutations
- QC - quality control of copy-number and somatic mutation calling and creation of the multi-CNAqc object
- Subclonal Deconvolution
- Signature Deconvolution
Intermediate steps connecting the main subworkflows output unpublished results, which are available in the working directory of the pipeline. These steps consist of:
- Formatter - conversion of files to different formats
- Lifter - pileup of private mutations of the other samples in the multi-sample setting
Variant Annotation
This directory contains results from the variant annotation subworkflow. At the level of individual samples, genomic variants present in the input VCF files are annotated using VEP.
VEP
VEP (Variant Effect Predictor) is an Ensembl tool that determines the effect of all variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence.
This step starts from VCF files.
Output files for all samples
Output directory: {publish_dir}/variant_annotation/VEP/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>.vcf.gz
and <dataset>_<patient>_<sample>.vcf.gz.tbi
- annotated VCF file with tabix index
Driver Annotation
This directory contains results from the driver annotation subworkflow.
Tumour-type driver annotation
According to the specified tumour type, potential driver mutations are identified and annotated using the IntOGen database.
Output files for all samples
Output directory: {publish_dir}/driver_annotation/annotate_driver/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_driver.rds
- RDS with annotated mutations
QC
The QC subworkflow takes as input a segmentation file from an allele-specific copy number caller (either Sequenza or ASCAT) and the joint VCF file from the join_positions subworkflow. It first performs quality control on CNV and somatic mutation data for individual samples in the CNAqc step, and then summarizes the validated information at patient level in the join_CNAqc step. This subworkflow is a crucial step of the pipeline, as it ensures high confidence in identifying clonal and subclonal events while accounting for variations in tumor purity.
TINC
TINC is a package that estimates the contamination of a matched normal sample by tumor DNA. It provides estimates of the proportion of cancer cells contaminating the normal sample (TIN) and of the proportion of cancer cells in the tumor sample (TIT, i.e. tumor purity).
Output files for all samples
Output directory: {publish_dir}/QC/tinc/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_fit.rds
- TINC fit containing TIN and TIT estimates in RDS
<dataset>_<patient>_<sample>_plot.rds
and <dataset>_<patient>_<sample>_plot.pdf
- TINC report containing TIN and TIT plots in PDF and RDS
<dataset>_<patient>_<sample>_qc.csv
- TINC summary report on normal contamination
CNAqc
CNAqc is a package for quality control (QC) of bulk cancer sequencing data that validates copy number segmentations against the variant allele frequencies of somatic mutations.
Output files for all samples
Output directory: {publish_dir}/QC/CNAqc/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_data_plot.rds
and <dataset>_<patient>_<sample>_data.pdf
- CNAqc report with genome-wide mutation and allele-specific copy number plots in RDS and PDF
<dataset>_<patient>_<sample>_qc_plot.rds
and <dataset>_<patient>_<sample>_qc.pdf
- QC step report resulting from peak analysis in RDS and PDF
<dataset>_<patient>_<sample>_qc.rds
- CNAqc RDS object
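To make the idea behind the peak analysis more concrete, the sketch below computes the VAF at which clonal mutations are expected to peak for a given copy number segment and purity, and compares it with an observed peak. It uses the standard purity and ploidy relationship and assumes a diploid normal genome. The function names and the `tolerance` value are illustrative assumptions, not part of CNAqc or of this pipeline.

```python
def expected_vaf(multiplicity: int, total_copies: int, purity: float) -> float:
    """Expected VAF of a clonal mutation carried by `multiplicity` copies of a
    tumour segment with `total_copies` alleles, at the given tumour purity.

    Assumes the contaminating normal cells are diploid (2 alleles per locus).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if not 1 <= multiplicity <= total_copies:
        raise ValueError("multiplicity must be between 1 and total_copies")
    tumour_alleles = purity * total_copies   # alleles contributed by tumour cells
    normal_alleles = 2.0 * (1.0 - purity)    # alleles contributed by normal cells
    return multiplicity * purity / (normal_alleles + tumour_alleles)


def peak_matches(observed_peak: float, multiplicity: int, total_copies: int,
                 purity: float, tolerance: float = 0.05) -> bool:
    """True if the observed VAF peak lies within `tolerance` of the expectation.

    `tolerance` is a hypothetical threshold chosen for illustration only.
    """
    expected = expected_vaf(multiplicity, total_copies, purity)
    return abs(observed_peak - expected) <= tolerance
```

For example, a heterozygous mutation in a diploid segment (`multiplicity=1`, `total_copies=2`) of a sample with purity 0.8 is expected to peak at a VAF of about 0.4. A segmentation or purity estimate that places the peak far from the observed one would be flagged by this kind of check.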
join_CNAqc
This module creates a multi-CNAqc object for each patient by summarizing the quality checks performed at the single-sample level. For more information about the structure of the multi-CNAqc object, see the CNAqc documentation.
Output files for all patients
Output directory: {publish_dir}/QC/join_CNAqc/<dataset>/<patient>/
<dataset>_<patient>_multi_cnaqc_ALL.rds
- unfiltered multi-CNAqc RDS object
<dataset>_<patient>_multi_cnaqc_PASS.rds
- filtered multi-CNAqc RDS object
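As a convenience, the QC file layout described above can be turned into a small check that lists which expected outputs are missing after a run. This is only a sketch built from the directory and file names documented in this section. The helper names `expected_qc_files` and `missing_files` are hypothetical, and `publish_dir` stands for whatever results directory was used for the run.

```python
from pathlib import Path


def expected_qc_files(publish_dir: str, dataset: str, patient: str, sample: str) -> list[Path]:
    """Build the QC output paths documented for one sample and its patient."""
    qc = Path(publish_dir) / "QC"
    prefix = f"{dataset}_{patient}_{sample}"
    tinc = qc / "tinc" / dataset / patient / sample
    cnaqc = qc / "CNAqc" / dataset / patient / sample
    joint = qc / "join_CNAqc" / dataset / patient
    return [
        # TINC outputs (per sample)
        tinc / f"{prefix}_fit.rds",
        tinc / f"{prefix}_plot.rds",
        tinc / f"{prefix}_plot.pdf",
        tinc / f"{prefix}_qc.csv",
        # CNAqc outputs (per sample)
        cnaqc / f"{prefix}_data_plot.rds",
        cnaqc / f"{prefix}_data.pdf",
        cnaqc / f"{prefix}_qc_plot.rds",
        cnaqc / f"{prefix}_qc.pdf",
        cnaqc / f"{prefix}_qc.rds",
        # join_CNAqc outputs (per patient)
        joint / f"{dataset}_{patient}_multi_cnaqc_ALL.rds",
        joint / f"{dataset}_{patient}_multi_cnaqc_PASS.rds",
    ]


def missing_files(paths: list[Path]) -> list[Path]:
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]
```

Calling `missing_files(expected_qc_files("results", "my_dataset", "patient1", "sample1"))` returns an empty list when all documented QC files are present. The argument values here are placeholders.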
Signature Deconvolution
Mutational signatures are distinctive patterns of somatic mutations in cancer genomes that reveal the underlying mutational processes driving tumor evolution and progression. These signatures are identified by analyzing aggregated point-mutation counts from multiple samples. Validated mutations from the join_CNAqc step are converted into a joint TSV table (see tsvparse) and then input into the signature deconvolution subworkflow, which performs de novo extraction, inference, interpretation, or deconvolution of mutational counts.
The results of this step are collected in {publish_dir}/signature_deconvolution/. Two tools can be specified using the --tools parameter: SparseSignatures and SigProfiler.
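The "aggregated point-mutation counts" that signature tools work on can be illustrated with a deliberately simplified sketch. It counts the six pyrimidine-normalised substitution classes per sample. Real signature analyses usually work with 96 trinucleotide-context channels, which would require the reference genome, so this is not what SparseSignatures or SigProfiler compute internally. All names below are illustrative assumptions.

```python
from collections import Counter

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTION_CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_class(ref: str, alt: str) -> str:
    """Map a single-base substitution to its pyrimidine-normalised class."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in _COMPLEMENT or alt not in _COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = _COMPLEMENT[ref], _COMPLEMENT[alt]
    return f"{ref}>{alt}"


def count_matrix(mutations):
    """Aggregate (sample, ref, alt) records into per-sample count vectors.

    Each vector follows the order of SUBSTITUTION_CLASSES.
    """
    counts = {}
    for sample, ref, alt in mutations:
        counts.setdefault(sample, Counter())[substitution_class(ref, alt)] += 1
    return {s: [c[k] for k in SUBSTITUTION_CLASSES] for s, c in counts.items()}
```

A deconvolution method then factorises the resulting samples-by-channels matrix into signatures and per-sample exposures.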
SparseSignatures
SparseSignatures is an R-based computational framework that provides a set of functions to extract and visualize the mutational signatures that best explain the mutation counts of a large number of patients.
Output files for dataset
Output directory: {publish_dir}/signatures_deconvolution/SparseSig/<dataset>/
best_params_config.rds
- signatures best configuration object
cv_means_mse.rds
- cross validation output RDS
nmf_Lasso_out.rds
- NMF Lasso output RDS
plot_signatures.pdf
- exposure PDF plot
plot_signatures.rds
- exposure RDS plot
SigProfiler
SigProfiler is a Python framework that allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting, seamlessly integrating with other SigProfiler tools.
Output files for all samples
Output directory: {publish_dir}/signatures_deconvolution/SigProfiler/<dataset>/
<dataset>.
missing- missing
Subclonal Deconvolution
The subclonal deconvolution subworkflow takes as input a joint mCNAqc object resulting from the join_CNAqc step. The subworkflow performs multi-sample deconvolution if more than one sample is present for a patient.
The results of the subclonal deconvolution step are collected in the {publish_dir}/subclonal_deconvolution/ directory.
MOBSTER
MOBSTER processes mutant allelic frequencies to identify and remove neutral tails from the input data, so that subclonal reconstruction algorithms can be applied downstream to find subclones from the processed read counts.
Output files for all samples
Output directory: {publish_dir}/subclonal_deconvolution/mobster/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_mobsterh_st_fit.rds
- RDS object containing all fits of subclonal deconvolution
<dataset>_<patient>_<sample>_mobsterh_st_best_fit.rds
- RDS object containing the best fit of subclonal deconvolution
<dataset>_<patient>_<sample>_mobsterh_st_best_fit_plots.rds
- summary plots of best fits in RDS
<dataset>_<patient>_<sample>_REPORT_plots_mobster.{rds,png,pdf}
- report of mobster deconvolution in RDS, PDF and PNG format
PyClone-VI
PyClone-VI is a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers, by considering allele fractions and coincident copy number variation using a variational inference approach.
Output files for all samples
Output directory: {publish_dir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>
<dataset>_<patient>_pyclone_input.tsv
- TSV file with PyClone-VI input table
<dataset>_<patient>_all_fits.h5
- HDF5 file with all possible fits and summary statistics
<dataset>_<patient>_best_fit.txt
- TSV file for the best fit
<dataset>_<patient>_cluster_table.csv
- CSV file with clone assignment
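A quick way to inspect the best-fit table is to count how many mutations fall in each cluster. The sketch below assumes that the best-fit TSV keeps PyClone-VI's usual result columns, in particular `mutation_id` and `cluster_id`. This is an assumption about the file contents rather than something stated in this document, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def mutations_per_cluster(tsv_text: str) -> dict[str, int]:
    """Count distinct mutations assigned to each cluster.

    Assumes tab-separated text with `mutation_id` and `cluster_id` columns
    (one row per mutation and sample, so mutations are de-duplicated).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = defaultdict(set)
    for row in reader:
        seen[row["cluster_id"]].add(row["mutation_id"])
    return {cluster: len(muts) for cluster, muts in sorted(seen.items())}
```

The text can be obtained with `Path(best_fit_path).read_text()`, where `best_fit_path` points to the `<dataset>_<patient>_best_fit.txt` file of interest.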
VIBER
VIBER is an R package that implements a variational Bayesian model to fit multivariate Binomial mixtures. In the context of subclonal deconvolution in single-sample mode, VIBER models the read counts associated with the most represented karyotype.
Output files for all samples
Output directory: {publish_dir}/subclonal_deconvolution/viber/<dataset>/<patient>
<dataset>_<patient>_viber_best_st_fit.rds
- RDS file for best standard fit
<dataset>_<patient>_viber_best_st_fit_plots.rds
- RDS file containing summary plots for best standard fit
<dataset>_<patient>_viber_best_st_heuristic_fit.rds
- RDS file for best standard fit with applied heuristic
<dataset>_<patient>_viber_best_st_heuristic_fit_plots.rds
- RDS file containing summary plots for best standard fit with applied heuristic
<dataset>_<patient>_viber_best_st_mixing_plots.rds
- RDS file containing mixing proportion plot for best standard fit
<dataset>_<patient>_viber_best_st_heuristic_mixing_plots.rds
- RDS file containing mixing proportion plot for best standard fit with applied heuristic
<dataset>_<patient>_REPORT_plots_viber.{rds,pdf,png}
- report of VIBER deconvolution in RDS, PDF and PNG
ctree
Subclonal deconvolution results are used to build clone trees from both single samples and multiple samples using ctree. ctree is an R-based package that implements basic functions to create, manipulate and visualize clone trees by modelling Cancer Cell Fraction (CCF) clusters. Annotated driver genes must be provided in the input data.
NB: When --tools pyclone-vi is used, the output of PyClone-VI subclonal deconvolution is preprocessed prior to clone tree inference. Since ctree requires labeling one of the clusters as “clonal”, the one with the highest CCF across all samples is chosen. VIBER and MOBSTER fits are already compatible with ctree analysis.
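The choice of the clonal cluster can be sketched as follows. The pipeline's exact criterion is not spelled out here, so this example interprets "highest CCF across all samples" as the highest mean CCF over samples, which is an assumption. The function name and input layout are also hypothetical.

```python
from statistics import mean


def pick_clonal_cluster(ccf_by_cluster: dict[str, dict[str, float]]) -> str:
    """Return the cluster id with the highest mean CCF across samples.

    `ccf_by_cluster` maps cluster id -> {sample id -> CCF of that cluster}.
    Using the mean is an illustrative choice; ties go to the first cluster.
    """
    if not ccf_by_cluster:
        raise ValueError("no clusters provided")
    return max(ccf_by_cluster, key=lambda c: mean(ccf_by_cluster[c].values()))
```

With two clusters whose CCFs are `{"s1": 0.98, "s2": 0.95}` and `{"s1": 0.40, "s2": 0.00}`, the first one would be labelled clonal.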
Output files for all samples
Output directory: {publish_dir}/subclonal_deconvolution/ctree/<dataset>/{<patient>,<patient>/<sample>/}
{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>.rds
- RDS file containing inferred clone tree
{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>_plots.rds
- RDS file for clone tree plot
{<dataset>_<patient>,<dataset>_<patient>_<sample>}_REPORT_plots_ctree_<tool>.{rds,png,pdf}
- ctree report in RDS, PNG and PDF
Output directory: {publish_dir}/subclonal_deconvolution/ctree/<patient>/
ctree_<tool>.rds
- RDS file containing inferred clone tree
ctree_<tool>_plots.rds
- RDS file for single sample clone tree plot
ctree_input_pyclonevi.csv
- CSV file required for clone tree inference from pyclone
Unpublished results
Formatter
The Formatter subworkflow converts files to other formats and standardizes the output files produced by different mutation callers (Mutect2, Strelka) and CNA callers (ASCAT, Sequenza). Output files from this step are not published.
cna2CNAqc
This parser standardizes copy number calls and purity estimates from different callers into a single format.
Output files for all samples
Output directory: {work_dir}/formatter/cna2cnaqc/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_cna.rds
- RDS file containing parsed cna output in table format
vcf2cnaqc
This parser standardizes single nucleotide variants from different callers into a single format.
Output files for all samples
Output directory: {work_dir}/formatter/vcf2cnaqc/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>_snv.rds
- RDS file containing parsed vcf in table format
cnaqc2tsv
This parser converts the mutation data of the joint CNAqc analysis from CNAqc format (RDS file) into a tabular format (TSV file). This step is mandatory for running Python-based tools (e.g. PyClone-VI, SigProfiler).
Output files for all patients
Output directory: {work_dir}/formatter/cnaqc2tsv/<dataset>/<patient>/
<dataset>_<patient>_joint_table.tsv
- TSV file containing the joint CNA and variant table
Lifter
The Lifter subworkflow is an optional step that runs when single-sample VCF files are provided. When multiple samples from the same patient are available, the user can provide either a single joint VCF file containing variant calls from all tumor samples of the patient (see joint variant calling), or individual sample-specific VCF files. In the latter case, paths to the tumor BAM files must be provided so that all mutations can be collected across samples and each sample's private mutations can be piled up in all the other samples. Two intermediate steps, get_positions and mpileup, identify private mutations in all samples and retrieve their variant allele frequencies. Once private mutations are defined, they are merged back into the original VCF file during the join_positions step. The updated VCF file is then converted into a vcfR RDS object.
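The core idea of collecting positions across samples can be sketched with plain set operations: take the union of all variants called in a patient's samples, and for each sample list the variants that were called elsewhere but not in that sample, since these are the positions that need a pileup. This is a conceptual illustration only. The function name and the `(chrom, pos, ref, alt)` key are assumptions, not the pipeline's actual implementation.

```python
def missing_positions(calls_by_sample):
    """For each sample, list variants called in other samples but not in it.

    `calls_by_sample` maps sample id -> set of (chrom, pos, ref, alt) tuples.
    Returns sample id -> sorted list of variants to retrieve by pileup.
    """
    # Union of every variant seen in any sample of the patient.
    all_positions = set().union(*calls_by_sample.values())
    return {
        sample: sorted(all_positions - calls)
        for sample, calls in calls_by_sample.items()
    }
```

A variant present in every sample never appears in the result, while a variant private to one sample appears in the lists of all the other samples.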
mpileup
At this stage, bcftools is used to perform the pileup in order to retrieve frequency information of private mutations across all samples.
Output files for all samples
Output directory: {work_dir}/lifter/mpileup/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>.bcftools_stats.txt
- TXT file with statistics on the called mutations
<dataset>_<patient>_<sample>.vcf.gz
and <dataset>_<patient>_<sample>.vcf.gz.tbi
- VCF file and tabix index with called mutations
positions
This intermediate step retrieves private and shared mutations across samples originating from the same patient. Retrieved mutations are joined with the original mutations present in the input VCF, which is in turn converted into an RDS object using vcfR.
Output files for all samples
Output directory: {publish_dir}/lifter/positions/<dataset>/<patient>/<sample>/
<dataset>_<patient>_<sample>.pileup_VCF.rds
- RDS containing shared and private mutations
<dataset>_<patient>_<sample>.positions_missing
- TXT file containing mutations to be retrieved for a given sample
Output files for all patients
Output directory: {publish_dir}/lifter/positions/<dataset>/<patient>/
<dataset>_<patient>__all_positions.rds
- RDS containing shared and private mutations
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
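When troubleshooting, the trace file is often the quickest place to look. The sketch below lists tasks that did not finish successfully. It assumes the trace is tab-separated and contains the `name`, `status` and `exit` columns, which are part of Nextflow's default trace fields. A customised trace configuration may differ, and the function name is hypothetical.

```python
import csv
import io


def failed_tasks(trace_text: str) -> list[tuple[str, str, str]]:
    """Return (name, status, exit) for tasks not marked COMPLETED or CACHED.

    Expects the content of a tab-separated Nextflow trace file that includes
    the `name`, `status` and `exit` columns.
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"], row["exit"])
        for row in reader
        if row["status"] not in ("COMPLETED", "CACHED")
    ]
```

The input can be read with `Path("pipeline_info/execution_trace.txt").read_text()`, with the path adjusted to the results directory of the run.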