nf-core/rnafusion
RNA-seq analysis pipeline for detection of gene-fusions
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Download and build references - Build references needed to run the rest of the pipeline
- STAR - Alignment of FASTQ files
- Arriba - Arriba fusion detection
- STAR-fusion - STAR-fusion fusion detection
- StringTie - StringTie assembly
- FusionCatcher - FusionCatcher fusion detection
- CTAT-SPLICING - Detection and annotation of cancer splicing aberrations
- Samtools - SAM/BAM file manipulation
- Fusion-report - Summary of the findings of each tool and comparison to COSMIC, Mitelman, and FusionGDB2 databases
- FusionInspector - Supervised analysis of fusion predictions from fusion-report, recover and re-score evidence for such predictions
- Arriba visualisation - Arriba visualisation report for FusionInspector fusions
- Picard - Collect QC metrics
- FastQC - Raw read QC
- Salmon - Normalized gene expression calculation
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
Download and build references
Output reference files and folder structure
References directory structure
<genomes_base>/
<genome>/
arriba/
blacklist_hg38_GRCh38_v<gencode_version>.tsv.gz
cytobands_hg38_GRCh38_v<gencode_version>.tsv
known_fusions_hg38_GRCh38_v<gencode_version>.tsv.gz
protein_domains_hg38_GRCh38_v<gencode_version>.gff3
fusion_report_db/
cosmic.db
fusiongdb2.db
mitelman.db
gencode_v<gencode_version>/
fusioncatcher/
gencode/
Homo_sapiens.GRCh38.<gencode_version>.all.fa
Homo_sapiens.GRCh38.<gencode_version>.cdna.all.fa.gz
Homo_sapiens.GRCh38.<gencode_version>.gtf
Homo_sapiens.GRCh38.<gencode_version>.chr.gtf
Homo_sapiens.GRCh38.<gencode_version>.chr.gtf.refflat
Homo_sapiens.GRCh38.<gencode_version>.interval_list
salmon/
star/
starfusion/
ctat_genome_lib_build_dir/
hgnc/
hgnc_complete_set.txt
HGNC-DB-timestamp.txt
(Only files or folders used by the pipeline are mentioned explicitly.)
Main pipeline workflow
If no argument is specified here, the tool was used with default parameters.
Directory structure
{outdir}
├── agat
├── arriba
├── arriba_visualisation
├── ctatsplicing
├── fastp
├── fastqc
├── fastqc_for_fastp
├── fusioncatcher
├── fusioninspector
├── fusionreport
├── multiqc
├── picard
├── pipeline_info
├── salmon
├── star
├── starfusion
├── stringtie
└── vcf
.nextflow.log
If no parameters are specified, the default is applied.
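As a quick sanity check after a run, the directory listing above can be compared against what is actually on disk. The following is a minimal sketch, not part of the pipeline: the helper name `missing_output_dirs` is hypothetical. Some directories are only created when the corresponding tool is enabled, so a missing directory is not necessarily an error.

```python
from pathlib import Path

# Sub-directories named in the directory structure above.
EXPECTED_DIRS = [
    "agat", "arriba", "arriba_visualisation", "ctatsplicing", "fastp",
    "fastqc", "fastqc_for_fastp", "fusioncatcher", "fusioninspector",
    "fusionreport", "multiqc", "picard", "pipeline_info", "salmon",
    "star", "starfusion", "stringtie", "vcf",
]


def missing_output_dirs(outdir):
    """Return the expected sub-directories that are absent from outdir."""
    root = Path(outdir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "results" is a placeholder for the directory passed as the output directory.
    for name in missing_output_dirs("results"):
        print(f"not found: {name}")
```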
Agat
Agat is used to convert the GFF file to a TSV file, which is then used in the VCF creation.
Output files
agat/
<sample>.tsv
Arriba
Arriba is used to i) detect gene fusions and ii) create a PDF report of the fusions found (visualisation):
Detection
Output files
arriba/
<sample>.arriba.fusions.tsv
- contains the identified fusions
<sample>.arriba.fusions.discarded.tsv
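The fusion tables are tab-separated text, so they can be inspected with the standard library alone. The sketch below is a generic reader, assuming only that the first line is a header (a leading `#` on the first column name is stripped if present). The column names in the usage example (`gene1`, `gene2`) are illustrative and should be checked against the header of your own file.

```python
import csv
import io


def read_fusion_table(handle):
    """Yield one dict per row of a tab-separated table with a header line."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return
    # Drop a leading '#' on the first column name, if present.
    header[0] = header[0].lstrip("#")
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(header, row))


if __name__ == "__main__":
    # Inline stand-in for a real file; the columns are hypothetical.
    example = io.StringIO("#gene1\tgene2\nBCR\tABL1\n")
    for record in read_fusion_table(example):
        print(record["gene1"], record["gene2"])
```

For a real file, open it with `open(path, newline="")` and pass the handle to `read_fusion_table`.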
Visualisation
Output files
arriba_visualisation/
<sample>_combined_fusions_arriba_visualisation.pdf
The visualisation displays the fusions that fusioninspector outputs. That means that fusions from all callers are aggregated (by fusion-report) and then analyzed through fusioninspector (Note: FusionInspector contains a filtering step!).
CTAT-SPLICING
If --tools ctatsplicing is present, CTAT-SPLICING will detect cancer splicing aberrations.
Output files
ctatsplicing/
<sample>.cancer_intron_reads.sorted.bam
<sample>.cancer_intron_reads.sorted.bam.bai
<sample>.cancer.introns
<sample>.cancer.introns.prelim
<sample>.chckpts
<sample>.ctat-splicing.igv.html
<sample>.gene_reads.sorted.sifted.bam
<sample>.gene_reads.sorted.sifted.bam.bai
<sample>.igv.tracks
<sample>.introns
<sample>.introns.for_IGV.bed
CTAT-SPLICING detects and annotates aberrant splicing isoforms in cancer. It is run on the input files for arriba and/or starfusion.
Fastp
If --tools fastp is present, fastp will filter low-quality reads, trim low-quality bases at the 5’ and 3’ ends, and trim adapters (detected automatically, although they can also be supplied with the --adapter_fasta parameter). 3’ trimming is also possible via the --trim_tail parameter.
As fusioncatcher is especially sensitive to read length, with 100 bp being the recommended length, an additional parameter --trim_tail_fusioncatcher triggers an extra fastp process that trims the given number of bases from the 3’ end. These trimmed reads are then fed to fusioncatcher, while the reads without the extra trimming are still used for arriba and STAR-Fusion.
Output files
fastp/
<sample>_1.fastp.fastq.gz
<sample>_2.fastp.fastq.gz
<sample>.fastp.html
<sample>.fastp.json
<sample>.fastp.log
Fastp for fusioncatcher
If --trim_tail_fusioncatcher has any value other than 0, fastp will be run again as above. This allows for additional trimming of read tails before running FusionCatcher. For example, if reads are 150 bp, using --trim_tail_fusioncatcher 50 will shorten reads to 100 bp by trimming 50 bases from the 3′ end. 100 bp is the recommended read length to feed into FusionCatcher. The default for --trim_tail_fusioncatcher is 0 (no trimming).
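The arithmetic in the example above is simply the read length minus the recommended 100 bp. A minimal sketch follows; the helper name is hypothetical and not part of the pipeline.

```python
def trim_tail_for_fusioncatcher(read_length, target_length=100):
    """Number of 3' bases to trim so reads end up at target_length.

    Returns 0 when reads are already at or below the target, which
    matches the default of --trim_tail_fusioncatcher (no trimming).
    """
    if read_length <= 0 or target_length <= 0:
        raise ValueError("lengths must be positive")
    return max(0, read_length - target_length)


if __name__ == "__main__":
    # 150 bp reads need 50 bases trimmed to reach 100 bp.
    print(f"--trim_tail_fusioncatcher {trim_tail_for_fusioncatcher(150)}")
```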
Output files
fastp/
<sample>_trimmed_for_fusioncatcher_1.fastp.fastq.gz
<sample>_trimmed_for_fusioncatcher_2.fastp.fastq.gz
<sample>_trimmed_for_fusioncatcher.fastp.html
<sample>_trimmed_for_fusioncatcher.fastp.json
<sample>_trimmed_for_fusioncatcher.fastp.log
FastQC
Output files
fastqc/
*.html: FastQC report containing quality metrics.
*.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
FastQC for fastp
Output files
fastqc_for_fastp/
*.html: FastQC report containing quality metrics.
*.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
This directory contains FastQC reports for the fastp trimmed reads.
FusionCatcher
Output files
fusioncatcher
<sample>.fusion-genes.txt
<sample>.summary.txt
<sample>.log
FusionCatcher searches for novel and known somatic fusion genes, translocations, and chimeras in RNA-seq data. The --fusioncatcher_limitSjdbInsertNsj parameter can be used to modify limitSjdbInsertNsj.
FusionInspector
Output files
fusioninspector
<sample>/
chckpts_dir/
- contains checkpoints for FusionInspector
fi_workdir/
- the working directory for FusionInspector
FusionInspector.log
IGV_inputs/
- contains IGV input files
<sample>.FusionInspector.fusions.abridged.tsv
<sample>.FusionInspector.fusions.tsv
<sample>.fusion_inspector_web.html
- visualisation report described in detail here
FusionInspector performs a validation of fusion transcript predictions. The --fusioninspector_limitSjdbInsertNsj parameter can be used to set limitSjdbInsertNsj to anything other than the default 1000000.
Fusion-report
Please note that fusion-report is executed from fork https://github.com/Clinical-Genomics/fusion-report
Output files
fusionreport
` .fusionreport_filtered.tsv`
` _fusionreport_index.html`
- general report for all filtered fusions
` .fusionreport.tsv`
` .fusions.csv`
- index in csv format
` .fusions.json`
- index in json format
` _ .html`
- specific report for each filtered fusion
Fusion-report is a tool for parsing outputs from fusion detection tools. The Fusion Indication Index is explained here: https://github.com/Clinical-Genomics/fusion-report/blob/master/docs/score.md. Summary:
The weights for databases are as follows:
- COSMIC (50)
- MITELMAN (50)
- FusionGDB2 (0)
The Fusion Indication Index (FII) is calculated using two components:
- Tool Detection (50% of total FII)
  - Calculated as: (number of tools detecting the fusion) / (number of tools actually used)
  - This reflects how many of the active tools found the fusion
- Database Hits (50% of total FII)
  - Based on database matches using weights above
  - Calculated as: (number of database hits) / (total possible database hits)
Final score = (0.5 × Tool Detection Score) + (0.5 × Database Hits Score)
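The summary above can be sketched in a few lines of Python. This is an illustration of the formula as summarised here, not the fusion-report implementation. In particular, it assumes that "database hits" are weighted by the database weights listed above, so a hit in FusionGDB2 (weight 0) does not change the score. The function name and signature are hypothetical.

```python
# Database weights as listed in the summary above.
DB_WEIGHTS = {"COSMIC": 50, "MITELMAN": 50, "FusionGDB2": 0}


def fusion_indication_index(tools_detecting, tools_used, db_hits):
    """Combine tool detection and database hits into a score in [0, 1].

    tools_detecting: number of tools that reported the fusion
    tools_used:      number of tools actually run
    db_hits:         names of databases containing the fusion
    """
    if tools_used <= 0:
        raise ValueError("at least one tool must have been used")
    tool_score = tools_detecting / tools_used

    # Assumption: hits are weighted, then normalised by the total weight.
    total_weight = sum(DB_WEIGHTS.values())
    hit_weight = sum(DB_WEIGHTS[name] for name in set(db_hits))
    db_score = hit_weight / total_weight

    return 0.5 * tool_score + 0.5 * db_score


if __name__ == "__main__":
    # Found by 2 of 3 tools and present in COSMIC only.
    print(round(fusion_indication_index(2, 3, ["COSMIC"]), 3))
```

Under these assumptions, a fusion found by every tool and present in both COSMIC and Mitelman scores 1.0.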
MultiQC
The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
Picard
Output files
Picard CollectRnaMetrics and picard MarkDuplicates share the same output directory.
picard
<sample>.MarkDuplicates.metrics.txt
- metrics from MarkDuplicates
<sample>_rna_metrics.txt
- metrics from CollectRnaMetrics
<sample>_insert_size_metrics.txt.txt
- metrics from CollectInsertSizeMetrics
<sample>.bam
- BAM file with marked duplicates
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report_<timestamp>.html, execution_timeline_<timestamp>.html, execution_trace_<timestamp>.txt, pipeline_dag_<timestamp>.dot/pipeline_dag_<timestamp>.svg and manifest_<timestamp>.bco.json.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Parameters used by the pipeline run: params_<timestamp>.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Salmon
Output files
salmon
<sample>
Folder containing the quantification results
STAR
STAR is used to align FASTQ files to the genome reference.
Additionally, CRAM files can be created by passing the --cram option. The CRAM conversion is done with a combination of samtools view and samtools index.
Output files
Common
star
<sample>.Aligned.sortedByCoord.out.bam
<sample>.Aligned.sortedByCoord.out.bam.bai
<sample>.Aligned.sortedByCoord.out.cram
- when --cram is used
<sample>.Aligned.sortedByCoord.out.cram.crai
- when --cram is used
<sample>.Chimeric.out.junction
<sample>.Log.final.out
<sample>.Log.out
<sample>.Log.progress.out
<sample>.ReadsPerGene.out.tab
<sample>.SJ.out.tab
The STAR index is generated with --sjdbOverhang ${params.read_length - 1}; the default for params.read_length is 100.
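To make the relationship concrete, the overhang is the read length minus one, so the default read length of 100 gives an overhang of 99. A minimal sketch follows; the helper name is hypothetical.

```python
def sjdb_overhang(read_length=100):
    """Value passed to --sjdbOverhang for a given read length."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    return read_length - 1


if __name__ == "__main__":
    # Default read length of 100 bp.
    print(f"--sjdbOverhang {sjdb_overhang()}")
    # 150 bp reads.
    print(f"--sjdbOverhang {sjdb_overhang(150)}")
```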
STAR-fusion
Output files
starfusion
<sample>.starfusion.fusion_predictions.tsv
- contains the identified fusions
<sample>.starfusion.abridged.tsv
- contains the identified fusions abridged
StringTie
Output files
stringtie/<sample>/stringtie.merged.gtf
- merged gtf from annotation and stringtie output gtfs
Vcf_collect
Output files
vcf
<sample>_fusion_data.vcf
- contains the fusions in vcf format with collected statistics.
Vcf-collect takes as input the results of fusion-report and fusioninspector, so fusions from all tools are aggregated. Because fusioninspector applies a filter, some fusions detected by a caller may be filtered out by fusioninspector. In those cases, vcf-collect will still display the fusions, but a lot of data will be missing, as it is fusioninspector that performs the per-fusion analysis.
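One way to spot records with missing data is to scan the INFO column for empty values. The sketch below assumes the file follows the standard tab-separated VCF layout (eight fixed columns, INFO as semicolon-separated `key=value` pairs) and that missing values are written as `.`, the usual VCF convention. The INFO keys in the usage example are hypothetical and not taken from the pipeline.

```python
def records_with_missing_info(lines):
    """Return (record ID, INFO keys whose value is '.' or empty) per record."""
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        record_id, info = fields[2], fields[7]
        missing = []
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            if sep and value in ("", "."):
                missing.append(key)
        results.append((record_id, missing))
    return results


if __name__ == "__main__":
    # Hypothetical record: the INFO keys GENEA and FFPM are illustrative only.
    example = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr22\t100\tBCR--ABL1\tN\t<BND>\t.\tPASS\tGENEA=BCR;FFPM=.\n",
    ]
    for record_id, missing in records_with_missing_info(example):
        print(record_id, missing)
```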