nf-core/riboseq
Edit

Pipeline for the analysis of ribosome profiling, or Ribo-seq (also named ribosome footprinting) data.

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/riboseq

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report generated from the full-sized test dataset for the pipeline using a command similar to the one below:

nextflow run nf-core/riboseq -profile test_full,<docker/singularity/institute>

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

nf-core/riboseq: Output

Preprocessing

cat

Output files

preprocessing/fastq/
- *.merged.fastq.gz: If --save_merged_fastq is specified, concatenated FastQ files will be placed in this directory.

If multiple libraries/runs have been provided for the same sample in the input samplesheet (e.g. to increase sequencing depth) then these will be merged at the very beginning of the pipeline in order to have consistent sample naming throughout the pipeline. Please refer to the usage documentation to see how to specify these samples in the input samplesheet.

fq lint

Output files

fq_lint/*
- *.fq_lint.txt: Linting report per library from fq lint.

NB: You will see subdirectories here based on the stage of preprocessing for the files that have been linted, for example raw, trimmed.

fq lint runs several checks on input FASTQ files. It will fail with a non-zero error code when issues are found, which will terminate the workflow execution. In the absence of this, the successful linting produces the logs you will find here.

FastQC

Output files

preprocessing/fastqc/
- *_fastqc.html: FastQC report containing quality metrics.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

NB: The FastQC plots in this directory are generated relative to the raw, input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming please refer to the FastQC reports in the trimgalore/fastqc/ directory.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

MultiQC - FastQC sequence counts plot

MultiQC - FastQC mean quality scores plot

MultiQC - FastQC adapter content plot

UMI-tools extract

Output files

umitools/
- *.fastq.gz: If --save_umi_intermeds is specified, FastQ files after UMI extraction will be placed in this directory.
- *.log: Log file generated by the UMI-tools extract command.

UMI-tools and UMICollapse deduplicate reads based on unique molecular identifiers (UMIs) to address PCR-bias. Firstly, the UMI-tools extract command removes the UMI barcode information from the read sequence and adds it to the read name. Secondly, reads are deduplicated based on UMI identifier after mapping as highlighted in the UMI dedup section.

To facilitate processing of input data which has the UMI barcode already embedded in the read name from the start, --skip_umi_extract can be specified in conjunction with --with_umi.

TrimGalore

Output files

preprocessing/trimgalore/
- *.fq.gz: If --save_trimmed is specified, FastQ files after adapter trimming will be placed in this directory.
- *_trimming_report.txt: Log file generated by Trim Galore!.
preprocessing/trimgalore/fastqc/
- *_fastqc.html: FastQC report containing quality metrics for read 1 (and read2 if paired-end) after adapter trimming.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

Trim Galore! is a wrapper tool around Cutadapt and FastQC to peform quality and adapter trimming on FastQ files. Trim Galore! will automatically detect and trim the appropriate adapter sequence. It is the default trimming tool used by this pipeline, however you can use fastp instead by specifying the --trimmer fastp parameter. You can specify additional options for Trim Galore! via the --extra_trimgalore_args parameters.

NB: TrimGalore! will only run using multiple cores if you are able to use more than > 5 and > 6 CPUs for single- and paired-end data, respectively. The total cores available to TrimGalore! will also be capped at 4 (7 and 8 CPUs in total for single- and paired-end data, respectively) because there is no longer a run-time benefit. See release notes and discussion whilst adding this logic to the nf-core/atacseq pipeline.

MultiQC - cutadapt trimmed sequence length plot

fastp

Output files

preprocessing/fastp/
- *.fastq.gz: If --save_trimmed is specified, FastQ files after adapter trimming will be placed in this directory.
- *.fastp.html: Trimming report in html format.
- *.fastp.json: Trimming report in json format.
preprocessing/fastp/log/
- *.fastp.log: Trimming log file.
preprocessing/fastp/fastqc/
- *_fastqc.html: FastQC report containing quality metrics for read 1 (and read2 if paired-end) after adapter trimming.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

fastp is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It has been developed in C++ with multithreading support to achieve higher performance. fastp can be used in this pipeline for standard adapter trimming and quality filtering by setting the --trimmer fastp parameter. You can specify additional options for fastp via the --extra_fastp_args parameter.

MultiQC - fastp filtered reads plot

BBSplit

Output files

preprocessing/bbsplit/
- *.fastq.gz: If --save_bbsplit_reads is specified FastQ files split by reference will be saved to the results directory. Reads from the main reference genome will be named “primary.fastq.gz”. Reads from contaminating genomes will be named “<SHORT_NAME>.fastq.gz” where <SHORT_NAME> is the first column in --bbsplit_fasta_list that needs to be provided to initially build the index.
- *.txt: File containing statistics on how many reads were assigned to each reference.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special “ambiguous” file for each of them.

This functionality would be especially useful, for example, if you have mouse PDX samples that contain a mixture of human and mouse genomic DNA/RNA and you would like to filter out any mouse derived reads.

The BBSplit index will have to be built at least once with this pipeline by providing --bbsplit_fasta_list which has to be a file containing 2 columns: short name and full path to reference genome(s):

mm10,/path/to/mm10.fa
ecoli,/path/to/ecoli.fa
sarscov2,/path/to/sarscov2.fa

You can save the index by using the --save_reference parameter and then provide it via --bbsplit_index for future runs. As described in the Output files dropdown box above the FastQ files relative to the main reference genome will always be called *primary*.fastq.gz.

rRNA removal

When --remove_ribo_rna is specified (default: true), the pipeline removes ribosomal RNA reads using one of three tools, selected via the --ribo_removal_tool parameter.

SortMeRNA (default)

Output files

preprocessing/sortmerna/
- *.sortmerna.log: Log file generated by SortMeRNA with information regarding reads that matched the reference database(s).
- *.fastq.gz: If --save_non_ribo_reads is specified, FastQ files containing non-rRNA reads will be placed in this directory.

When --ribo_removal_tool sortmerna is used (the default), the pipeline uses SortMeRNA for the removal of ribosomal RNA. SortMeRNA uses k-mer matching against rRNA databases. By default, rRNA databases defined in the SortMeRNA GitHub repo are used. You can see an example in the pipeline GitHub repository in assets/rrna-db-defaults.txt which is used by default via the --ribo_database_manifest parameter. Please note that commercial/non-academic entities require licensing for SILVA for these default databases.

MultiQC - SortMeRNA hit count plot

Bowtie2

Output files

preprocessing/bowtie2_rrna/
- *.bowtie2.log: Log file generated by Bowtie2 with alignment statistics.
- *.fastq.gz: If --save_non_ribo_reads is specified, FastQ files containing non-rRNA reads will be placed in this directory.

When --ribo_removal_tool bowtie2 is specified, the pipeline uses Bowtie2 for alignment-based rRNA removal. Reads are aligned against rRNA reference sequences specified via --ribo_database_manifest, and reads that align to rRNA are filtered out. The unaligned reads (non-rRNA) are kept for downstream analysis.

RiboDetector

Warning

RiboDetector has known issues with ONNX multiprocessing that can cause hangs in containerized environments. We recommend using SortMeRNA or Bowtie2 instead. See hzi-bifo/RiboDetector#61 for details.

Output files

preprocessing/ribodetector/
- *.log: Log file generated by RiboDetector.
- *.fastq.gz: If --save_non_ribo_reads is specified, FastQ files containing non-rRNA reads will be placed in this directory.

When --ribo_removal_tool ribodetector is specified, the pipeline uses RiboDetector for machine learning-based rRNA detection. RiboDetector uses a pre-trained neural network to identify rRNA reads without requiring a reference database. This makes it particularly useful for organisms that lack well-characterized rRNA sequences, or when you want to avoid database licensing requirements. The tool automatically determines the appropriate read length from your data.

Alignment

STAR

Output files

alignment/star/original
- *.Aligned.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
- *.Aligned.toTranscriptome.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.
alignment/star/sorted
- *.genome.sorted.bam: Coordinate-sorted genome BAM file for each sample
- *.genome.sorted.bam.bai: Index for coordinate-sorted genome BAM file for each sample
- *.transcriptome.sorted.bam: Coordinate-sorted transcriptome BAM file for each sample
- *.transcriptome.sorted.bam.bai: Index for coordinate-sorted transcriptome BAM file for each sample
alignment/star/log/
- *.SJ.out.tab: File containing filtered splice junctions detected after mapping the reads.
- *.Log.final.out: STAR alignment report containing the mapping results summary.
- *.Log.out and *.Log.progress.out: STAR log files containing detailed information about the run. Typically only useful for debugging purposes.
alignment/star/unmapped/
- *.fastq.gz: If --save_unaligned is specified, FastQ files containing unmapped reads will be placed in this directory.
alignment/star/sorted/samtools_stats/
- SAMtools *.sorted.bam.flagstat, *.sorted.bam.idxstats and *.sorted.bam.stats files generated from the sorted alignment files.

STAR is a read aligner designed for splice aware mapping typical of RNA sequencing data. STAR stands for Spliced Transcripts Alignment to a Reference, and has been shown to have high accuracy and outperforms other aligners by more than a factor of 50 in mapping speed, but it is memory intensive. Using --aligner star_salmon is the default alignment and quantification option.

The STAR section of the MultiQC report shows a bar plot with alignment rates: good samples should have most reads as Uniquely mapped and few Unmapped reads.

MultiQC - STAR alignment scores plot

The original BAM files generated by the selected alignment algorithm are further processed with SAMtools to sort them by coordinate, for indexing, as well as to generate read mapping statistics.

MultiQC - SAMtools alignment scores plot

MultiQC - SAMtools mapped reads per contig plot

UMI dedup

Output files

alignment/star/deduplicated
- <SAMPLE>.umi_dedup.sorted.bam: If --save_umi_intermeds is specified the UMI deduplicated, coordinate sorted BAM file containing read alignments will be placed in this directory.
- <SAMPLE>.umi_dedup.sorted.bam.bai: If --save_umi_intermeds is specified the BAI index file for the UMI deduplicated, coordinate sorted BAM file will be placed in this directory.
- <SAMPLE>.umi_dedup.sorted.bam.csi: If --save_umi_intermeds --bam_csi_index is specified the CSI index file for the UMI deduplicated, coordinate sorted BAM file will be placed in this directory.
alignment/star/deduplicated/log/
- *_edit_distance.tsv: Reports the (binned) average edit distance between the UMIs at each position (UMI-tools only).
- *_per_umi.tsv: UMI-level summary statistics (UMI-tools only).
- *_per_umi_per_position.tsv: Tabulates the counts for unique combinations of UMI and position (UMI-tools only).
- *.log: log from UMI deduplication tool.

The content of the files above is explained in more detail in the UMI-tools documentation.

After extracting the UMI information from the read sequence (see UMI-tools extract), the second step in the removal of UMI barcodes involves deduplicating the reads based on both mapping and UMI barcode information. UMI deduplication can be carried out either with UMI-tools or UMICollapse, set via the umi_dedup_tool parameter. The output BAM files are the same, though UMI-tools has some additional outputs, as described above. Either method will generate a filtered BAM file after the removal of PCR duplicates.

Riboseq-specific QC

Read distribution metrics around annotated protein coding regions or based on alignments alone, plus related metrics.

Ribo-TISH quality

Output files

riboseq_qc/ribotish/
- *_qual.txt: text-format data on read distribution around annotated protein coding regions on user provided transcripts
- *_qual.pdf: PDF-format representation of read distribution around annotated protein coding regions on user provided transcripts
- *.para.py: P-site offsets for different reads lengths in python code dict format

Ribo-TISH quality control analyzes the distribution of ribosome-protected fragment (RPF) reads around annotated coding sequences. Key metrics include read length distribution and reading frame periodicity, which are essential for assessing ribosome profiling data quality.

MultiQC - Ribo-TISH read length distribution

MultiQC - Ribo-TISH frame proportions

Ribotricer detect-orfs QC outputs

Output files

riboseq_qc/ribotricer/
- *_read_length_dist.pdf: PDF-format read length distribution as quality control
- *_metagene_plots.pdf: Metagene plots for quality control
- *_protocol.txt: txt file containing inferred protocol if it was inferred (not supplied as input)
- *_metagene_profiles_5p.tsv: Metagene profile aligning with the start codon
- *_metagene_profiles_3p.tsv: Metagene profile aligning with the stop codon

ORF predictions

Ribo-TISH predict

Output files

orf_predictions/ribotish/
- *_pred.txt thresholded ORF predictions from Ribo-TISH
- *_all.txt unthresholded ORF predictions from Ribo-TISH
- *_transprofile.py RPF P-site profile for each transcript from Ribo-TISH
orf_predictions/ribotish_all/
- allsamples_pred.txt thresholded ORF predictions from Ribo-TISH ran over all samples at once
- allsamples_all.txt unthresholded ORF predictions from Ribo-TISH ran over all samples at once
- allsamples_transprofile.py RPF P-site profile for each transcript from Ribo-TISH ran over all samples at once

Ribotricer detect-orfs

Output files

orf_predictions/ribotricer/
- *_translating_ORFs.tsv TSV with ORFs assessed as translating in the assocciated BAM file
- *_psite_offsets.txt: If the P-site offsets are not provided, txt file containing the derived relative offsets

P-site identification

riboWaltz

P-sites are identified by passing transcriptome-level alignment BAM files to riboWaltz, producing the following outputs:

Output files

ribowaltz/
- *.best_offset.txt: Text file with the extremity used for the offset correction step and the best offset for each sample
- *.psite_offset.tsv: TSV file containing P-site offsets for each read length
- *.psite.tsv: TSV file containing P-site transcriptomic coordinates and information for each alignment
- *.codon_coverage_rpf.tsv: TSV file with codon-level RPF coverage for each transcript
- *.codon_coverage_psite.tsv: TSV file with codon-level P-site coverage for each transcript
- *.cds_coverage_psite.tsv: TSV file with CDS P-site in-frame counts for each transcript
- *nt_coverage_psite.tsv: TSV file with CDS P-site in-frame counts for each transcript, excluding P-sites within defined distances to start and stop codons (defined by passing --exclude_start and --exclude_stop with the number of nucleotides)
ribowaltz/offset_plot/<SAMPLE>/
- *.pdf: P-site offset plots for each read length
ribowaltz/ribowaltz_qc/
- *.length_distribution.pdf: Distribution of read lengths per sample
- *.ends_heatmap.pdf: Meta-heatmaps of read extremities around start and stop codons for each read length
- *.length_bins_for_psite.pdf: Distribution of read lengths used to derive P-site offsets per sample
- *.psite_region.pdf: P-site region distribution per sample
- *.metaprofile_psite.pdf: P-site metaprofiles around start and stop codons per sample
- *.frames.pdf: P-site frame distribution per sample
- *.frames_stratified.pdf: P-site frame distribution for each read length per sample
- *.codon_usage.pdf: Codon usage index per sample

Quantification

Quantification is done by passing transcriptome-level alignment BAM files to Salmon, producing the following outputs:

Output files

quantification/
- tx2gene.tsv: Tab-delimited file containing gene to transcripts ids mappings.
quantification/salmon/
- salmon.merged.gene_counts.tsv`: Matrix of gene-level raw counts across all samples.
- salmon.merged.gene_tpm.tsv`: Matrix of gene-level TPM values across all samples.
- salmon.merged.gene.rds: RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated counts (counts) and transcript length (length`) in the assays slot for genes.
- salmon.merged.gene_counts_scaled.tsv`: Matrix of gene-level library size-scaled counts across all samples.
- salmon.merged.gene__scaled.rds: RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated library size-scaled counts (counts) and transcript length (length`) in the assays slot for genes.
- salmon.merged.gene_counts_length_scaled.tsv`: Matrix of gene-level length-scaled counts across all samples.
- salmon.merged.gene_length_scaled.rds: RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated length-scaled counts (counts) and transcript length (length`) in the assays slot for genes.
- salmon.merged.transcript_counts.tsv`: Matrix of isoform-level raw counts across all samples.
- salmon.merged.transcript_tpm.tsv`: Matrix of isoform-level TPM values across all samples.
- salmon.merged.transcript.rds: RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated isoform-level raw counts (counts) and transcript length (length`) in the assays slot for transcripts.

Raw outputs from Salmon are available for each sample:

Output files

quantification/salmon/<SAMPLE>/
- aux_info/: Auxiliary info e.g. versions and number of mapped reads.
- cmd_info.json: Information about the Salmon quantification command, version and options.
- lib_format_counts.json: Number of fragments assigned, unassigned and incompatible.
- libParams/: Contains the file flenDist.txt for the fragment length distribution.
- logs/: Contains the file salmon_quant.log giving a record of Salmon’s quantification.
- quant.genes.sf: Salmon gene-level quantification of the sample, including feature length, effective length, TPM, and number of reads.
- quant.sf: Salmon transcript-level quantification of the sample, including feature length, effective length, TPM, and number of reads.

Translational efficiency

The pipeline supports two methods for translational efficiency analysis: anota2seq (default) and deltaTE. The method used depends on the --translational_efficiency_method parameter.

anota2seq outputs

anota2seq produces the following outputs:

Output files

translational_efficiency/anota2seq
- *.total_mRNA.anota2seq.results.tsv: anota2seq results for the ‘total mRNA’ analysis, describing differences in RNA levels across conditions for RNA-seq samples. See https://rdrr.io/bioc/anota2seq/man/anota2seqGetOutput.html for description of output columns.
- *.translated_mRNA.anota2seq.results.tsv: anota2seq results for the ‘translated mRNA’ analysis, describing differences in RNA levels across conditions for Ribo-seq samples. See https://rdrr.io/bioc/anota2seq/man/anota2seqGetOutput.html for description of output columns.
- *.mRNA_abundance.anota2seq.results.tsv: anota2seq results for the ‘mRNA abundance’ analysis, describing changes across conditions consistent between total mRNA and translated RNA (RNA-seq and Ribo-seq samples). See https://rdrr.io/bioc/anota2seq/man/anota2seqGetOutput.html for description of output columns.
- *.buffering.anota2seq.results.tsv: anota2seq results for the ‘buffering’ analysis, describing stable levels of translated RNA (from Ribo-seq samples) across conditions, despite changes in total mRNA. See https://rdrr.io/bioc/anota2seq/man/anota2seqGetOutput.html for description of output columns.
- *.translation.anota2seq.results.tsv: anota2seq results for the ‘translation’ analysis, describing differences in translation across conditions, being differences in translated RNA levels not explained by total RNA levels. See https://rdrr.io/bioc/anota2seq/man/anota2seqGetOutput.html for description of output columns.
- *.fold_change.png: A fold change plot in PNG format, from anota2seq’s anota2seqPlotFC() method.
- *.interaction_p_distribution.pdf: The distribution of p-values and adjusted p-values for the omnibus interaction (both using densities and histograms). The second page of the pdf displays the same plots but for the RVM statistics if RVM is used.
- *.residual_distribution_summary.jpeg: Summary plot for assessing normal distribution of regression residuals.
- *.residual_vs_fitted.jpeg: QC plot showing residuals against fitted values.
- *.rvm_fit_for_all_contrasts_group.jpg: QC plot showing the CDF of variance (theoretical vs empirical), all contrasts.
- *.rvm_fit_for_interactions.jpg: QC plot showing the CDF of variance (theoretical vs empirical), for interactions.
- *.rvm_fit_for_omnibus_group.jpg: QC plot showing the CDF of variance (theoretical vs empirical), for omnibus group.
- *.simulated_vs_obt_dfbetas_without_interaction.pdf: Bar graphs of the frequencies of outlier dfbetas using different dfbetas thresholds.
- *.Anota2seqDataSet.rds: Serialised Anota2seqDataSet object.
- *.R_sessionInfo.log: dump of R SessionInfo

deltaTE outputs

When using --translational_efficiency_method deltate, the pipeline produces the following outputs:

Output files

translational_efficiency/deltate
- *.total_mRNA.deltate.results.tsv: DESeq2 results for RNA-seq analysis, describing differences in total mRNA levels across conditions.
- *.translated_mRNA.deltate.results.tsv: DESeq2 results for Ribo-seq analysis, describing differences in ribosome-protected RNA levels across conditions.
- *.translation.deltate.results.tsv: DESeq2 results for the interaction term, describing changes in translational efficiency across conditions.
- *.dtegs.deltate.genes.tsv: Gene list of all differentially translated genes (DTEGs) with significant changes in translational efficiency.
- *.mRNA_abundance.deltate.genes.tsv: Gene list of genes showing coordinated changes between RNA-seq and Ribo-seq without net translational efficiency changes (equivalent to anota2seq mRNA abundance).
- *.translation.deltate.genes.tsv: Gene list of genes with pure translational regulation - Ribo-seq changes without RNA-seq changes (equivalent to anota2seq translation category).
- *.buffering.deltate.genes.tsv: Gene list of genes with buffering effects where translation dampens RNA changes (equivalent to anota2seq buffering).
- *.intensified.deltate.genes.tsv: Gene list of genes where translation amplifies RNA changes in the same direction (deltaTE-specific category).
- *.fold_change.png: Scatter plot showing RNA-seq vs Ribo-seq log2 fold changes, colored by gene classification (harmonized with anota2seq visualization).
- *.interaction_p_distribution.png: Histogram of interaction term adjusted p-values for assessing translational regulation significance patterns.
- *.pca_ribo.png: PCA plot for Ribo-seq samples.
- *.pca_rna.png: PCA plot for RNA-seq samples.
- *.heatmap.png: Heatmap of top differentially translated genes (DTEGs).
- *.DESeqDataSet.rds: Serialised DESeqDataSet object containing analysis results.
- *.R_sessionInfo.log: R session information for reproducibility.

Fold change plots

Both methods produce fold change plots that visualize the relationship between RNA-seq and Ribo-seq changes. The examples below were generated using the test_full profile on the same dataset, allowing direct comparison between methods.

anota2seq fold change plot:

anota2seq - fold change plot

The anota2seq fold change plot shows total mRNA log2FC on the X-axis and translated mRNA log2FC on the Y-axis, with genes colored by their regulatory classification. The solid diagonal line represents equal changes in both data types.

In the test dataset: The plot reveals 623 genes with significant translational regulation across several categories:

Translation down (357) / up (225): Genes showing pure translational regulation without corresponding RNA changes, appearing as vertical deviations from the diagonal
mRNA abundance down (16) / up (25): Genes with coordinated RNA and Ribo-seq changes, appearing along the diagonal
Buffered (mRNA down) (0) / (mRNA up) (0): Genes where translation dampens RNA changes (none detected in this dataset)

The predominance of translation-regulated genes (582 of 623) indicates that translational control is the dominant regulatory mechanism in this experimental system.

deltaTE scatter plot:

deltaTE - fold change plot

The deltaTE method produces a similar visualization showing total mRNA log2FC (RNA-seq) on the X-axis and translated mRNA log2FC (Ribo-seq) on the Y-axis, with genes colored by their classification. The diagonal dashed line represents equal changes in both data types.

In the test dataset: The plot reveals 1,694 DTEGs across several categories:

Translation down (895) / up (710): Genes showing translational regulation without corresponding RNA changes, appearing as vertical deviations from the diagonal along the y-axis
mRNA abundance down (4) / up (30): Genes with coordinated RNA and Ribo-seq changes, appearing along the diagonal
Buffering down (14) / up (16): Genes where translation dampens RNA changes, appearing below the diagonal
Intensified (25): Genes where translation amplifies RNA changes, appearing above the diagonal

The striking vertical distribution of points (translation category) demonstrates that translational regulation is the dominant mode of gene expression control in this experimental system, with most changes occurring in Ribo-seq independently of RNA-seq.

Diagnostic plots

Both methods provide diagnostic plots for quality control and model assessment:

anota2seq diagnostic plots

anota2seq produces several QC plots for assessing model fit and statistical assumptions.

Residual distribution summary:

anota2seq - residual distribution summary

This plot summarizes the distribution of regression residuals across quantiles of the standard normal distribution. It helps assess whether the residuals follow the expected normal distribution, which is an assumption of the linear model.

In the test dataset: The obtained outlier rate (1.588%) is close to the expected 1%, indicating the model assumptions are reasonably well met.

Residuals vs fitted values:

anota2seq - residuals vs fitted

This classic diagnostic plot shows residuals plotted against fitted values. Ideally, residuals should be randomly scattered around zero with constant variance (homoscedasticity).

In the test dataset: The residuals show a characteristic funnel shape with greater spread at lower fitted values, which is typical for count-based expression data. This heteroscedasticity is expected and accounted for by the RVM approach.

RVM fit for interactions:

anota2seq - RVM fit for interactions

The Random Variance Model (RVM) fit plot shows how well the empirical variance distribution matches the theoretical F-distribution. The left panel shows a Q-Q plot of empirical vs theoretical quantiles, and the right panel shows the cumulative distribution functions with a Kolmogorov-Smirnov test p-value.

In the test dataset: The KS p-value of 4.24e-09 indicates significant deviation from the theoretical distribution, suggesting the variance structure in this dataset differs from standard assumptions. This is not uncommon in real experimental data and the RVM approach helps account for such deviations.

deltaTE visualizations

The deltaTE method produces diagnostic plots for quality control and interpretation:

Fold change plot: The main visualization showing RNA-seq vs Ribo-seq log2 fold changes, with genes colored by their deltaTE classification (mRNA_abundance, translation, buffering, intensified). This provides direct visual comparison with anota2seq’s equivalent plot.
Interaction p-value distribution: Histogram showing the distribution of adjusted p-values from the DESeq2 interaction term analysis.
PCA plots: Principal component analysis plots for both Ribo-seq and RNA-seq samples to assess sample relationships and identify potential batch effects.
Heatmap: Clustered heatmap of the most significantly regulated genes, showing expression patterns across samples.

PCA plots:

deltaTE - Ribo-seq PCA

deltaTE - RNA-seq PCA

PCA plots help assess sample clustering and identify potential batch effects or outliers. In well-designed experiments, samples should cluster primarily by treatment condition rather than technical factors.

In the test dataset: Both PCA plots show clear separation between control (coral) and treated (teal) samples along PC1, which captures 57% of variance for Ribo-seq and 39% for RNA-seq. This indicates that the treatment effect is the dominant source of variation in both data types. The treated samples cluster together on the right side of PC1 in both plots, while control samples are distributed on the left, with one control sample (SRX11780879 in RNA-seq, SRX11780886 in Ribo-seq) showing some separation from other controls along PC2.

Expression heatmap:

deltaTE - Expression heatmap

This clustered heatmap shows the expression patterns of the top 50 differentially translated genes (DTEGs) across all samples, with rows representing genes and columns representing samples. The color scale indicates normalized expression levels (Z-scores). Samples are annotated by condition (control/treated) and sequencing type (rnaseq/riboseq).

In the test dataset: The heatmap reveals distinct expression patterns between control and treated samples. Samples cluster hierarchically first by condition (control vs treated), with Ribo-seq control samples on the far left showing predominantly high expression (red), and treated samples showing lower expression (blue/purple) for most genes. The top 50 DTEGs show strong coordinated downregulation in treated samples compared to controls, particularly evident in the Ribo-seq data, reflecting the predominance of translation-down regulation in this dataset.

Interpreting deltaTE diagnostic plots

Interaction p-value distribution:

deltaTE - interaction p-value distribution

This histogram shows the distribution of adjusted p-values from the interaction term in the DESeq2 model, which tests for differences in translational efficiency between conditions. For well-powered experiments with genuine translational regulation:

You should see a mixture of p-values with enrichment near 0 (significant genes) and relatively uniform distribution across higher p-values
A completely uniform distribution suggests no translational regulation effects
Heavy skewing toward 1.0 may indicate underpowered analysis or technical issues

In the test dataset: The distribution shows strong enrichment of small adjusted p-values (large peak near 0), indicating widespread translational regulation in this experimental system. The dashed red line marks the significance threshold (α = 0.05). The analysis identified 1,694 differentially translated genes (DTEGs), with translational regulation being the dominant mode of gene expression control.

The deltaTE approach focuses on the key fold change visualization that directly shows the relationship between RNA and ribosome changes, making it easy to interpret translational regulation patterns. The intensified category (genes where translation amplifies RNA changes in the same direction) is unique to deltaTE and provides additional biological insights not available in anota2seq.

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools i.e. FastQC, Cutadapt, SortMeRNA, STAR, RSEM, HISAT2, Salmon, SAMtools, Picard, RSeQC, Qualimap, Preseq and featureCounts. Additionally, various custom content has been added to the report to assess the output of dupRadar, DESeq2 and featureCounts biotypes, and to highlight samples failing a mimimum mapping threshold or those that failed to match the strand-specificity provided in the input samplesheet. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Workflow reporting and genomes

Reference genome files

Output files

genome/
- *.fa, *.gtf, *.gff, *.bed, .tsv: If the --save_reference parameter is provided then all of the genome reference files will be placed in this directory.
genome/index/
- star/: Directory containing STAR indices.
- hisat2/: Directory containing HISAT2 indices.

A number of genome-specific files are generated by the pipeline because they are required for the downstream processing of the results. If the --save_reference parameter is provided then these will be saved in the genome/ directory. It is recommended to use the --save_reference parameter if you are using the pipeline to build new indices so that you can save them somewhere locally. The index building step can be quite a time-consuming process and it permits their reuse for future runs of the pipeline to save disk space.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameter’s are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

nf-core/riboseq Edit

Introduction

Pipeline overview

Preprocessing

cat

fq lint

FastQC

UMI-tools extract

TrimGalore

fastp

BBSplit

rRNA removal

SortMeRNA (default)

Bowtie2

RiboDetector

Alignment

STAR

UMI dedup

Riboseq-specific QC

Ribo-TISH quality

Ribotricer detect-orfs QC outputs

ORF predictions

Ribo-TISH predict

Ribotricer detect-orfs

P-site identification

riboWaltz

Quantification

Translational efficiency

anota2seq outputs

deltaTE outputs

Fold change plots

Diagnostic plots

anota2seq diagnostic plots

deltaTE visualizations

Interpreting deltaTE diagnostic plots

MultiQC

Workflow reporting and genomes

Reference genome files

Pipeline information

nf-core/riboseq
Edit