nf-core/dualrnaseq
Analysis of Dual RNA-seq data - an experimental method for interrogating host-pathogen interactions through simultaneous RNA-seq.
22.10.6
.
Learn more.
Output
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
FastQC
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per base sequence content (%T/A/G/C). Information about adapter contamination and other overrepresented sequences is also displayed.
Output directory for raw reads: results/fastqc
Output directory for trimmed reads: results/fastqc_after_trimming
Contents:
sample_fastqc.html
andsample_trimmed_fastqc.html
- FastQC reports,
sample_fastqc.html
andsample_trimmed_fastqc.html
contain quality metrics for your untrimmed raw and trimmed fastq files, respectively.
- FastQC reports,
zips/sample_fastqc.zip
andzips/sample_trimmed_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
For further reading and documentation see the FastQC help.
Trimming reads
Two software options are provided to remove low quality reads and adapters: Cutadapt and BBDuk.
All samples are processed in the output directory and appended with _trimmed.fastq.gz
Output directory: results/trimming
Mapping/quantification
Depending on which mapping/quantification mode was selected, will depend on which folders and files appear here.
A) Salmon selective alignment
Output directory: results/salmon
Contents:
transcripts_index
- All files produced by Salmon in the indexing phase.
- subfolders named as samples.
- All files and folders produced by Salmon in the quantification step of Selective Alignment mode. You can learn more about salmon outputs in
Salmon documentation
. Each subfolder also contains separated quantification results from the host and pathogen stored inhost_quant.sf
andpathogen_quant.sf
files, respectively.
- All files and folders produced by Salmon in the quantification step of Selective Alignment mode. You can learn more about salmon outputs in
combined_quant.tsv
- Tab delimited file containing combined quantification results of all samples processed by the pipeline.
host_quant_salmon.tsv
andpathogen_quant_salmon.tsv
- Tab delimited files containing combined quantification results for either host or pathogen from all samples.
host_combined_quant_annotations.tsv
andpathogen_combined_quant_annotations.tsv
- Tab delimited files containing quantification results for either host or pathogen and annotations extracted from gff files including transcript_id, transcript_name, gene id, gene_name and gene_type.
host_combined_gene_level.tsv
- Host gene-level estimates obtained using tximport.
host_combined_quant_gene_level_annotations.tsv
- Tab delimited file containing host gene-level estimates and annotations extracted from gff files including gene id, gene_name and gene_type.
B) STAR + Salmon - alignment based
Output directory: results/STAR_for_salmon
Contents:
index
- All files produced by STAR in the indexing phase.
- subfolders named as samples.
- Files produced by STAR in the alignment step. There are two bam files generated in this mapping mode: unsorted genome alignment
sampleAligned.out.bam
andsampleAligned.sample_toTranscriptome.out.bam
which contains alignments translated into transcript coordinates. SeeOutput in transcript coordinates
from theSTAR documentation.
- Files produced by STAR in the alignment step. There are two bam files generated in this mapping mode: unsorted genome alignment
Output directory: results/salmon_alignment_mode
Contents:
- subfolders named as samples
- All files and folders produced by Salmon in the quantification step of Selective Alignment mode. You can learn more about salmon outputs in
Salmon documentation
. Each subfolder also contains separated quantification results from the host and pathogen stored inhost_quant.sf
andpathogen_quant.sf
files, respectively.
- All files and folders produced by Salmon in the quantification step of Selective Alignment mode. You can learn more about salmon outputs in
combined_quant.tsv
- Tab delimited file containing combined quantification results of all samples processed by the pipeline.
host_quant_salmon.tsv
andpathogen_quant_salmon.tsv
- Tab delimited files containing combined quantification results for either host or pathogen from all samples.
host_combined_quant_annotations.tsv
andpathogen_combined_quant_annotations.tsv
- Tab delimited files containing quantification results for either host or pathogen and annotations extracted from gff files including transcript_id, transcript_name, gene id, gene_name and gene_type.
host_combined_gene_level.tsv
- Host gene-level estimates obtained using tximport.
host_combined_quant_gene_level_annotations.tsv
- Tab delimited file containing host gene-level estimates and annotations extracted from gff file including gene id, gene_name and gene_type.
C) STAR + HTSeq
Output directory: results/STAR
Contents:
index
- All files produced by STAR in the indexing phase.
- subfolders named as samples.
- All files produced by STAR in the alignment step.
sampleAligned.sortedByCoord.out.bam
contains genome alignment sorted by coordinates.
- All files produced by STAR in the alignment step.
multimapped_reads
- This folder contains both
sample_cross_mapped_reads.txt
file with a list of cross-mapped reads between host and pathogen andsample_no_crossmapped.bam
files with alignment without cross-mapped reads.
- This folder contains both
Output directory: results/HTSeq
Contents:
sample_count_u_m.txt
- Quantification results for a sample.
quantification_results_uniquely_mapped.tsv
- Tab delimited file containing combined quantification results of all samples processed by the pipeline.
quantification_results_uniquely_mapped_NumReads_TPM.tsv
- Tab delimited file containing HTSeq quantification results and TPM values estimated for each gene in each sample. To calculate TPM values, the length of genes is calculated using gff annotation.
quantification_stats_uniquely_mapped.tsv
- Statistics extracted from HTSeq quantification results.
host_quantification_uniquely_mapped_htseq.tsv
andpathogen_quantification_uniquely_mapped_htseq.tsv
- Tab delimited files containing combined quantification results for either host or pathogen from all samples.
host_combined_quant_annotations.tsv
andpathogen_combined_quant_annotations.tsv
- Tab delimited files containing quantification results for either host or pathogen and annotations extracted from gff files including gene_id, gene_name, gene_type and gene length.
References
The pipeline creates chimeric references of the pathogen and host, and here you can find references used for the mapping, quantification and collecting statistics.
Contents:
Salmon Selective Alignment references
host_pathogen.fasta
:- Chimeric genome fasta file used as a decoy sequence in Salmon with Selective Alignment. You can find more information about the Selective Alignment algorithm in salmon documentation.
decoys.txt
- List of decoy sequences generated by Salmon
*_parent_attribute.gff3
:- Host or pathogen gff file containing ‘parent’ insted of gene attribute defined by
--gene_attribute_gff_to_create_transcriptome_host
or--gene_attribute_gff_to_create_transcriptome_pathogen
, respectively. These files are used to extract annotations for each transcript.
- Host or pathogen gff file containing ‘parent’ insted of gene attribute defined by
*_transcriptome.fasta
- Either host or pathogen transcriptome fasta file created by the pipeline.
host_pathogen_transcriptome.fasta
- Merged host and pathogen transcriptomes.
gentrome.fasta
- Merged decoy sequences and transcriptome. File is generated by Salmon.
host_gff_annotations_annotations_salmon.tsv
andpathogen_gff_annotations_parent_salmon.tsv
- Annotations extracted from gff files for host or pathogen transcripts, respectively.
*_parent_attribute_quant_feature_salmon_alignment.gff3
- Host gff file containing ‘parent’ in place of genome attribute defined with
--gene_attribute_gff_to_create_transcriptome_host
andquant
replacing gene features defined by--gene_feature_gff_to_create_transcriptome_host
. The file is used to extract annotations for each transcript.
- Host gff file containing ‘parent’ in place of genome attribute defined with
Salmon Alignment-based mode references
host_pathogen.fasta
:- Chimeric genome fasta file used for read alignment with STAR.
*_parent_attribute.gff3
:- Host or pathogen gff file containing ‘parent’ insted of gene attribute defined by
--gene_attribute_gff_to_create_transcriptome_host
or--gene_attribute_gff_to_create_transcriptome_pathogen
, respectively. These files are used to extract annotations for each transcript.
- Host or pathogen gff file containing ‘parent’ insted of gene attribute defined by
*_transcriptome.fasta
- Either host or pathogen transcriptome fasta file created by the pipeline.
host_pathogen_transcriptome.fasta
- Merged host and pathogen transcriptomes.
host_gff_annotations_annotations_salmon.tsv
andpathogen_gff_annotations_parent_salmon.tsv
- Annotations extracted from gff files for host or pathogen transcripts, respectively.
*_parent_attribute_quant_feature_salmon_alignment.gff3
- Host gff file containing ‘parent’ in place of genome attribute defined with
--gene_attribute_gff_to_create_transcriptome_host
andquant
replacing gene features defined by--gene_feature_gff_to_create_transcriptome_host
. The file is used to extract annotations for each transcript.
- Host gff file containing ‘parent’ in place of genome attribute defined with
reference_host_names.txt
andreference_pathogen_names.txt
- List of either host or pathogen reference names defined in the 1st column of gff file. Files are used to identify uniquely mapped, cross-mapped and multi-mapped reads in alignments generated by Star.
STAR and HTSeq references
host_pathogen.fasta
:- Chimeric genome fasta file used for read alignment.
*_quant_feature.gff3
:- Either host or pathogen gff file containing
quant
in place of gene features defined by--gene_feature_gff_to_quantify_host
or--gene_feature_gff_to_quantify_pathogen
, respectively.
- Either host or pathogen gff file containing
*_quant_feature_new_attribute.gff3
- Pathogen gff file containing
quant
as gene feature and host gene attribute defined by--host_gff_atribute
in place of pathogen attribute specified with--pathogen_gff_atribute
.
- Pathogen gff file containing
host_pathogen_htseq.gff
- Chimeric gff file used for quantification.
*_htseq.tsv
- Annotations extracted from gff files for host or pathogen genes.
reference_host_names.txt
andreference_pathogen_names.txt
- List of either host or pathogen reference names defined in the 1st column of gff file. Files are used to identify uniquely mapped, cross-mapped and multi-mapped reads in alignments generated by Star.
Mapping statistics
Depending on which mapping/quantification mode was selected, will depend on which folders and files appear here. In general, files and images within these folders show the number of reads, mapping statistics for both the host and pathogen, RNA-class statistics, as well as gene and transcript-based metrics.
Output directory: results/mapping_statistics
Contents:
STAR: results/mapping_statistics/STAR
-
star_mapping_stats.tsv
- Tab delimited file containing mapping statistics collected from all samples.
-
mapping_stats_samples_total_reads.tsv
- Set of mapping and quantification statistics extracted from
star_mapping_stats.tsv
table used to createmapping_stats_samples_total_reads.pdf
plot.
- Set of mapping and quantification statistics extracted from
-
mapping_stats_samples_total_reads.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_total_reads.tsv
table.
- Visualisation of mapping statistics from
-
mapping_stats_samples_percentage.tsv
- Mapping statistics from
mapping_stats_samples_total_reads.tsv
table expressed in percentage.
- Mapping statistics from
-
mapping_stats_samples_percentage.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_percentage.tsv
table.
- Visualisation of mapping statistics from
HTSeq: results/mapping_statistics/HTSeq
-
scatter_plots
- Scatter plots showing correlations between TPM values of replicates within the same condition. The Pearson correlation coefficient is calculated using untransformed data.
-
htseq_uniquely_mapped_reads_stats.tsv
- Collection of mapping and quantification statistics including number of reads uniquely and multi-mapped to either host or pathogen using STAR, number of cross-mapped reads, unmapped reads, trimmed reads and number of assigned reads to the pathogen and host by HTSeq.
-
mapping_stats_samples_total_reads.tsv
- Set of mapping and quantification statistics extracted from
htseq_uniquely_mapped_reads_stats.tsv
table used to createmapping_stats_samples_total_reads.pdf
plot.
- Set of mapping and quantification statistics extracted from
-
mapping_stats_samples_total_reads.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_total_reads.tsv
table.
- Visualisation of mapping statistics from
-
mapping_stats_samples_percentage.tsv
- Mapping and quantification statistics from
mapping_stats_samples_total_reads.tsv
table expressed in percentage.
- Mapping and quantification statistics from
-
mapping_stats_samples_percentage.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_percentage.tsv
table.
- Visualisation of mapping statistics from
-
RNA_classes_pathogen
-
pathogen_RNA_classes_sum_counts_htseq.tsv
- Tab delimited file containing pathogen RNA class statistics for each sample. Sum of number of reads assigned to genes which belong to a specific RNA class.
-
pathogen_RNA_classes_percentage_htseq.tsv
- Pathogen RNA class statistics from
pathogen_RNA_classes_sum_counts_htseq.tsv
table expressed in percentage.
- Pathogen RNA class statistics from
-
RNA_class_stats_combined_pathogen.pdf
- Plot showing pathogen RNA class statistics for all samples from
pathogen_RNA_classes_percentage_htseq.tsv
table.
- Plot showing pathogen RNA class statistics for all samples from
-
sample.pdf
- Visualization of pathogen RNA classes statistics for a sample.
-
-
RNA_classes_host
-
host_RNA_classes_sum_counts_htseq.tsv
- Tab delimited file containing host RNA class statistics for each sample.
-
host_RNA_classes_percentage_htseq.tsv
- Host RNA class statistics from
host_RNA_classes_sum_counts_htseq.tsv
table expressed in percentage.
- Host RNA class statistics from
-
host_gene_types_groups_gene.tsv
- Tab delimited file containing list of host genes, including gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
--RNA_classes_to_replace_host
. For more information check parameters.md. This table is created only for examination of results.
- Tab delimited file containing list of host genes, including gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
-
RNA_class_stats_combined_host.pdf
- Visualization of host RNA class statistics for all samples from
host_RNA_classes_percentage_htseq.tsv
table.
- Visualization of host RNA class statistics for all samples from
-
sample.pdf
- Visualization of host RNA classes statistics for a sample.
-
Salmon: results/mapping_statistics/salmon
-
scatter_plots
- Scatter plots showing correlations between TPM values of replicates within the same condition. The Pearson correlation coefficient is calculated using untransformed data.
-
salmon_host_pathogen_total_reads.tsv
- Tab delimited file containing mapping statistics collected from all samples.
-
mapping_stats_samples_total_reads.tsv
- Set of mapping and quantification statistics extracted from
salmon_host_pathogen_total_reads.tsv
table used to createmapping_stats_samples_total_reads.pdf
plot.
- Set of mapping and quantification statistics extracted from
-
mapping_stats_samples_total_reads.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_total_reads.tsv
table.
- Visualisation of mapping statistics from
-
mapping_stats_samples_percentage.tsv
- Mapping statistics from
mapping_stats_samples_total_reads.tsv
table expressed in percentage.
- Mapping statistics from
-
mapping_stats_samples_percentage.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_percentage.tsv
table.
- Visualisation of mapping statistics from
-
RNA_classes_pathogen
-
pathogen_RNA_classes_sum_counts_salmon.tsv
- Tab delimited file containing pathogen RNA class statistics for each sample.
-
pathogen_RNA_classes_percentage_salmon.tsv
- Pathogen RNA class statistics from
pathogen_RNA_classes_sum_counts_salmon.tsv
table expressed in percentage.
- Pathogen RNA class statistics from
-
RNA_class_stats_combined_pathogen.pdf
- Plot showing pathogen RNA class statistics for all samples from
pathogen_RNA_classes_percentage_htseq.tsv
table.
- Plot showing pathogen RNA class statistics for all samples from
-
sample.pdf
- Visualization of pathogen RNA classes statistics for a sample.
-
-
RNA_classes_host
-
host_RNA_classes_sum_counts_salmon.tsv
- Tab delimited file containing host RNA class statistics for each sample.
-
host_RNA_classes_percentage_salmon.tsv
- Host RNA class statistics from
host_RNA_classes_sum_counts_salmonhead.tsv
table expressed in percentage.
- Host RNA class statistics from
-
host_gene_types_groups_transcript.tsv
- Tab delimited file containing list of host transcripts, including transcript_id, transcript_name, gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
--RNA_classes_to_replace_host
. For more information check parameters.md. This table is created only for examination of results.
- Tab delimited file containing list of host transcripts, including transcript_id, transcript_name, gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
-
RNA_class_stats_combined_host.pdf
- Visualization of host RNA class statistics for all samples from
host_RNA_classes_percentage_salmon.tsv
table.
- Visualization of host RNA class statistics for all samples from
-
sample.pdf
- Visualization of host RNA classes statistics for a sample.
-
Salmon alignment based: results/mapping_statistics/salmon_alignment_based
-
scatter_plots
- Scatter plots showing correlations between TPM values of replicates within the same condition. The Pearson correlation coefficient is calculated using untransformed data.
-
salmon_alignment_host_pathogen_total_reads.tsv
- Collection of mapping and quantification statistics including number of reads uniquely and multi-mapped to either host or pathogen transcriptome using STAR, unmapped reads, trimmed reads and number of assigned reads to the pathogen and host by HTSeq.
-
mapping_stats_samples_total_reads.tsv
- Set of mapping and quantification statistics extracted from
salmon_alignment_host_pathogen_total_reads.tsv
table used to createmapping_stats_samples_total_reads.pdf
plot.
- Set of mapping and quantification statistics extracted from
-
mapping_stats_samples_total_reads.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_total_reads.tsv
table.
- Visualisation of mapping statistics from
-
mapping_stats_samples_percentage.tsv
- Mapping statistics from
mapping_stats_samples_total_reads.tsv
table expressed in percentage.
- Mapping statistics from
-
mapping_stats_samples_percentage.pdf
- Visualisation of mapping statistics from
mapping_stats_samples_percentage.tsv
table.
- Visualisation of mapping statistics from
-
RNA_classes_pathogen
-
pathogen_RNA_classes_sum_counts_salmon.tsv
- Tab delimited file containing pathogen RNA class statistics for each sample.
-
pathogen_RNA_classes_percentage_salmon.tsv
- Pathogen RNA class statistics from
pathogen_RNA_classes_sum_counts_salmon.tsv
table expressed in percentage.
- Pathogen RNA class statistics from
-
RNA_class_stats_combined_pathogen.pdf
- Plot showing pathogen RNA class statistics for all samples from
pathogen_RNA_classes_percentage_htseq.tsv
table.
- Plot showing pathogen RNA class statistics for all samples from
-
sample.pdf
- Visualization of pathogen RNA classes statistics for a sample.
-
-
RNA_classes_host
-
host_RNA_classes_sum_counts_salmon.tsv
- Tab delimited file containing host RNA class statistics for each sample.
-
host_RNA_classes_percentage_salmon.tsv
- Host RNA class statistics from
host_RNA_classes_sum_counts_salmon.tsv
table expressed in percentage.
- Host RNA class statistics from
-
host_gene_types_groups_transcript.tsv
- Tab delimited file containing list of host transcripts, including transcript_id, transcript_name, gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
--RNA_classes_to_replace_host
. For more information check parameters.md. This table is created only for examination of results.
- Tab delimited file containing list of host transcripts, including transcript_id, transcript_name, gene ids, gene names, counts obtained from quantification, and gene types assigned to each gene considering RNA class groups defined by
-
RNA_class_stats_combined_host.pdf
- Visualization of host RNA class statistics for all samples from
host_RNA_classes_percentage_salmon.tsv
table.
- Visualization of host RNA class statistics for all samples from
-
sample.pdf
- Visualization of host RNA classes statistics for a sample.
-
MultiQC
MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability.
Output directory: results/MultiQC
Contents:
ProjectName_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
ProjectName_multiqc_report_data/
- Directory containing parsed statistics from the different tools used in the pipeline
multiqc_plots/
- Directory containing plots from within the .html report, saved as
PDF
,PNG
andSVG
- Directory containing plots from within the .html report, saved as
For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Output directory: results/pipeline_info
Contents:
-
pipeline_info/
-
Reports generated by Nextflow:
execution_report.html
,execution_timeline.html
,execution_trace.txt
andpipeline_dag.dot
/pipeline_dag.svg
. -
Reports generated by the pipeline:
pipeline_report.html
,pipeline_report.txt
andsoftware_versions.csv
. -
Documentation for interpretation of results in HTML format:
results_description.html
.
-