nf-core/nascent
Nascent Transcription Processing Pipeline
Version 1.0. The latest stable release is 2.2.0.
Output
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- fastq-dump - if needed, extract the fastq file[s] from a sample
- SeqKit/bbduk - flip reads (experiment specific) & trim reads for adapters/quality/length
- FastQC - read quality control
- MultiQC - aggregate report, describing results of the whole pipeline
- HISAT2 - map reads to the reference genome
- Samtools - convert the mapped reads from SAM to BAM format
- Preseq - estimate complexity of the sample
- RSeQC - analyze read distributions, infer experiment (SE/PE, whether reads need to be flipped), & read duplication
- BBMap - analyze coverage
- bedtools - create both normalized and non-normalized coverage files in bedGraph format
- igvtools - create compressed files to visualize the sample in the Integrative Genomics Viewer (IGV)
fastq-dump
fastq-dump extracts the reads from an SRR file obtained from the Gene Expression Omnibus (GEO) database. It produces one fastq file, or two in the case of paired-end reads.
Output directory: results/fastq-dump
sample.fastq
- FastQ file to process, from the corresponding sample.
seqkit & bbduk
SeqKit is a toolkit for fasta and fastq file manipulation. The pipeline uses it when the positive/negative strands need to be flipped, which depends on the library prep protocol. BBDuk is a trimming tool used to remove adapters and to filter reads by quality and by their length after adapter removal.
Output directory: results/bbduk, qc/trimstats
sample.trim.fastq
- Trimmed FastQ file for each sample.
{refstats,trimstats,ehist}.txt
- Trimming details including adapters removed, percentages of reads removed that did not meet minimum quality/length
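As an illustration of what flipping a read involves, the sketch below reverse-complements a single FASTQ record. It is a minimal sketch of the concept only: the function name `flip_fastq_record` is hypothetical, and the exact SeqKit command the pipeline runs is not stated here.

```python
# Complement table for DNA bases, covering N and lowercase letters as well.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def flip_fastq_record(header, sequence, qualities):
    """Reverse-complement one FASTQ record.

    The quality string is reversed too, so that every quality score
    stays attached to the base it was reported for.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality strings differ in length")
    flipped_sequence = sequence.translate(_COMPLEMENT)[::-1]
    return header, flipped_sequence, qualities[::-1]


# Example: a five-base read and its quality string.
print(flip_fastq_record("@read1", "AACGT", "IIHG#"))
```

The header is left unchanged so that flipped reads can still be matched to their originals.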
FastQC
FastQC gives general quality metrics about your reads. It reports the quality score distribution across your reads and the per-base sequence content (%T/A/G/C), as well as adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: The FastQC plots displayed in the MultiQC report show both untrimmed and trimmed reads.
Output directory: results/qc
sample_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files & trimmed fastq files
zips/sample_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
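The zip archive can also be inspected programmatically. The sketch below reads the per-module PASS/WARN/FAIL summary, assuming the archive contains FastQC's usual tab-delimited `summary.txt` (status, module name, file name); the function name `read_fastqc_summary` is hypothetical.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from a FastQC zip archive.

    Assumes the archive holds a tab-delimited summary.txt whose first
    two columns are the status (PASS/WARN/FAIL) and the module name.
    """
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        # Find summary.txt whatever the top-level folder is called.
        names = [n for n in archive.namelist() if n.endswith("summary.txt")]
        if not names:
            raise FileNotFoundError("no summary.txt found in the archive")
        with archive.open(names[0]) as handle:
            for raw_line in handle:
                fields = raw_line.decode("utf-8").rstrip("\r\n").split("\t")
                if len(fields) >= 2:
                    results[fields[1]] = fields[0]
    return results


# Usage (path is illustrative):
# statuses = read_fastqc_summary("results/qc/zips/sample_fastqc.zip")
# failed = [module for module, status in statuses.items() if status == "FAIL"]
```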
MultiQC
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report, and further statistics are available within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
Output directory: results/multiqc
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
Project_multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see https://multiqc.info
hisat2
HISAT2 is a sequence alignment tool used to map the trimmed reads to the corresponding reference genome. Because of their size, the resulting SAM files are not kept once the pipeline has finished.
If the indices needed for mapping are not provided or present, a separate process builds them first. This step can take a few minutes, but it only needs to run once.
samtools
Samtools is a suite of tools to handle format conversions, among other things, for high-throughput sequencing data. We also use Samtools to generate the list of chromosome sizes, if not provided for the desired reference genome.
Output directory: results/mapped/bams
sample.trim.sorted.bam
- Mapped sample in BAM format
sample.trim.sorted.bam.bai
- Index for the sample.trim.sorted.bam mapped sample in BAM format
Output directory: results/qc/mapstats
sample.trim.sorted.bam.flagstat
- Overall mapping statistics
sample.trim.sorted.bam.millionsmapped
- File containing the number of uniquely mapped reads (multi-mapped reads are not counted). This number is used for normalization
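The flagstat file is plain text and easy to parse. The sketch below extracts the total and mapped read counts, assuming the usual `samtools flagstat` line layout (`<passed> + <failed> <label> (...)`); the function names and the example text are illustrative, not pipeline output.

```python
import re

# Illustrative flagstat text; real files contain more lines.
EXAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""


def parse_flagstat(text):
    """Return QC-passed counts for the 'in total' and 'mapped' lines."""
    counts = {}
    for line in text.splitlines():
        # Anchored pattern, so 'primary mapped' lines are not picked up.
        match = re.match(r"^(\d+) \+ (\d+) (in total|mapped) \(", line)
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


def mapped_fraction(text):
    """Fraction of QC-passed reads that were mapped."""
    counts = parse_flagstat(text)
    return counts["mapped"] / counts["in total"]


print(parse_flagstat(EXAMPLE_FLAGSTAT))
print(mapped_fraction(EXAMPLE_FLAGSTAT))
```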
preseq
Preseq estimates the complexity of a sample and predicts how many additional unique reads would be obtained if the sample were sequenced at a higher read depth.
Output directory: results/qc/preseq
sample.trim.c_curve.txt
- Curve generated based on number of unique reads vs. total reads sequenced
sample.trim.lc_extrap.txt
- Extrapolation of the c_curve that models the predicted number of unique reads if the sample were sequenced to a greater depth
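A complexity curve can be summarised as a single number. The sketch below reads a c_curve file and reports the duplication fraction at the deepest point, assuming the file has two whitespace-separated columns (total reads, distinct reads) below a header line; the function names are hypothetical.

```python
def read_c_curve(lines):
    """Parse (total_reads, distinct_reads) pairs, skipping non-numeric lines."""
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # header line such as 'total_reads distinct_reads'
    return points


def duplication_at_max_depth(points):
    """Fraction of reads that are duplicates at the last point of the curve."""
    if not points:
        raise ValueError("the curve has no data points")
    total, distinct = points[-1]
    return 1.0 - distinct / total if total else 0.0


# Illustrative values, not real pipeline output.
example = [
    "total_reads\tdistinct_reads",
    "0\t0",
    "1000000\t900000",
    "2000000\t1600000",
]
print(duplication_at_max_depth(read_c_curve(example)))
```

A curve that flattens early (a high duplication fraction) suggests that deeper sequencing would yield few new unique reads.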
rseqc
RSeQC provides a number of modules for comprehensively evaluating high-throughput sequencing data. We use it in this pipeline to analyze read distributions.
Output directory: results/qc/rseqc
sample.trim.read_dist.txt
- Distribution of reads relative to a gene reference file
pileup
BBMap includes a tool called pileup, which analyzes the sequencing coverage for each sample.
Output directory: results/qc/pileup
sample.trim.coverage.hist.txt
- Histogram of read coverage over each chromosome
sample.trim.coverage.stats.txt
- Coverage stats broken down by chromosome including %GC, pos/neg read coverage, total coverage, etc.
bedtools
bedtools is an extensive toolkit for manipulating BED and bedGraph files, for example sorting, intersecting and joining them. The files produced here can later be processed with Tfit or dREG to find regions of active transcription and transcriptional regulatory elements.
Output directory: results/mapped/bedgraphs
sample.trim.bedGraph
- Sample coverage file in bedGraph format
sample.trim.pos.bedGraph
- Sample coverage file (positive strand only) in bedGraph format
sample.trim.neg.bedGraph
- Sample coverage file (negative strand only) in bedGraph format
Output directory: results/mapped/rcc_bedgraphs
sample.trim.rcc.bedGraph
- Normalized sample coverage file in bedGraph format
sample.pos.trim.rcc.bedGraph
- Normalized sample coverage file (positive strand only) in bedGraph format
sample.neg.trim.rcc.bedGraph
- Normalized sample coverage file (negative strand only) in bedGraph format
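To illustrate what normalizing a coverage file by read depth can look like, the sketch below scales every bedGraph value to reads per million mapped reads. This is an assumption for illustration: the exact formula the pipeline applies is not given here, and `normalize_bedgraph` is a hypothetical name.

```python
def normalize_bedgraph(lines, mapped_reads):
    """Yield bedGraph lines with values scaled to reads per million.

    ASSUMPTION: per-million scaling is used for illustration only; the
    pipeline's own normalization may differ.
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    scale = 1_000_000 / mapped_reads
    for line in lines:
        stripped = line.rstrip("\n")
        # Pass through blank lines, comments, and track/browser lines.
        if not stripped.strip() or stripped.startswith(("track", "browser", "#")):
            yield stripped
            continue
        chrom, start, end, value = stripped.split()[:4]
        yield f"{chrom}\t{start}\t{end}\t{float(value) * scale:.4f}"


# Illustrative input: two intervals from a sample with 2,000,000 mapped reads.
for out_line in normalize_bedgraph(["chr1\t0\t100\t5\n", "chr1\t100\t200\t-4\n"], 2_000_000):
    print(out_line)
```

Scaling by a per-sample factor keeps the interval coordinates intact, so normalized tracks from different samples can be compared directly.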
Output directory: results/mapped/dreg_input
sample.trim.pos.rcc.bw
- Sample coverage file (positive strand only) in BigWig format
sample.trim.neg.rcc.bw
- Sample coverage file (negative strand only) in BigWig format
Output directory: results/mapped/rcc_bigwig
sample.trim.pos.rcc.bw
- Normalized sample coverage file (positive strand only) in BigWig format
sample.trim.neg.rcc.bw
- Normalized sample coverage file (negative strand only) in BigWig format
igvtools
igvtools is a command-line tool we use to produce a compressed version of the sample coverage file, so that it can be visualized in IGV more efficiently (with a significantly smaller memory footprint).
Output directory: results/mapped/tdfs
sample.trim.rpkm.tdf
- Sample coverage file in TDF format