Nascent Transcription Processing Pipeline
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
* fastq-dump - if needed, extract the fastq file[s] from a sample
* SeqKit/bbduk - flip reads (experiment specific) & trim reads for adapters/quality/length
* FastQC - read quality control
* MultiQC - aggregate report, describing results of the whole pipeline
* HISAT2 - map reads to the reference genome
* Samtools - convert the mapped reads as SAM files to BAM format
* Preseq - estimate complexity of the sample
* RSeQC - analyze read distributions, infer experiment (SE/PE, whether reads need to be flipped), & read duplication
* BBMap - analyze coverage
* bedtools - create both normalized and non-normalized coverage files in bedGraph format
* igvtools - create compressed files to visualize the sample in the Integrative Genomics Viewer (IGV)
* FastQ file to process, from the corresponding sample.
seqkit & bbduk
SeqKit is a toolkit for FASTA and FASTQ file manipulation. The pipeline uses it when the positive/negative strands need to be flipped, which depends on the library prep protocol. BBDuk is a trimming tool used to filter reads by adapter content, read quality, and overall length after adapter removal.
* Trimmed FastQ file for each sample.
* Trimming details, including the adapters removed and the percentage of reads removed for not meeting the minimum quality/length
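The strand flip described above can be sketched in a few lines of Python. This is only an illustration, under the assumption that "flipping" a read means taking its reverse complement and reversing the quality string to match; the function name `flip_read` is hypothetical and is not part of SeqKit or the pipeline.

```python
from typing import Tuple

# Complement table for DNA bases, including N; case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def flip_read(sequence: str, quality: str) -> Tuple[str, str]:
    """Reverse-complement a read and reverse its quality string to match.

    Assumption: "flipping" means reverse complement. Hypothetical helper,
    not the pipeline's actual implementation.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Complement each base, then reverse the order of the bases.
    flipped_sequence = sequence.translate(_COMPLEMENT)[::-1]
    # Quality scores stay attached to their bases, so they are reversed too.
    flipped_quality = quality[::-1]
    return flipped_sequence, flipped_quality


if __name__ == "__main__":
    seq, qual = flip_read("ACGTTN", "IIIH#!")
    print(seq, qual)
```

Applying the function twice returns the original read, which is a quick sanity check for any flipping step.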
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per-base sequence content (%T/A/G/C), as well as adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
*NB:* The FastQC plots displayed in the MultiQC report show both untrimmed and trimmed reads.
* FastQC report, containing quality metrics for your untrimmed raw fastq files & trimmed fastq files
* zip file containing the FastQC report, tab-delimited data file and plot images
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report, and further statistics are available within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
* MultiQC report - a standalone HTML file that can be viewed in your web browser
* Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see https://multiqc.info
HISAT2 is a sequence alignment tool used to map the trimmed reads to the corresponding reference genome. Due to their size, the resulting SAM files are not kept after the pipeline has completed execution.
If the indices needed for mapping are not provided or present, a separate process builds them first. This step can take a few minutes, but it only needs to run once.
Samtools is a suite of tools to handle format conversions, among other things, for high-throughput sequencing data. We also use Samtools to generate the list of chromosome sizes, if not provided for the desired reference genome.
* Mapped sample in BAM format
* Index for the sample.trim.sorted.bam mapped sample in BAM format
* Overall mapping statistics
* File containing the number of uniquely mapped reads (not the total including multi-mapped reads), used in normalization
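A chromosome sizes list is a two-column table of sequence name and length. The sketch below shows one way such a list could be derived, assuming the starting point is a FASTA index in the `.fai` layout written by `samtools faidx`, where the first two tab-separated columns are the sequence name and its length. The function name `chrom_sizes_from_fai` and the offset values in the example are illustrative only; how the pipeline itself produces the file is not shown here.

```python
from typing import Dict


def chrom_sizes_from_fai(fai_text: str) -> Dict[str, int]:
    """Extract {sequence name: length} from the text of a .fai index.

    Assumption: input follows the faidx layout, with the name in column 1
    and the length in column 2. Hypothetical helper for illustration.
    """
    sizes: Dict[str, int] = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        sizes[fields[0]] = int(fields[1])
    return sizes


if __name__ == "__main__":
    # Offsets and line widths (columns 3 to 5) are made-up example values.
    example_fai = "chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105709\t60\t61\n"
    for name, length in chrom_sizes_from_fai(example_fai).items():
        # Chromosome sizes files are conventionally name<TAB>length.
        print(f"{name}\t{length}")
```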
Preseq plots the estimated complexity of a sample, and estimates future yields for complexity if the sample is sequenced at higher read depths.
* Curve generated based on number of unique reads vs. total reads sequenced
* Extrapolation of the c_curve that attempts to model the predicted number of unique reads if the sample was sequenced to a greater depth
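To make the complexity curve concrete, the sketch below computes the fraction of unique reads at each sequencing depth from a two-column table of total reads and distinct reads. It assumes the curve file is whitespace-delimited with an optional header line; the function name `complexity_fractions` and the numbers in the example are invented for illustration and are not real Preseq output.

```python
from typing import List, Tuple


def complexity_fractions(curve_text: str) -> List[Tuple[int, float]]:
    """Return (total_reads, unique_fraction) pairs from a complexity curve.

    Assumption: each data row holds total reads and distinct reads in its
    first two columns. Non-numeric rows (such as a header) are skipped.
    """
    points: List[Tuple[int, float]] = []
    for line in curve_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            total, distinct = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # header or other non-numeric row
        if total > 0:  # a depth of zero has no defined fraction
            points.append((int(total), distinct / total))
    return points


if __name__ == "__main__":
    # Made-up example values, not output from a real sample.
    example = "total_reads\tdistinct_reads\n0\t0\n1000000\t900000\n2000000\t1500000\n"
    for depth, fraction in complexity_fractions(example):
        print(f"{depth}\t{fraction:.2f}")
```

A fraction that falls quickly as depth grows suggests a low-complexity library, where additional sequencing mostly yields duplicate reads.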
RSeQC provides a number of modules for comprehensively evaluating high-throughput sequence data. We use it in this pipeline to analyze read distributions.
* Distribution of reads relative to a gene reference file
BBMap includes a tool called pileup, which analyzes the sequencing coverage for each sample.
* Histogram of read coverage over each chromosome
* Coverage stats broken down by chromosome including %GC, pos/neg read coverage, total coverage, etc.
bedtools is an extensive toolkit for manipulating BED and bedGraph files, with operations such as sorting, intersecting, and joining. The files produced here can be processed later with Tfit or dReg to find regions of active transcription and transcription regulatory elements.
* Sample coverage file in bedGraph format
* Sample coverage file (positive strand only) in bedGraph format
* Sample coverage file (negative strand only) in bedGraph format
* Normalized sample coverage file in bedGraph format
* Normalized sample coverage file (positive strand only) in bedGraph format
* Normalized sample coverage file (negative strand only) in bedGraph format
* Sample coverage file (positive strand only) in BigWig format
* Sample coverage file (negative strand only) in BigWig format
* Normalized sample coverage file (positive strand only) in BigWig format
* Normalized sample coverage file (negative strand only) in BigWig format
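The relationship between the non-normalized and normalized coverage files can be sketched as a simple rescaling. This example assumes that normalization means dividing each coverage value by the number of uniquely mapped reads expressed in millions (a reads-per-million scaling); the exact formula used by the pipeline is not stated here, and the function name `normalize_bedgraph` is hypothetical.

```python
from typing import Iterable, List


def normalize_bedgraph(lines: Iterable[str], mapped_reads: int) -> List[str]:
    """Scale bedGraph coverage values to coverage per million mapped reads.

    Assumption: normalized value = raw value / (mapped_reads / 1e6).
    Hypothetical helper, not the pipeline's actual implementation.
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    scale = mapped_reads / 1_000_000
    normalized: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and track/browser/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split("\t")[:4]
        normalized.append(f"{chrom}\t{start}\t{end}\t{float(value) / scale:.6g}")
    return normalized


if __name__ == "__main__":
    # Made-up intervals; the negative value stands in for negative-strand signal.
    raw = ["chr1\t0\t100\t10", "chr1\t100\t200\t-4"]
    for row in normalize_bedgraph(raw, mapped_reads=2_000_000):
        print(row)
```

Scaling by mapped reads makes coverage comparable between samples sequenced to different depths.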
igvtools is a command-line tool we use to produce a compressed version of the sample coverage file so that it can be visualized in IGV more efficiently (with a significantly smaller memory footprint).
* Sample coverage file in TDF format