CLIP sequencing analysis pipeline for QC, pre-mapping, genome mapping, UMI deduplication, and multiple peak-calling options.
This document describes the output produced by the pipeline. The plots are taken from the MultiQC report, which summarises results at the end of the pipeline and also includes CLIP-specific summary metrics.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
- Crosslink identification
- Peak calling
- Motif finding
- CLIP summary and quality control
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences.
For further reading and documentation see the FastQC help pages.
*_fastqc.html: FastQC report containing quality metrics for your untrimmed raw fastq files.
*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
Optionally, UMI-tools is used to extract the UMI from the beginning of the read and append it to the read name using _ as a delimiter. This has often already been done as part of demultiplexing an Illumina sequencing run.
NB: As each FASTQ contains only one sample, the whole barcode (experimental and UMI) can be treated as the UMI as described here.
sample.umi.fastq.gz: FASTQ file of reads with UMIs extracted
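The read-name convention used by the UMI extraction step can be illustrated with a minimal Python sketch. This is not the UMI-tools implementation: the helper name `extract_umi` and the 5 nt UMI length are assumptions chosen purely for illustration.

```python
def extract_umi(name, seq, qual, umi_len=5):
    """Move the first umi_len bases of a read into its name (hypothetical helper)."""
    if len(seq) < umi_len:
        raise ValueError("read is shorter than the UMI length")
    umi = seq[:umi_len]
    # Split off any comment after the first space so that the UMI is
    # appended to the read identifier itself, using _ as the delimiter.
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}"
    if comment:
        new_name = f"{new_name} {comment}"
    # The UMI bases and their qualities are removed from the read.
    return new_name, seq[umi_len:], qual[umi_len:]


record = extract_umi("@read1 1:N:0", "ACGTTGGGCCCAAA", "IIIIIFFFFFFFFF")
print(record)  # ('@read1_ACGTT 1:N:0', 'GGGCCCAAA', 'FFFFFFFFF')
```

Keeping the UMI in the read name means it survives trimming and alignment, so it is still available at the deduplication step.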
Cutadapt removes adapters and also quality trims the data. By default the pipeline trims the Illumina universal adapter sequences and filters out reads that are shorter than 12 nt after trimming.
sample.trimmed.fastq.gz: FASTQ file after trimming
sample.cutadapt.log: Cutadapt log file
For CLIP data analysis it is often important to pre-map to rRNA and tRNA sequences. FASTA files for a number of organisms are provided as part of the pipeline. Bowtie 2 is used to identify these reads.
sample.premapped.bam: BAM file of reads mapped to the premapping index
sample.premapped.bam.bai: BAI file for BAM
sample.premap.log: Premapping (Bowtie 2) log file
sample.unmapped.fastq.gz: FASTQ file of reads that do not map to the premapping index. These reads are passed to the next step of the pipeline.
STAR is used to align to the genome. Importantly, soft-clipping of the 5’ end of the read is prevented, ensuring the crosslink position can be correctly identified.
sample.Aligned.sortedByCoord.bam: BAM file of reads mapped to the genome
sample.Aligned.sortedByCoord.bam.bai: BAI file for BAM
sample.Log.final.out: Alignment (STAR) log file
UMI-tools is used for UMI-aware PCR deduplication, with the directional method.
sample.dedup.bam: BAM file of deduplicated reads
sample.dedup.bam.bai: BAI file for BAM
sample.log: Deduplication (UMI-tools) log file
BEDTools is used to identify the crosslinks from the BAM files. The crosslink BED files are single-nucleotide resolution (i.e. each entry is 1 nt wide) and the score is the number of crosslinks at that position. In the crosslink BEDGRAPH files, a positive score indicates a crosslink on the positive strand and a negative score one on the negative strand.
sample.xl.bed.gz: BED file of crosslinks
sample.xl.bedgraph.gz: BEDGRAPH file of crosslinks
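The relationship between the two crosslink files can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's own code: it assumes the crosslink BED file has the standard six columns (chrom, start, end, name, score, strand) with integer scores, and the helper name `bed_to_bedgraph_line` is hypothetical.

```python
def bed_to_bedgraph_line(bed_line):
    """Convert one 6-column crosslink BED line to a signed BEDGRAPH line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, _name, score, strand = fields[:6]
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    value = int(score)
    # Crosslinks on the negative strand are written with a negative score.
    signed = value if strand == "+" else -value
    return f"{chrom}\t{start}\t{end}\t{signed}"


example = [
    "chr1\t100\t101\t.\t3\t+",
    "chr1\t250\t251\t.\t7\t-",
]
for line in example:
    print(bed_to_bedgraph_line(line))
```

Encoding the strand in the sign lets both strands be displayed in a single genome browser track.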
The following peak callers are currently provided in the pipeline:
The user can specify which one(s) are run. Filenames with the default run parameters are shown below, but are adjusted by the pipeline according to the parameters specified.
sample.3nt.sigxl.bed.gz: BED file of significant crosslink positions using a 3 nt half-window setting
sample.3nt_3nt.peaks.bed.gz: BED file of peaks using a 3 nt half-window and a 3 nt merge window
sample.10_200nt_2.peaks.bed.gz: BED file of peaks using a minimum value/score of 10, a maximum cluster length of 200 and a minimum density increase of 2.
sample.sigxl.bed.gz: BED file of significant crosslink sites
sample.8nt.peaks.bed.gz: BED file of peaks using a merge distance of 8 nt
sample.3nt_3nt.peaks.bed.gz: BED file of peaks using a bin size of 3 and a cluster distance of 3
DREME is used for basic motif calling using peaks. By default the sequence from the region +/- 20 nt around the crosslink site is provided as input for DREME.
sample_dreme/: Directory containing DREME output files:
CLIP summary and quality control
The pipeline uses a custom script to provide CLIP-specific summary metrics that are plotted in the MultiQC report.
This section plots the counts/percentages of reads mapped to the premapping index, mapped to the genome, and that remain unmapped.
This section plots three measures from the UMI-based PCR deduplication.
- Reads shows the number of reads before and after deduplication.
- Ratios shows the PCR deduplication ratio.
- Mean UMIs shows the mean number of unique UMIs per position.
This section plots two measures from crosslink identification.
- Counts shows the number of crosslinks and crosslink sites.
- Ratios shows the ratio of crosslinks to crosslink sites.
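As a rough illustration of the two crosslink measures above, the following Python sketch derives them from crosslink BED lines. It assumes that the number of crosslinks is the sum of the score column and that the number of crosslink sites is the number of BED entries. The function name `crosslink_summary` is hypothetical, and the pipeline's own script may compute these values differently.

```python
def crosslink_summary(bed_lines):
    """Return (crosslinks, crosslink_sites, ratio) for 6-column crosslink BED lines."""
    crosslinks = 0
    sites = 0
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        crosslinks += int(fields[4])  # score column: crosslinks at this position
        sites += 1                    # each entry is one 1 nt crosslink site
    ratio = crosslinks / sites if sites else 0.0
    return crosslinks, sites, ratio


bed = [
    "chr1\t100\t101\t.\t3\t+",
    "chr1\t250\t251\t.\t7\t-",
    "chr2\t40\t41\t.\t2\t+",
]
print(crosslink_summary(bed))  # (12, 3, 4.0)
```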
This section plots three peak-calling metrics (if peak calling has been performed) to enable comparison of different tools and optimisation of specific peak-caller parameters.
- Crosslinks in peaks shows the total percentage of crosslinks within peaks.
- Crosslink sites in peaks shows the total percentage of crosslink sites within peaks.
- Peak-crosslink coverage shows the total percentage of nucleotides within peaks that are covered by a crosslink site.
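The three peak-calling metrics above can be sketched in Python as below. This is a simplified illustration rather than the pipeline's script. It assumes strand-aware matching, 0-based half-open peak coordinates, and one entry per crosslink site. The function name `peak_metrics` and the dictionary keys are hypothetical.

```python
def peak_metrics(crosslinks, peaks):
    """Compute peak-calling percentages.

    crosslinks: iterable of (chrom, pos, strand, score), one entry per site.
    peaks: iterable of (chrom, start, end, strand), 0-based and half-open.
    """
    # Expand the peaks into the set of stranded nucleotides they cover.
    peak_positions = set()
    for chrom, start, end, strand in peaks:
        for pos in range(start, end):
            peak_positions.add((chrom, pos, strand))

    total_xl = total_sites = xl_in_peaks = sites_in_peaks = 0
    for chrom, pos, strand, score in crosslinks:
        total_xl += score
        total_sites += 1
        if (chrom, pos, strand) in peak_positions:
            xl_in_peaks += score
            sites_in_peaks += 1

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "crosslinks_in_peaks": pct(xl_in_peaks, total_xl),
        "crosslink_sites_in_peaks": pct(sites_in_peaks, total_sites),
        # Share of nucleotides within peaks that carry a crosslink site.
        "peak_crosslink_coverage": pct(sites_in_peaks, len(peak_positions)),
    }


xl = [("chr1", 100, "+", 3), ("chr1", 102, "+", 5), ("chr1", 500, "+", 2)]
pk = [("chr1", 98, 103, "+")]
print(peak_metrics(xl, pk))
```

Expanding peaks into a set of positions keeps the sketch short but uses memory proportional to the total peak length. An interval-based tool such as BEDTools would be the more practical choice for real data.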
*.tsv: TSV files containing derived metrics from the pipeline outputs used to produce the MultiQC CLIP summary metric plots.
Preseq is used to estimate the complexity of the sequenced library.
sample.ccurve.txt: TXT file of complexity curve Preseq output.
sample.command.log: Preseq log file
RSeQC is used to calculate how deduplicated mapped reads are distributed over genomic features.
sample.read_distribution.txt: TXT file of read_distribution.py output from RSeQC.
MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability.
For more information about how to use MultiQC reports, see https://multiqc.info.
multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/: directory containing static images from the report in various formats.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
- Reports generated by Nextflow:
- Reports generated by the pipeline:
- Documentation for interpretation of results in HTML format: