nf-core/fastquorum
Pipeline to produce consensus reads using unique molecular indexes/barcodes (UMIs)
This page documents version 1.0.0. The latest stable release is 1.1.0.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
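As a quick orientation, the per-sample output directories named throughout this document can be collected into one helper. This is a minimal sketch: the relative paths are copied from the "Output directory" lines in the sections below, while the function name and the `results` / `sample1` values are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Per-sample output directories, relative to the results directory,
# copied from the "Output directory" lines in this document.
SAMPLE_DIRS = [
    "preprocessing/fastqc",
    "preprocessing/fastqtobam",
    "preprocessing/align_raw_bam",
    "grouping/groupreadsbyumi",
    "consensus_calling/called",
    "consensus_filtering/filtered",
    "filtering/align_consensus_bam",
    "metrics/duplex_seq",
]


def sample_output_dirs(outdir: str, sample: str) -> list[str]:
    """Return the documented per-sample directories under `outdir`."""
    root = PurePosixPath(outdir)
    return [str(root / rel / sample) for rel in SAMPLE_DIRS]


# "results" and "sample1" are hypothetical values for illustration.
dirs = sample_output_dirs("results", "sample1")
for d in dirs:
    print(d)
```

Which of these directories actually exist after a run depends on the mode and the type of data, for example the duplex metrics directory only applies to Duplex-Sequencing data.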
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Preprocessing
  - FastQC - Raw read QC
  - fgbio FastqToBam - Fastq to BAM, extracting UMIs
  - BWA - Align the raw reads
- Grouping
  - fgbio GroupReadsByUmi - Group raw reads by UMI (to identify reads from the same source molecule)
- Consensus Reads
  - fgbio CallDuplexConsensusReads - Call duplex consensus reads for Duplex-Sequencing data
  - fgbio CallMolecularConsensusReads - Call single-strand consensus reads for non-Duplex-Sequencing data
  - BWA - Align the consensus reads
- Consensus Filtering
  - fgbio FilterConsensusReads - Filter consensus reads
- Quality Control and Metrics
  - fgbio CollectDuplexSeqMetrics - QC for Duplex-Sequencing data
  - MultiQC - Present raw read QC
Note: the High Throughput version of the pipeline performs consensus calling and consensus filtering in one step, with the alignment of consensus reads occurring after filtering. This significantly speeds up the workflow by eliminating an intermediate file (the pre-filtered consensus reads) and, usually to a lesser degree, by reducing the number of consensus reads that need to be aligned.
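The difference in step order between the two modes can be sketched as follows. This is only an illustration of the ordering described above, not the pipeline's Nextflow code; the function name and the boolean arguments are assumptions, and the QC steps are left out for brevity.

```python
# Steps shared by both modes, named as in the pipeline overview.
COMMON_STEPS = [
    "fgbio FastqToBam",
    "BWA (raw reads)",
    "fgbio GroupReadsByUmi",
]


def consensus_steps(high_throughput: bool, duplex: bool) -> list[str]:
    """Return the ordered steps from FASTQ to filtered, aligned consensus reads."""
    caller = (
        "fgbio CallDuplexConsensusReads"
        if duplex
        else "fgbio CallMolecularConsensusReads"
    )
    if high_throughput:
        # Calling and filtering happen in one step; alignment comes last.
        tail = [f"{caller} + fgbio FilterConsensusReads", "BWA (consensus reads)"]
    else:
        # Consensus reads are aligned first and filtered afterwards.
        tail = [caller, "BWA (consensus reads)", "fgbio FilterConsensusReads"]
    return COMMON_STEPS + tail


default_order = consensus_steps(high_throughput=False, duplex=True)
ht_order = consensus_steps(high_throughput=True, duplex=True)
print(" -> ".join(default_order))
print(" -> ".join(ht_order))
```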
Preprocessing
FastQC
Output files
Output directory: {outdir}/preprocessing/fastqc/<sample>
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
fgbio FastqToBam
Output files
Output directory: {outdir}/preprocessing/fastqtobam/<sample>
- `*.unmapped.bam`
  - the unmapped BAM produced by fgbio FastqToBam.
  - the RX SAM tag stores the raw bases for the read's unique molecular identifier (UMI)
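To illustrate where the RX tag lives in a record, here is a minimal sketch that pulls an optional tag out of a single SAM-format text line (for instance, a line obtained by converting the BAM to SAM text). It relies only on the SAM layout of 11 mandatory columns followed by `TAG:TYPE:VALUE` fields; the function name and the example record are hypothetical.

```python
from typing import Optional


def get_sam_tag(sam_line: str, tag: str) -> Optional[str]:
    """Return the value of an optional SAM tag (e.g. RX), or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional fields follow as TAG:TYPE:VALUE.
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        if name == tag:
            return value
    return None


# Hypothetical unmapped record carrying an 8-base UMI in the RX tag.
record = "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\tRX:Z:AACCGGTT"
umi = get_sam_tag(record, "RX")
print(umi)
```

The same helper works for other tags described later in this document, such as MI after grouping.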
BWA (raw reads)
Aligns the raw reads to the genome, and then template-coordinate sorts the reads in preparation for grouping.
Output files
Output directory: {outdir}/preprocessing/align_raw_bam/<sample>
- `*.mapped.bam`
  - the mapped BAM produced by:
    - aligning with bwa mem
    - reformatted by fgbio ZipperBam (to transfer any SAM tags from the unmapped BAM to the mapped BAM, since these are not carried forward by BWA)
    - template-coordinate sorted by samtools sort
  - the RX SAM tag stores the raw bases for the read's unique molecular identifier (UMI)
Grouping
fgbio GroupReadsByUmi
Groups the reads by their UMI, identifying reads that originate from the same source molecule.
Output files
Output directory: {outdir}/grouping/groupreadsbyumi/<sample>
- `*.mapped.bam`
  - the grouped BAM produced by fgbio GroupReadsByUmi
  - the MI SAM tag stores the molecular identifier for the read after grouping.
- `*.grouped-family-sizes.txt`
  - the metric produced by fgbio GroupReadsByUmi that describes the distribution of tag family sizes observed during grouping (see this link).
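The idea behind a family-size distribution can be sketched from MI tag values alone. This is a simplified illustration, not a reimplementation of the fgbio metric: it assumes one MI value per template, and it assumes that duplex data may carry a strand suffix such as `/A` or `/B` on the MI value, which is stripped so both strands count towards the same source molecule.

```python
from collections import Counter


def family_size_distribution(mi_values: list[str]) -> dict[int, int]:
    """Map family size -> number of families of that size."""
    # Assumption: a trailing strand suffix ("/A", "/B") may be present; drop it.
    per_molecule = Counter(mi.split("/")[0] for mi in mi_values)
    # Count how many molecules were observed 1 time, 2 times, and so on.
    sizes = Counter(per_molecule.values())
    return dict(sorted(sizes.items()))


# Hypothetical MI values, one per template.
mi_tags = ["0", "0", "0", "1", "2", "2"]
dist = family_size_distribution(mi_tags)
print(dist)
```

Here molecule `0` has three observations, `1` has one and `2` has two, so each family size from 1 to 3 occurs once.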
Consensus Calling
This section describes the output of both consensus calling tools, fgbio CallDuplexConsensusReads and fgbio CallMolecularConsensusReads.
Output files
Output directory: {outdir}/consensus_calling/called/<sample>
- `*.cons.unmapped.bam`
  - the BAM with consensus calls
  - see this description of the SAM tags added to consensus reads.
fgbio CallDuplexConsensusReads
Calls duplex consensus reads.
fgbio CallMolecularConsensusReads
Calls single-strand consensus reads.
Consensus Filtering
fgbio FilterConsensusReads
Filters consensus reads. Two kinds of filtering are performed:
- Masking/filtering of individual bases in reads
- Filtering out of reads (i.e. not writing them to the output file)
See fgbio FilterConsensusReads for more details.
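The two kinds of filtering can be illustrated with a small sketch. It is deliberately simplified and is not the fgbio implementation: the function name, the two thresholds and their default values are assumptions chosen for the example, and the real tool applies further criteria.

```python
from typing import Optional


def filter_read(
    bases: str,
    quals: list[int],
    min_base_quality: int = 20,         # hypothetical threshold
    max_no_call_fraction: float = 0.2,  # hypothetical threshold
) -> Optional[str]:
    """Mask low-quality bases, then drop the read if too many bases are masked."""
    # 1) Masking of individual bases: low-quality bases become no-calls (N).
    masked = "".join(
        base if qual >= min_base_quality else "N"
        for base, qual in zip(bases, quals)
    )
    if not masked:
        return None
    # 2) Filtering out of reads: reads with too many no-calls are not written.
    if masked.count("N") / len(masked) > max_no_call_fraction:
        return None
    return masked


kept = filter_read("ACGTA", [30, 30, 10, 30, 30])
dropped = filter_read("ACGT", [5, 5, 30, 30])
print(kept, dropped)
```

In the first call a single base is masked and the read is kept; in the second, half of the bases are masked and the read is dropped.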
Output files
Output directory: {outdir}/consensus_filtering/filtered/<sample>
- `*.cons.filtered.bam`
  - the BAM with filtered consensus calls produced by fgbio FilterConsensusReads
BWA (consensus reads)
Aligns the consensus reads to the genome.
Output files
Output directory: {outdir}/filtering/align_consensus_bam/<sample>
- `*.mapped.bam`
  - the mapped BAM produced by:
    - aligning with bwa mem
    - reformatted by fgbio ZipperBam (to transfer any SAM tags from the unmapped BAM to the mapped BAM, since these are not carried forward by BWA)
- `*.mapped.bam.bai`
  - the mapped BAM index (high-throughput mode only)
Quality Control and Metrics
fgbio CollectDuplexSeqMetrics
Collect duplex sequencing specific metrics.
Output files
Output directory: {outdir}/metrics/duplex_seq/<sample>
Metrics produced by fgbio CollectDuplexSeqMetrics:
- `*.family_sizes.txt` - metrics on the frequency of different types of families of different sizes
- `*.duplex_family_sizes.txt` - metrics on the frequency of duplex tag families by the number of observations from each strand
- `*.duplex_yield_metrics.txt` - summary QC metrics produced using 5%, 10%, 15%…100% of the data
- `*.umi_counts.txt` - metrics on the frequency of observations of UMIs within reads and tag families
- `*.duplex_qc.pdf` - a series of plots generated from the preceding metrics files for visualization
- `*.duplex_umi_counts.txt` - (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the `--duplex-umi-counts` option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
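A small check for which of these files are present for a sample might look like the sketch below. The file suffixes and the directory layout come from this section; the function name and the temporary demo directory are illustrative assumptions.

```python
import tempfile
from pathlib import Path
from typing import Optional

# File suffixes listed in this section.
METRIC_SUFFIXES = [
    "family_sizes.txt",
    "duplex_family_sizes.txt",
    "duplex_yield_metrics.txt",
    "umi_counts.txt",
    "duplex_qc.pdf",
    "duplex_umi_counts.txt",  # optional output
]


def find_duplex_metrics(outdir: str, sample: str) -> dict[str, Optional[Path]]:
    """Map each documented suffix to the first matching file, or None."""
    sample_dir = Path(outdir) / "metrics" / "duplex_seq" / sample
    found: dict[str, Optional[Path]] = {}
    for suffix in METRIC_SUFFIXES:
        matches = sorted(sample_dir.glob(f"*.{suffix}"))
        found[suffix] = matches[0] if matches else None
    return found


# Demo against a throwaway directory containing two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "metrics" / "duplex_seq" / "sample1"
    demo_dir.mkdir(parents=True)
    (demo_dir / "sample1.family_sizes.txt").touch()
    (demo_dir / "sample1.duplex_family_sizes.txt").touch()
    report = find_duplex_metrics(tmp, "sample1")
    present = sorted(s for s, p in report.items() if p is not None)

print(present)
```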
MultiQC
Output files
- `multiqc/`
  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
  - `multiqc_plots/`: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter is used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
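For example, the recorded parameters can be loaded for inspection with a few lines of Python. This sketch assumes the file is named exactly `params.json` inside `pipeline_info/`, as listed above; the function name and the keys in the demo file are hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def load_run_params(outdir: str) -> dict:
    """Load the parameters recorded for a run from pipeline_info/params.json."""
    params_path = Path(outdir) / "pipeline_info" / "params.json"
    with open(params_path) as handle:
        return json.load(handle)


# Demo with a stand-in file; the keys below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    info_dir = Path(tmp) / "pipeline_info"
    info_dir.mkdir()
    (info_dir / "params.json").write_text(
        json.dumps({"outdir": "results", "example_flag": True})
    )
    params = load_run_params(tmp)

for key, value in sorted(params.items()):
    print(f"{key} = {value}")
```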