nf-core/fastquorum

Pipeline to produce consensus reads using unique molecular indexes/barcodes (UMIs)

Tags: consensus, umi, umis, unique-molecular-identifier

This is the development version of the pipeline. View the latest v2.0.0 release.

Launch development version https://github.com/nf-core/fastquorum

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarizes results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Preprocessing
1. FastQC - Raw read QC
2. fgbio FastqToBam - Fastq to BAM, extracting UMIs
3. fgbio CorrectUmis - Correct non-random UMIs (optional)
4. BWA - Align the raw reads
Grouping
1. fgbio GroupReadsByUmi - Group raw reads by UMI (to identify reads from the same source molecule)
Consensus Reads
1. fgbio CallDuplexConsensusReads - Call duplex consensus reads for Duplex-Sequencing data
2. fgbio CallMolecularConsensusReads - Call single-strand consensus reads for non-Duplex-Sequencing data
3. BWA - Align the consensus reads
Consensus Filtering
1. fgbio FilterConsensusReads - Filter consensus reads
Quality Control and Metrics
1. fgbio CollectDuplexSeqMetrics - QC for Duplex-Sequencing data
2. MultiQC - Present raw read QC

Note: the High Throughput version of the pipeline performs consensus calling and consensus filtering in one step, with alignment of the consensus reads occurring after filtering. This speeds up the workflow significantly by eliminating an intermediate file (the pre-filtered consensus reads), and also reduces the number of consensus reads that need to be aligned (usually a minor speedup).

Preprocessing

FastQC

Output files

In the examples below meta.id will be library_id if defined and sample if library_id is not defined.
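The fallback rule above can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_meta_id` is hypothetical, and treating an empty string as "not defined" is an assumption rather than documented pipeline behaviour.

```python
from typing import Optional


def resolve_meta_id(sample: str, library_id: Optional[str] = None) -> str:
    """Return library_id when it is defined, otherwise fall back to sample.

    Assumption: an empty or missing library_id counts as "not defined".
    """
    return library_id if library_id else sample


print(resolve_meta_id("sampleA", "lib1"))  # library_id is defined
print(resolve_meta_id("sampleA"))          # falls back to the sample name
```

With this rule, a sample `sampleA` with no `library_id` would have its FastQC output under `{outdir}/preprocessing/fastqc/sampleA`.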

Output directory: {outdir}/preprocessing/fastqc/<meta.id>

*_fastqc.html: FastQC report containing quality metrics.
*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

Note

The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of potentially low quality.

fgbio FastqToBam

Output files

Output directory: {outdir}/preprocessing/fastqtobam/<meta.id>

*.unmapped.bam
- the unmapped BAM produced by fgbio FastqToBam
- the RX SAM tag stores the raw bases of the read's unique molecular identifier (UMI)
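To inspect the RX tag without extra libraries, one option is to convert the BAM to SAM text (for example with `samtools view`) and read the optional fields directly. The sketch below assumes that conversion has already happened; the helper name `get_sam_tag` and the example record are made up for illustration.

```python
from typing import Optional


def get_sam_tag(sam_line: str, tag: str) -> Optional[str]:
    """Return the value of an optional SAM tag (e.g. RX), or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        if name == tag:
            return value
    return None


# A made-up unmapped record, for illustration only.
record = "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRX:Z:AAC-TTG\tRG:Z:A"
print(get_sam_tag(record, "RX"))
```

The same helper can be used later in the pipeline to read the MI tag from grouped reads.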

fgbio CorrectUmis

Corrects UMI sequences against a known set of expected UMIs. Only runs for samples with a umi_file specified in the samplesheet.

Output files

Output directory: {outdir}/preprocessing/correctumis/<meta.id>

*.corrected.bam
- the unmapped BAM with corrected UMI sequences in the RX tag, produced by fgbio CorrectUmis
- original (uncorrected) UMI sequences are preserved in the OX tag
*.rejected.bam
- reads whose UMIs could not be corrected (for QC purposes; these reads are excluded from downstream processing)
*.correct-umis-metrics.txt
- metrics on UMI correction rates
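The idea of correcting against a known set can be illustrated with a nearest-match search. This is a simplified sketch, not fgbio's implementation: the function names, the Hamming-distance rule, and the `max_mismatches` and `min_distance` defaults are all assumptions chosen for the example.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_umi(
    observed: str,
    expected: Iterable[str],
    max_mismatches: int = 1,   # hypothetical threshold
    min_distance: int = 2,     # hypothetical margin over the runner-up
) -> Optional[str]:
    """Return the closest expected UMI, or None if the match is poor or ambiguous."""
    candidates = [u for u in expected if len(u) == len(observed)]
    scored = sorted((hamming(observed, u), u) for u in candidates)
    if not scored:
        return None
    best_dist, best = scored[0]
    if best_dist > max_mismatches:
        return None  # too far from every expected UMI: would be rejected
    if len(scored) > 1 and scored[1][0] - best_dist < min_distance:
        return None  # second-best is too close: ambiguous
    return best


known = ["AAAAAA", "CCCCCC", "GGGGGG"]
print(correct_umi("AAAAAT", known))  # one mismatch from AAAAAA
print(correct_umi("AAATTT", known))  # too many mismatches
```

In this sketch, reads for which `correct_umi` returns `None` correspond conceptually to those written to `*.rejected.bam`.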

BWA (raw reads)

Aligns the raw reads to the genome, and then template-coordinate sorts the reads in preparation for grouping.

Output files

Output directory: {outdir}/preprocessing/align_raw_bam/<meta.id>

*.mapped.bam
- the mapped BAM produced by:
  - aligning with bwa mem
  - reformatting with fgbio ZipperBam (to transfer any SAM tags from the unmapped BAM to the mapped BAM, since these are not carried forward by BWA)
  - template-coordinate sorting with samtools sort
- the RX SAM tag stores the raw bases of the read's unique molecular identifier (UMI)

Grouping

fgbio GroupReadsByUmi

Groups the reads by their UMI, identifying reads that originate from the same source molecule.

Output files

Output directory: {outdir}/grouping/groupreadsbyumi/<meta.id>

*.mapped.bam
- the grouped BAM produced by fgbio GroupReadsByUmi
- the MI SAM tag stores the molecular identifier assigned to the read after grouping
*.grouped-family-sizes.txt
- the metrics file produced by fgbio GroupReadsByUmi that describes the distribution of tag family sizes observed during grouping (see the fgbio GroupReadsByUmi documentation)
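To make "distribution of tag family sizes" concrete, the sketch below counts how many families of each size occur, given one MI value per template. It is an illustration only: the function name is hypothetical, the input values are made up, and the real metrics file may contain additional columns.

```python
from collections import Counter
from typing import Dict, Iterable


def family_size_histogram(mi_values: Iterable[str]) -> Dict[int, int]:
    """Map family size -> number of families, given one MI value per template."""
    templates_per_family = Counter(mi_values)
    families_per_size = Counter(templates_per_family.values())
    return dict(sorted(families_per_size.items()))


# Made-up MI values: family "0" has 3 templates, "2" has 2, "1" and "3" have 1 each.
mi = ["0", "0", "0", "1", "2", "2", "3"]
print(family_size_histogram(mi))  # {1: 2, 2: 1, 3: 1}
```

A distribution dominated by families of size 1 suggests that few source molecules were sequenced more than once, which limits what consensus calling can correct.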

Consensus Calling

Both consensus calling tools, fgbio CallDuplexConsensusReads and fgbio CallMolecularConsensusReads, write their output to the same location.

Output files

Output directory: {outdir}/consensus_calling/called/<meta.id>

*.cons.unmapped.bam
- the BAM with consensus calls
- see the fgbio documentation for a description of the SAM tags added to consensus reads

fgbio CallDuplexConsensusReads

Calls duplex consensus reads.

fgbio CallMolecularConsensusReads

Calls single-strand consensus reads.
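As a rough intuition for what a consensus call does, the sketch below collapses a family of reads with a per-position majority vote. This is deliberately simplified and is not the method fgbio uses, which is quality-aware; the function name and the tie-to-`N` rule are assumptions made for the example.

```python
from collections import Counter
from typing import List


def majority_consensus(reads: List[str], no_call: str = "N") -> str:
    """Per-position majority vote over reads from one family (toy example)."""
    if not reads:
        return ""
    length = min(len(r) for r in reads)  # only positions covered by every read
    consensus = []
    for i in range(length):
        counts = Counter(r[i] for r in reads).most_common()
        top_base, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        consensus.append(no_call if tied else top_base)
    return "".join(consensus)


family = ["ACGTAC", "ACGTAC", "ACCTAC"]  # made-up reads from one source molecule
print(majority_consensus(family))  # ACGTAC
```

The single discordant base in the third read is outvoted, which is the basic way consensus calling suppresses sequencing errors.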

Consensus Filtering

fgbio FilterConsensusReads

Filters consensus reads. Two kinds of filtering are performed:

- Masking/filtering of individual bases in reads
- Filtering out of reads (i.e. not writing them to the output file)

See fgbio FilterConsensusReads for more details.
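The two kinds of filtering can be illustrated as follows: low-quality bases are masked to `N`, and a read is dropped when too many of its bases end up masked. This is a minimal sketch under assumed rules; the function name and both thresholds are hypothetical, and the actual tool applies further criteria described in its documentation.

```python
from typing import List, Optional


def filter_consensus_read(
    bases: str,
    quals: List[int],
    min_base_quality: int = 30,          # hypothetical threshold
    max_no_call_fraction: float = 0.2,   # hypothetical threshold
) -> Optional[str]:
    """Mask low-quality bases to N; return None if the read should be dropped."""
    masked = "".join(
        base if qual >= min_base_quality else "N"
        for base, qual in zip(bases, quals)
    )
    if not masked:
        return None
    if masked.count("N") / len(masked) > max_no_call_fraction:
        return None  # read-level filter: too many no-calls
    return masked


print(filter_consensus_read("ACGTA", [40, 40, 10, 40, 40]))  # one base masked
print(filter_consensus_read("ACGTA", [10, 10, 40, 40, 40]))  # read dropped
```

The first call keeps the read with one masked base; the second returns `None` because two of five bases are masked, which exceeds the assumed 20% limit.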

Output files

Output directory: {outdir}/consensus_filtering/filtered/<meta.id>

*.cons.filtered.bam
- the BAM with filtered consensus calls produced by fgbio FilterConsensusReads

BWA (consensus reads)

Aligns the consensus reads to the genome.

Output files

Output directory: {outdir}/filtering/align_consensus_bam/<meta.id>

*.mapped.bam
- the mapped BAM produced by:
  - aligning with bwa mem
  - reformatting with fgbio ZipperBam (to transfer any SAM tags from the unmapped BAM to the mapped BAM, since these are not carried forward by BWA)
*.mapped.bam.bai
- the mapped BAM index (high-throughput mode only)

Quality Control and Metrics

fgbio CollectDuplexSeqMetrics

Collects metrics specific to duplex sequencing.

Output files

Output directory: {outdir}/metrics/duplex_seq/<meta.id>

Metrics produced by fgbio CollectDuplexSeqMetrics:

*.family_sizes.txt - metrics on the frequency of different types of families of different sizes
*.duplex_family_sizes.txt - metrics on the frequency of duplex tag families by the number of observations from each strand
*.duplex_yield_metrics.txt - summary QC metrics produced using 5%, 10%, 15%…100% of the data
*.umi_counts.txt - metrics on the frequency of observations of UMIs within reads and tag families
*.duplex_qc.pdf - a series of plots generated from the preceding metrics files for visualization
*.duplex_umi_counts.txt - (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the --duplex-umi-counts option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
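The text metrics files above can be loaded for downstream analysis with the standard library. The sketch below assumes each file is tab-delimited with a single header row; the function name is hypothetical and the column names in the example are invented, so check the header of the real file before relying on them.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_metrics(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-delimited metrics file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented content standing in for a metrics file; real column names will differ.
example = io.StringIO("family_size\tcount\n1\t120\n2\t45\n")
rows = read_metrics(example)
print(rows[0]["family_size"], rows[0]["count"])
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.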

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameter is used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
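The parameters recorded for a run can be read back programmatically, for example to check which settings produced a given set of results. This sketch assumes the file is named exactly `params.json` inside `pipeline_info/`, as listed above; the function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_run_params(results_dir: Union[str, Path]) -> Dict[str, Any]:
    """Load the parameters recorded for a pipeline run.

    Assumption: the file lives at <results_dir>/pipeline_info/params.json.
    """
    path = Path(results_dir) / "pipeline_info" / "params.json"
    with path.open() as handle:
        return json.load(handle)
```

Calling `load_run_params("results")` would return a dictionary of parameter names and values, provided `results` is the top-level results directory of a finished run.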
