nf-core/seqinspector
Dedicated QC-only pipeline for sequencing data. The pipeline runs a (potentially large) set of QC tools and can output global and group-specific MultiQC reports. It targets core facilities and research groups with high sequencing throughput.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and can generate output files from the following steps:
- fq - Linting of FASTQ files to check for formatting issues
- CheckQC - QC of an Illumina run
- Rundirparser - Parse rundir metadata from Illumina runs
- ToulligQC - Raw read QC for Oxford Nanopore runs
- SeqFu - Statistics for FASTA or FASTQ files
- Seqtk - Subsample a specific number of reads per sample
- FastQC - Raw read QC
- FASTQE - Raw read QC
- Fastp - Trimming and filtering of raw reads
- FastQ Screen - Mapping against a set of references for basic contamination QC
- BWA-MEM2_INDEX - Create BWA-MEM2 index of a chosen reference genome OR use pre-built index
- BWA-MEM2_MEM - Mapping reads against a chosen reference genome
- Samtools - Index BAM files with Samtools
- Picard CollectHsMetrics - Collect alignment QC metrics of hybrid-selection data
- Picard CollectMultipleMetrics - Combine BAM and BAI outputs for Picard
- Kraken2 - Phylogenetic assignment of reads using k-mers
- Krona - Interactive visualization of Kraken2 results
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
FQ
Output files
- reports/fq/[sample_id]/
  - *.fq_lint.txt: Linting report for each FASTQ file containing information about the formatting of the FASTQ file and any potential issues.
fq lints FASTQ files to check for formatting issues.
CheckQC
Output files
- reports/checkqc/[rundir]/
  - checkqc_report.json: Reports sequencing metrics that are not fulfilled.

Note that the CheckQC module in MultiQC currently does not support BCL Convert data, so if the report is based on data from that demultiplexer, it will not be visualized in the MultiQC report. The results can still be found in the output directory.
Rundirparser
Output files
- reports/rundirparser/[rundir]/
  - [rundir]_illumina_mqc.yml: Reports sequencing metrics.

This is done via a custom script, found in bin/parse_illumina.py, that parses the runParameters.xml file. The resulting YAML file is formatted to be read by MultiQC. Results can be found in the output directory.
ToulligQC
Output files
- toulligqc/
  - *.data: ToulligQC output text file containing log information and all analysis results
  - *.html: ToulligQC HTML report file

ToulligQC is dedicated to the QC analysis of Oxford Nanopore runs. The software is written in Python and developed by the GenomiqueENS core facility of the Institute of Biology of the Ecole Normale Superieure (IBENS).
SeqFu
Output files
- reports/seqfu/[sample_id]/
  - *.tsv: Tab-separated file containing quality metrics.
  - *_mqc.txt: File containing the same quality metrics as the TSV file, ready to be read by MultiQC.

SeqFu is a general-purpose program to manipulate and parse information from FASTA/FASTQ files, with support for gzipped input. It includes functions to interleave and de-interleave FASTQ files, to rename sequences, and to count and print statistics on sequence lengths. In this pipeline, the seqfu stats module is used to produce general quality metrics.
Seqtk
Output files
- seqtk/
  - *_fastq: FastQ file after being subsampled to the sample_size value.
Seqtk samples sequences by number.
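To make "sampling by number" concrete, the sketch below draws a fixed number of FASTQ records from a stream using reservoir sampling with a fixed seed. It only illustrates the general idea and is not Seqtk's own code; the function names, the seed value and the demo reads are assumptions made for this example.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def subsample(records: Iterable[FastqRecord], sample_size: int, seed: int = 11) -> List[FastqRecord]:
    """Reservoir-sample sample_size records (or all of them, if fewer exist)."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < sample_size:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = record
    return reservoir


# Ten made-up reads, subsampled down to three.
demo_lines = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10)).splitlines()
picked = subsample(read_fastq(demo_lines), sample_size=3)
print(len(picked))
```

Using the same seed for both mates of a paired-end sample keeps the pairs in sync, since the same record positions are selected in each file.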
FastQC
Output files
- reports/fastqc/[sample_id]/
  - *_fastqc.html: FastQC report containing quality metrics.
  - *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
FASTQE
Output files
- reports/fastqe/[sample_id]/
  - *.tsv: FASTQE report containing quality metrics in emoji.

FASTQE computes quality stats for FASTQ files and prints those stats as emoji… for some reason.
FastP
FastP is a tool designed to provide all-in-one preprocessing for FastQ files and as such is used for trimming and splitting. The resulting trimmed files are not published. We only keep the reports for MultiQC and the pipeline report.
Output files
- reports/fastp/[sample_id]/
  - *.fastp.html: FastP HTML report.
  - *.fastp.json: FastP report containing quality metrics in JSON format.
  - *.fastp.log: FastP log file containing quality metrics.
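Because the trimmed reads are not published, the JSON report is the main way to check what FastP did to a sample. The sketch below computes the fraction of reads retained after filtering. It assumes the report follows the usual fastp layout with summary.before_filtering.total_reads and summary.after_filtering.total_reads; the numbers are made up, and in practice the dictionary would be loaded from one of the *.fastp.json files.

```python
import json


def reads_retained(report: dict) -> float:
    """Return the fraction of reads that survived fastp filtering."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return after / before if before else 0.0


# Made-up, heavily truncated stand-in for a *.fastp.json file.
demo_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
fraction = reads_retained(demo_report)
print(f"{fraction:.1%} of reads retained")
```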
FastQ Screen
Output files
- reports/fastqscreen/[sample_id]/
  - *_screen.html: Interactive graphical report.
  - *_screen.png: Static graphical report.
  - *_screen.txt: Text-based report.
FastQ Screen allows you to set up a standard set of references against which all of your samples can be mapped. Your references might contain the genomes of all of the organisms you work on, along with PhiX, vectors or other contaminants commonly seen in sequencing experiments.
To use FastQ Screen, this pipeline requires a .csv detailing:
- the working name of the reference
- the name of the aligner used to generate its index (which is also the aligner and index used by the tool)
- the file basename of the reference and its index (e.g. the reference genome.fa and its index genome.bt2 have the basename genome)
- the path to a dir where the reference and index files both reside.
See assets/example_fastq_screen_references.csv for an example.
The .csv is provided as a pipeline parameter fastq_screen_references and is used to construct a FastQ Screen configuration file within the context of the process work directory in order to properly mount the references.
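As a rough illustration of that step, the sketch below turns such a .csv into FastQ Screen configuration lines. The column names (name, aligner, basename, dir) and the example reference are assumptions made for this example, so check assets/example_fastq_screen_references.csv for the real header. The output follows the DATABASE line layout of FastQ Screen configuration files (name, index basename path, optional aligner) as commonly documented; this is not the pipeline's own code.

```python
import csv
import io
from pathlib import PurePosixPath


def build_config(csv_text: str) -> str:
    """Build FastQ Screen DATABASE lines from a references CSV.

    Assumes hypothetical columns: name, aligner, basename, dir.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The index is addressed by directory + basename, without extension.
        index_base = PurePosixPath(row["dir"]) / row["basename"]
        lines.append(
            "\t".join(["DATABASE", row["name"], str(index_base), row["aligner"].upper()])
        )
    return "\n".join(lines) + "\n"


# Made-up example row; paths and names are placeholders.
example_csv = "name,aligner,basename,dir\nPhiX,bowtie2,genome,/refs/phix\n"
config_text = build_config(example_csv)
print(config_text, end="")
```

Writing the configuration inside the process work directory, as the pipeline does, means the paths in the file can point at wherever the references are staged or mounted for that task.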
BWAMEM2_INDEX
Output files
Generates the full set of bwamem2 indexes:
- bwamem2_index/
  - *.fa
  - *.fa.amb
  - *.fa.ann
  - *.fa.bwt
  - *.fa.pac
BWAMEM2_MEM
BWA-MEM2 is the next version of the BWA-MEM algorithm for mapping sequences with low divergence against a reference genome, with increased processing speed (~1.3-3.1x). Aligned reads are then potentially filtered and coordinate-sorted using samtools.
Output files
- bwamem2/
  - *.bam: The original BAM file containing read alignments to the reference genome.
  - *.bam.bai: BAM index files
Samtools
Output files
- samtools_faidex
  - *.fa.fai
Picard CollectHsMetrics
Output files
- reports/picard_collecthsmetrics/[sample_id]/
  - *.coverage_metrics: Tab-separated file containing quality metrics for hybrid-selection data.

Picard CollectHsMetrics collects metrics from aligned SAM/BAM files that are specific to sequence datasets generated through hybrid selection (most commonly used to capture exon-specific sequences for targeted sequencing).
Picard CollectMultipleMetrics
Output files
- reports/picard_collectmultiplemetrics/[sample_id]/
  - *.CollectMultipleMetrics.alignment_summary_metrics
  - *.CollectMultipleMetrics.base_distribution_by_cycle_metrics
  - *.CollectMultipleMetrics.base_distribution_by_cycle.pdf
  - *.CollectMultipleMetrics.quality_by_cycle_metrics
  - *.CollectMultipleMetrics.quality_by_cycle.pdf
  - *.CollectMultipleMetrics.quality_distribution.pdf
  - *.CollectMultipleMetrics.read_length_histogram.pdf
Kraken2
Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
- kraken2/
  - <sample>.kraken2.report.txt: A report containing information on the phylogenetic assignment of reads in a given sample.
  - <db_name>/
    - <sample_id>_<db_name>.classified.fastq.gz: FASTQ file containing all reads that had a hit against a reference in the database for a given sample
    - <sample_id>_<db_name>.unclassified.fastq.gz: FASTQ file containing all reads that did not have a hit in the database for a given sample
    - <sample_id>_<db_name>.classifiedreads.txt: A list of read IDs and the hits each read had against each database for a given sample
The main taxonomic classification file from Kraken2 is the *report.txt file. It gives you the most information for a single sample.
You will only receive the .fastq and *classifiedreads.txt files if you supply the --kraken2_save_reads and/or --kraken2_save_readclassifications parameters to the pipeline.
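For a quick look at a *report.txt file outside MultiQC or Krona, a small parser can be enough. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage of reads, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name); reports written with extra minimizer columns would need adjusting. The report content and function names are made up for illustration.

```python
from typing import List, NamedTuple


class ReportRow(NamedTuple):
    percent: float     # percentage of reads in the clade rooted at this taxon
    clade_reads: int   # reads in the clade rooted at this taxon
    taxon_reads: int   # reads assigned directly to this taxon
    rank: str          # rank code, e.g. U, R, D, G, S
    taxid: int         # NCBI taxonomy ID
    name: str          # scientific name (indentation stripped)


def parse_report(text: str) -> List[ReportRow]:
    """Parse a six-column Kraken2 report into rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        percent, clade, direct, rank, taxid, name = line.split("\t")
        rows.append(
            ReportRow(float(percent), int(clade), int(direct), rank, int(taxid), name.strip())
        )
    return rows


def top_species(rows: List[ReportRow], n: int = 5) -> List[ReportRow]:
    """Return the n species-level rows with the most clade reads."""
    species = [r for r in rows if r.rank == "S"]
    return sorted(species, key=lambda r: r.clade_reads, reverse=True)[:n]


# Made-up report content for demonstration only.
demo_report = (
    " 10.00\t10\t10\tU\t0\tunclassified\n"
    " 90.00\t90\t0\tR\t1\troot\n"
    " 30.00\t30\t30\tS\t9606\t      Homo sapiens\n"
    " 60.00\t60\t60\tS\t562\t      Escherichia coli\n"
)
top = top_species(parse_report(demo_report), n=2)
for row in top:
    print(row.name, row.clade_reads)
```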
Krona
Krona allows the exploration of (metagenomic) hierarchical data using interactive, zoomable, multi-layered pie charts.
Krona charts will be generated by the pipeline for supported tools (Kraken2, Centrifuge, Kaiju, and MALT).
- krona/
  - <tool_name>_<db_name>.html: per-tool/per-database interactive HTML file containing hierarchical pie charts
The resulting HTML files can be loaded into your web browser for exploration. Each file will have a dropdown to allow you to switch between each sample aligned against the given database of the tool.
MultiQC
nf-core/seqinspector will generate the following MultiQC reports:
- one global report including all the samples listed in the samplesheet
- one group report per unique tag. These reports compile samples that share the same tag.
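The relationship between the two kinds of report can be sketched as a simple grouping step. This is only an illustration of the idea, not the pipeline's code: the sample names, the tag strings and the use of ":" to separate several tags on one sample are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_tag(samples: Iterable[Tuple[str, str]], sep: str = ":") -> Dict[str, List[str]]:
    """Map each unique tag to the samples carrying it.

    Each item is (sample_id, tags), where tags is a sep-delimited string.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample_id, tags in samples:
        for tag in tags.split(sep):
            tag = tag.strip()
            if tag:  # untagged samples only appear in the global report
                groups[tag].append(sample_id)
    return dict(groups)


# Made-up samplesheet content.
demo_samples = [("sample1", "tag1"), ("sample2", "tag1:tag2"), ("sample3", "")]
global_report = [sample_id for sample_id, _ in demo_samples]
group_reports = group_by_tag(demo_samples)
print(global_report)
print(group_reports)
```

A sample carrying several tags appears in several group reports, while the global report always contains every sample.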
Output files
- multiqc/
  - global_report
    - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    - multiqc_plots/: directory containing static images from the report in various formats.
  - group_reports
    - tag1/
      - multiqc_report.html
      - multiqc_data/
      - multiqc_plots/
    - tag2/
      - multiqc_report.html
      - multiqc_data/
      - multiqc_plots/
    - …
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
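When troubleshooting, the trace file is often the quickest place to start. The sketch below lists the tasks marked as FAILED in an execution_trace.txt, assuming the usual tab-separated layout with a header row that includes name and status columns. The trace content and process names are made up for this example, and the exact columns depend on how tracing is configured.

```python
import csv
import io
from typing import List


def failed_tasks(trace_text: str) -> List[str]:
    """Return the names of tasks whose status is FAILED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [row["name"] for row in reader if row["status"] == "FAILED"]


# Made-up, truncated trace content; a real file has more columns.
demo_trace = (
    "task_id\tname\tstatus\texit\n"
    "1\tFASTQC (sample1)\tCOMPLETED\t0\n"
    "2\tFASTP (sample2)\tFAILED\t1\n"
)
failed = failed_tasks(demo_trace)
print(failed)
```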