nf-core/detaxizer
A pipeline to identify (and remove) certain sequences from raw genomic data. Default taxa to identify (and remove) are Homo and Homo sapiens. Removal is optional.
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. <sample> is a placeholder for the real sample name provided in the samplesheet.csv.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - Raw read QC - Output not in the results directory by default
- fastp - (Optional) preprocessing of raw reads
- kraken2 - Classification of the (preprocessed) reads and extracting the searched taxa from the results
- bbduk - Classification of the (preprocessed) reads
- classification - Preparation of the read IDs for filtering and/or validation
- blastn - (Optional) validation of the reads classified as the searched taxa and extracting ids of validated reads
- filter - (Optional) filtering of the raw or preprocessed reads using either the read ids from kraken2 and/or bbduk output or blastn output
- summary - The summary of the classification and the optional validation
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
By default, only the filtering results, the summary, MultiQC and pipeline information are shown in the results folder. In addition, if the output of the filter step is classified using kraken2, a kraken2 folder containing a filtered/ and a removed/ folder is shown.
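As a quick sanity check after a run, the default folders can be looked up programmatically. The following is a minimal sketch, not part of the pipeline: the function name `missing_default_dirs` is hypothetical, and the folder names are taken from the per-step sections of this document. Note that filter/ is only expected when the optional filtering step was enabled.

```python
from pathlib import Path

# Folder names as used in the per-step sections of this document.
# "filter" is only produced when the optional filtering step is enabled.
DEFAULT_DIRS = ("filter", "summary", "multiqc", "pipeline_info")


def missing_default_dirs(results_dir):
    """Return the default output folders that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in DEFAULT_DIRS if not (root / name).is_dir()]
```

An empty list means every default folder was found.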
FastQC
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
fastp
fastp performs preprocessing of the reads (adapter/quality trimming). For details of the output, please refer to this site.
Output files
- fastp/: Contains the output from the preprocessing step.
  - <sample>_longReads/: If long reads are present in your samplesheet.csv, this folder is generated, containing the fastp report.
    - <sample>_longReads.fastp.html: The report on the preprocessing step.
    - <sample>_longReads.fastp.json: The data on the preprocessing step in JSON format.
  - <sample>_R1/: If single-end short reads are present in your samplesheet.csv, this folder is generated.
    - Same pattern as in <sample>_longReads/, with the prefix <sample>_R1.fastp.*.
  - <sample>/: For paired-end short reads in your samplesheet.csv, this folder is generated.
    - Same pattern as in <sample>_longReads/, with the prefix <sample>.fastp.*.
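The naming rule above (suffix _longReads for long reads, _R1 for single-end short reads, no suffix for paired-end short reads) can be sketched as a small helper. The function names and the read-type labels "long", "single" and "paired" are hypothetical and only serve this illustration.

```python
def output_prefix(sample, read_type):
    """Return the per-sample prefix for a given read type.

    read_type is a hypothetical label: "long", "single" or "paired".
    """
    suffixes = {"long": "_longReads", "single": "_R1", "paired": ""}
    if read_type not in suffixes:
        raise ValueError(f"unknown read type: {read_type!r}")
    return sample + suffixes[read_type]


def fastp_reports(sample, read_type):
    """Return the relative paths of the fastp HTML and JSON reports."""
    prefix = output_prefix(sample, read_type)
    return [f"fastp/{prefix}/{prefix}.fastp.{ext}" for ext in ("html", "json")]
```

For example, `fastp_reports("s1", "single")` yields paths under fastp/s1_R1/.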
kraken2
kraken2 classifies the reads. The important files are *.classifiedreads.txt, *.kraken2.report.txt, isolated/*.classified.txt and summary/*.kraken2_summary.tsv.
<sample> can be replaced by <sample>_longReads, <sample>_R1 or left as <sample>, depending on the cases mentioned in fastp.
Output files
- kraken2/: Contains the output from the kraken2 classification steps.
  - filtered/: Contains the classification of the filtered reads (post-filtering).
    - <sample>.classifiedreads.txt: The whole kraken2 output for filtered reads.
    - <sample>.kraken2.report.txt: Statistics on how many reads were assigned to which taxon/taxonomic group in the filtered reads.
  - isolated/: Contains the isolated lines and ids for the taxon/taxa mentioned in the tax2filter parameter.
    - <sample>.classified.txt: The whole kraken2 output for the taxon/taxa mentioned in the tax2filter parameter.
    - <sample>.ids.txt: The ids from the whole kraken2 output assigned to the taxon/taxa mentioned in the tax2filter parameter.
  - removed/: Contains the classification of the removed reads (post-filtering).
    - <sample>.classifiedreads.txt: The whole kraken2 output for removed reads.
    - <sample>.kraken2.report.txt: Statistics on how many reads were assigned to which taxon/taxonomic group in the removed reads.
  - summary/: Summary of the kraken2 process.
    - <sample>.kraken2_summary.tsv: Contains three columns: column 1 is the sample name, column 2 the number of lines in the untouched kraken2 output and column 3 the number of lines in the isolated output.
  - taxonomy/: Contains the list of taxa to filter/to assess for.
    - taxa_to_filter.txt: Contains the taxon ids of all taxa to assess the data for or to filter out.
  - <sample>.classifiedreads.txt: The whole kraken2 output for all reads.
  - <sample>.kraken2.report.txt: Statistics on how many reads were assigned to which taxon/taxonomic group.
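To illustrate how the isolated/ files relate to the whole kraken2 output, here is a minimal sketch that pulls out the read IDs assigned to a set of taxon ids. It assumes kraken2's standard tab-separated output (status C/U, read ID, taxonomy ID, read length, LCA mapping) with a plain numeric taxonomy ID in the third column. The function name `isolate_ids` is hypothetical, and the pipeline's own isolation step may apply further criteria.

```python
def isolate_ids(kraken2_lines, taxon_ids):
    """Return read IDs of classified reads assigned to one of taxon_ids.

    Assumes kraken2's standard tab-separated output: status (C/U), read ID,
    taxonomy ID, read length, LCA mapping, with a numeric taxonomy ID.
    """
    wanted = {str(t) for t in taxon_ids}
    ids = []
    for line in kraken2_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in wanted:
            ids.append(read_id)
    return ids
```

With the default taxa, the call could look like `isolate_ids(lines, [9605, 9606])`, where 9605 and 9606 are the NCBI taxonomy ids of Homo and Homo sapiens.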
bbduk
bbduk classifies the reads. The important files are *.bbduk.log and ids/*.bbduk.txt. <sample> can be replaced by <sample>_longReads, <sample>_R1 or left as <sample>, depending on the cases mentioned in fastp.
Output files
- bbduk/: Contains the output from the bbduk classification step.
  - ids/: Contains the files with the IDs classified by bbduk.
    - <sample>.bbduk.txt: Contains the classified IDs per sample.
  - <sample>.bbduk.log: Contains statistics on the bbduk run.
classification
Either the merged IDs from bbduk and kraken2 or the ones produced by one of the tools are shown in this folder. Also, the summary files of the classification step are shown.
Output files
- classification/: Contains the results and the summaries of the classification step.
  - ids/: Contains either the merged ID files of the classification step or the ones from one classification tool.
    - <sample>.ids.txt: Contains the classified IDs.
  - summary/: Contains the summary files of either the classification step or the ones from one classification tool.
    - <sample>.classification_summary.tsv: Contains the count of reads classified.
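When both tools are used, the merged ID file can be thought of as a combination of the two per-tool ID lists. The sketch below assumes that "merged" means the union of both lists and that each ID file holds one read ID per line; both are assumptions, and the helper names are hypothetical.

```python
def read_ids(lines):
    """Collect non-empty read IDs, assuming one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def merge_ids(kraken2_lines, bbduk_lines):
    """Return the union of IDs flagged by either tool, sorted for stable output."""
    return sorted(read_ids(kraken2_lines) | read_ids(bbduk_lines))
```

The length of the returned list would then correspond to the count of classified reads for that sample under the same assumptions.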
blastn
blastn can validate the reads classified by kraken2 as the taxon/taxa to be assessed/to be filtered. To reduce the computational burden, only the highest scoring hit per input sequence is returned. If more information is needed, the max_hsps and max_target_seqs flags can be changed in the modules.config file.
Output files
- blast/
  - filtered_ident_cov/: The read ids and statistics of the reads which were validated by blastn to be the taxon/taxa to assess/to filter.
    - <sample>_R1.identcov.txt: File is present for single-end and paired-end short reads.
    - <sample>_R2.identcov.txt: File is present for paired-end short reads.
    - <sample>_longReads.identcov.txt: File is present for long reads.
  - summary/: Short overview of the number of reads which were validated by blastn.
    - <sample>.blastn_summary.tsv: <sample> can take one of two forms for this file: it either stays as <sample> or becomes <sample>_longReads for long reads.
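The validation by identity and coverage can be pictured as a simple threshold filter over the blastn hits. The following sketch works on an in-memory list of tuples rather than on the identcov.txt files, whose exact layout is not described here. The tuple layout, the function name and the threshold values in the usage note are all hypothetical.

```python
def validated_ids(hits, min_identity, min_coverage):
    """Return query IDs whose hit meets both thresholds.

    hits: iterable of (query_id, percent_identity, percent_coverage) tuples,
    a hypothetical in-memory form of one-hit-per-read blastn results.
    """
    return [
        query_id
        for query_id, identity, coverage in hits
        if identity >= min_identity and coverage >= min_coverage
    ]
```

For instance, `validated_ids(hits, 95.0, 90.0)` would keep only reads whose best hit has at least 95 % identity and 90 % coverage; these numbers are illustrative and not the pipeline defaults.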
filter
In this folder, the filtered and re-renamed reads can be found. This result has to be carefully examined using the other information in the results folder.
Output files
- filter/: Folder containing the filtered and re-renamed reads.
  - filtered/: Folder containing the decontaminated reads.
    - <sample>_filtered.fastq.gz: The filtered reads. <sample> can stay as <sample> for single-end short reads, take the pattern <sample>_{R1,R2} for paired-end reads and <sample>_longReads for long reads.
  - removed/: Folder containing the removed reads (optional).
    - <sample>_removed.fastq.gz: The removed reads. <sample> can stay as <sample> for single-end short reads, take the pattern <sample>_{R1,R2} for paired-end reads and <sample>_longReads for long reads.
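The relation between the filtered/ and removed/ files can be sketched as a split of FASTQ records by read ID. This is an illustration only and not the pipeline's implementation: it assumes unwrapped 4-line FASTQ records and that the read ID is the first whitespace-delimited token after the leading '@'. The function name `split_fastq` is hypothetical.

```python
def split_fastq(lines, ids_to_remove):
    """Split FASTQ lines into (kept, removed) lists of 4-line records.

    Assumes unwrapped 4-line records; the read ID is taken as the first
    whitespace-delimited token after '@' in the header line.
    """
    lines = list(lines)
    kept, removed = [], []
    # Ignore a trailing incomplete record, if any.
    for i in range(0, len(lines) - len(lines) % 4, 4):
        record = lines[i:i + 4]
        tokens = record[0][1:].split()
        read_id = tokens[0] if tokens else ""
        (removed if read_id in ids_to_remove else kept).append(record)
    return kept, removed
```

To apply this to a compressed file, the lines could be read with `gzip.open(path, "rt")` from the Python standard library.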
summary
The summary file lists all statistics of kraken2 and/or bbduk (and optionally blastn) per sample. It is a combination of the summary files of the classification step and blastn and can be used for a quick overview of the pipeline run. By default, only the summary of the classification step is shown.
| | classified with * | blastn_unique_ids | blastn_lines | filteredblastn_unique_ids | filteredblastn_lines |
|---|---|---|---|---|---|
| <sample> (For short reads it is the same as in the samplesheet.csv, for long reads it is <sample>_longReads) | Number of IDs classified in the classification step | Number of unique IDs in blastn output, should be the same as blastn_lines | Number of lines in the blastn output | Number of IDs in the blastn output after the filtering for identity and coverage, should be the same as filteredblastn_lines | Number of lines in the blastn output after the filtering for identity and coverage |
Output files
- summary/: Folder containing the summary.
  - summary.tsv: File containing the summary in the format stated above.
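The table above states that blastn_unique_ids should equal blastn_lines and that filteredblastn_unique_ids should equal filteredblastn_lines. A minimal sketch of that consistency check follows. It assumes a tab-separated file with a header row, the sample label in the first column and the column names shown in the table; the function name is hypothetical. Values are compared as strings, and rows without blastn columns (the default case) are accepted as-is.

```python
import csv
import io

# Column pairs that are expected to hold identical values.
PAIRS = [
    ("blastn_unique_ids", "blastn_lines"),
    ("filteredblastn_unique_ids", "filteredblastn_lines"),
]


def inconsistent_rows(tsv_text):
    """Return (row_label, column_a, column_b) for every pair that differs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        return []
    problems = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = dict(zip(header, row))
        for a, b in PAIRS:
            if a in values and b in values and values[a] != values[b]:
                problems.append((row[0], a, b))
    return problems
```

An empty result means no mismatch was found; any entry points to a sample worth inspecting in the blastn output.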
MultiQC
Output files
- multiqc/
  - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
  - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
  - multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
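As one way to use these reports for troubleshooting, the execution trace can be scanned for tasks that did not finish successfully. The sketch below assumes a tab-separated trace with a header row that includes name and status columns, and treats COMPLETED and CACHED as successful states. The function name `problem_tasks` is hypothetical.

```python
import csv
import io

# Task states treated as successful in this sketch.
OK_STATES = {"COMPLETED", "CACHED"}


def problem_tasks(trace_text):
    """Return (name, status) for every task whose status is not successful.

    Assumes a tab-separated trace with 'name' and 'status' header columns.
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"])
        for row in reader
        if row["status"] not in OK_STATES
    ]
```

The trace text can be obtained by reading execution_trace.txt from the pipeline_info/ folder.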