Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  • MultiQC - Aggregate report describing results and QC from the whole pipeline
  • Pipeline information - Report metrics generated during the workflow execution
  • Bracken - Database files for Bracken
  • Centrifuge - Database files for Centrifuge
  • DIAMOND - Database files for DIAMOND
  • Kaiju - Database files for Kaiju
  • Kraken2 - Database files for Kraken2
  • MALT - Database files for MALT

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

Output files
  • bracken/
    • <db_name>/
      • database100mers.kmer_distrib: Bracken kmer distribution file
      • database100mers.kraken: Bracken index file
      • database.kraken: Bracken database file
      • hash.k2d: Kraken2 hash database file
      • opts.k2d: Kraken2 opts database file
      • taxo.k2d: Kraken2 taxo database file
      • library/: Intermediate Kraken2 directory containing FASTAs and related files of added genomes
      • taxonomy/: Intermediate Kraken2 directory containing taxonomy files of added genomes
      • seqid2taxid.map: Intermediate Kraken2 file mapping sequence IDs to taxonomy IDs of added genomes

Note that all intermediate files are required to build the Bracken database, even though Kraken2 itself only requires the *.k2d files.

The resulting <db_name>/ directory can be given to Bracken itself with bracken -d <your_database_name> etc.
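A typical Bracken workflow first classifies reads with Kraken2 against the same database and then re-estimates abundances. The sketch below is a hypothetical example: the sample file names, read length (-r) and taxonomic level (-l) are placeholders to adjust for your data.

```shell
# Classify reads with Kraken2 using the database built by the pipeline
kraken2 --db results/bracken/<db_name>/ \
    --report sample.k2report \
    --output sample.kraken2.txt \
    sample.fastq.gz

# Re-estimate species-level abundances with Bracken
# -r: read length used to build the kmer distribution (here 100, matching database100mers.*)
# -l: taxonomic level for re-estimation (S = species)
bracken -d results/bracken/<db_name>/ \
    -i sample.k2report \
    -o sample.bracken.tsv \
    -r 100 -l S
```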

Centrifuge

Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.

Output files
  • centrifuge/
    • <database>.*.cf: Centrifuge database files

A directory and cf files can be given to the Centrifuge command with centrifuge -x /<path>/<to>/<cf_files_basename> etc.
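Note that -x takes the shared basename of the .cf files, not a single file. A hypothetical invocation (sample and output file names are placeholders) might look like:

```shell
# Classify single-end reads against the Centrifuge index;
# <database> is the basename shared by the .cf files
centrifuge -x results/centrifuge/<database> \
    -U sample.fastq.gz \
    -S sample.centrifuge.txt \
    --report-file sample.centrifuge.report.txt
```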

DIAMOND

DIAMOND is an accelerated, BLAST-compatible local sequence aligner particularly used for protein alignment.

Output files
  • diamond/
    • <database>.dmnd: DIAMOND dmnd database file

The dmnd file can be given to one of the DIAMOND alignment commands with diamond blastx -d <your_database>.dmnd (or diamond blastp for protein queries) etc.
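For translated searches of reads against the protein database, a hypothetical blastx invocation (query and output names are placeholders) could be:

```shell
# Align nucleotide reads against the protein database (translated search);
# --outfmt 6 produces BLAST-style tabular output
diamond blastx -d results/diamond/<database>.dmnd \
    -q sample.fastq.gz \
    -o sample.diamond.tsv \
    --outfmt 6
```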

Kaiju

Kaiju is a fast and sensitive taxonomic classification program for metagenomics that utilises nucleotide-to-protein translations.

Output files
  • kaiju/
    • <database_name>.fmi: Kaiju FMI index file

The fmi file can be given to Kaiju itself with kaiju -f <your_database>.fmi etc.
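Kaiju also needs the NCBI taxonomy nodes.dmp file that the index was built from. A hypothetical invocation (the nodes.dmp path and sample names are placeholders) might be:

```shell
# Classify reads with Kaiju: -t is the taxonomy nodes file,
# -f the FMI index produced by the pipeline
kaiju -t nodes.dmp \
    -f results/kaiju/<database_name>.fmi \
    -i sample.fastq.gz \
    -o sample.kaiju.txt
```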

Kraken2

Kraken2 is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.

Output files
  • kraken2/
    • <db_name>/
      • hash.k2d: Kraken2 hash database file
      • opts.k2d: Kraken2 opts database file
      • taxo.k2d: Kraken2 taxo database file
      • library/: Intermediate directory containing FASTAs and related files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
      • taxonomy/: Intermediate directory containing taxonomy files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
      • seqid2taxid.map: Intermediate file mapping sequence IDs to taxonomy IDs of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)

The resulting <db_name>/ directory can be given to Kraken2 itself with kraken2 --db <your_database_name> etc.
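A hypothetical classification run against the resulting directory (sample and output names are placeholders) could look like:

```shell
# Classify single-end reads against the Kraken2 database directory;
# --report writes the per-taxon summary used by many downstream tools
kraken2 --db results/kraken2/<db_name>/ \
    --report sample.k2report \
    --output sample.kraken2.txt \
    sample.fastq.gz
```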

MALT

MALT is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

Output files
  • malt/
    • malt_index/: directory containing MALT index files

The malt_index directory can be given to MALT itself with malt-run --index <your_database>/ etc.
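A hypothetical malt-run invocation (the alignment mode and file names are placeholders; choose the mode matching how the index was built) might be:

```shell
# Align reads against the MALT index in nucleotide (BlastN) mode
malt-run --index results/malt/malt_index/ \
    --mode BlastN \
    --inFile sample.fastq.gz \
    --output malt_results/
```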