Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Input Check

bgzip

bgzip compresses a file into a series of small ‘BGZF’ blocks in a similar manner to, and compatible with, gzip. This allows indexes to be built against the compressed file and used to retrieve portions of the data without having to decompress the entire file.

Output files
  • tabix_bgzip/
    • <INPUT_FASTA>.gz: The bgzip compressed input FASTA file.
    • <INPUT_FASTA>.community.[0-9]{1,}.fa.gz: A bgzipped compressed community FASTA file. Only appears when --communities is provided.
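
The gzip compatibility described above comes from the fact that a BGZF file is a series of concatenated gzip members. The following is a minimal sketch of that idea using only Python's standard library. It does not call bgzip, the helper name compress_in_blocks is made up for illustration, and the blocks are plain gzip members without the extra header field that real BGZF blocks carry.

```python
import gzip


def compress_in_blocks(data: bytes, block_size: int) -> bytes:
    """Compress data as independent gzip members, one per block.

    This only mimics the idea behind BGZF: real BGZF blocks store their
    own size in an extra header field, which is omitted here.
    """
    members = []
    for start in range(0, len(data), block_size):
        members.append(gzip.compress(data[start:start + block_size]))
    return b"".join(members)


fasta = b">seq1\nACGTACGTACGT\n>seq2\nTTTTGGGGCCCC\n"
blocked = compress_in_blocks(fasta, block_size=16)

# A standard gzip reader treats the concatenated members as one stream.
restored = gzip.decompress(blocked)
```

Because each block is compressed independently, a reader that knows where a block starts can decompress that block alone, which is what makes indexing possible.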

samtools faidx

samtools faidx indexes or queries regions from a FASTA file.

Output files
  • samtools_faidx/
    • <INPUT_FASTA>.fai: FASTA index of the file provided via --input.
    • <INPUT_FASTA>.gzi: Compressed FASTA index of the file provided via --input.
    • <INPUT_FASTA>.community.[0-9]{1,}.fa.gz.fai: FASTA index of a community FASTA file. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.fa.gz.gzi: Compressed FASTA index of a community FASTA file. Only appears when --communities is provided.
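
A .fai file is a tab-separated table with one row per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below shows how these five columns allow random access into an uncompressed FASTA file. The names FaiRecord, parse_fai and byte_offset are hypothetical helpers, not part of samtools; for bgzip-compressed files the .gzi index is additionally needed to translate such offsets into compressed blocks, which is not covered here.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str        # sequence name
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the file
    linebases: int   # bases per line
    linewidth: int   # bytes per line, including the newline


def parse_fai(text: str) -> dict[str, FaiRecord]:
    """Parse the contents of a .fai file into records keyed by name."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        records[name] = FaiRecord(
            name, int(length), int(offset), int(linebases), int(linewidth)
        )
    return records


def byte_offset(rec: FaiRecord, pos: int) -> int:
    """Byte offset of the 0-based base `pos` in the uncompressed FASTA."""
    if not 0 <= pos < rec.length:
        raise ValueError("position outside sequence")
    full_lines, column = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + column


# Toy FASTA with 8 bases per line, and the index line that describes it.
toy_fasta = ">chr1\nACGTACGT\nAC\n"
index = parse_fai("chr1\t10\t6\t8\t9\n")
first_base_of_second_line = toy_fasta[byte_offset(index["chr1"], 8)]
```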

Community Detection

paf2net

paf2net is a Python script that projects wfmash’s PAF mappings (the implied overlap and containment graph) into an edge list, a list of edge weights, and an ‘id to sequence name’ map.

Output files
  • paf2net/
    • <INPUT_FASTA>.paf.vertices.id2name.txt: TXT file with a mapping of vertex identifiers to sequence names. Only appears when --communities is provided.
    • <INPUT_FASTA>.paf.edges.weights.txt: TXT file with the weights of the edges. Only appears when --communities is provided.
    • <INPUT_FASTA>.paf.edges.list.txt: TXT file listing all edge connections. Only appears when --communities is provided.
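
To make the three outputs concrete, here is a rough sketch of such a projection. It is not the pipeline's script: the function name paf_to_network is hypothetical, the edge weight is assumed to be the number of matching bases (PAF column 10), and self-mappings are assumed to be skipped. The real script may differ on all of these points.

```python
def paf_to_network(paf_lines):
    """Project PAF records into (edges, weights, id_to_name).

    Assumption: the weight of an edge is the number of matching bases
    (PAF column 10); the real script may use a different formula.
    """
    name_to_id = {}
    edges = []
    weights = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines
        query, target = fields[0], fields[5]
        if query == target:
            continue  # assumption: self-mappings do not become edges
        for name in (query, target):
            # Assign vertex identifiers in order of first appearance.
            name_to_id.setdefault(name, len(name_to_id))
        edges.append((name_to_id[query], name_to_id[target]))
        weights.append(int(fields[9]))
    id_to_name = {idx: name for name, idx in name_to_id.items()}
    return edges, weights, id_to_name


example_paf = [
    "a\t100\t0\t100\t+\tb\t120\t10\t110\t95\t100\t60",
    "b\t120\t0\t50\t-\tc\t80\t0\t50\t40\t50\t60",
    "a\t100\t0\t100\t+\ta\t100\t0\t100\t100\t100\t60",
]
edges, weights, id_to_name = paf_to_network(example_paf)
```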

net2communities

net2communities is a Python script that detects communities by applying the Leiden algorithm (Traag et al., Scientific Reports 2019).

Output files
  • net2communities/
    • <INPUT_FASTA>.community.[0-9]{1,}.txt: TXT file with the sequence names of the community. Only appears when --communities is provided.

extract communities

extract communities is a locally modified module of samtools faidx. The original module only allowed for the indexing of FASTA files, but not for the extraction of single sequences.

Output files
  • extract_communities/
    • <INPUT_FASTA>.community.[0-9]{1,}.txt.fa: A community FASTA file. Only appears when --communities is provided.
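
Community files throughout the results carry a numeric community index after ".community.". When post-processing results, it can be handy to group files by that index. The following sketch assumes only this naming convention; the helper name group_by_community and the example file names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the numeric community index, e.g. ".community.12." -> 12.
COMMUNITY_RE = re.compile(r"\.community\.([0-9]{1,})\.")


def group_by_community(filenames):
    """Group file names by their community index; skip other files."""
    groups = defaultdict(list)
    for name in filenames:
        match = COMMUNITY_RE.search(name)
        if match:
            groups[int(match.group(1))].append(name)
    return dict(groups)


files = [
    "in.fa.gz.community.0.txt",
    "in.fa.gz.community.0.txt.fa",
    "in.fa.gz.community.12.txt",
    "in.fa.gz.paf",
]
grouped = group_by_community(files)
```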

wfmash

wfmash is an aligner for pangenomes based on sparse homology mapping and wavefront inception. In this pipeline it is used in several processes, each described below.

Output files
  • wfmash/
    • <INPUT_FASTA>.paf: PAF file containing CIGAR strings of all pairwise alignments.
    • <INPUT_FASTA>.community.[0-9]{1,}.paf: Community PAF file containing CIGAR strings of all pairwise alignments of the specific community. Only appears when --communities is provided.

wfmash map community

Here wfmash is applied in approximate mapping mode to create all-to-all mappings of all input sequences, which are then used for community detection.

Output files
  • wfmash_map_community/
    • <INPUT_FASTA>.paf: PAF file containing approximate mappings of all pairwise alignments.

wfmash map

Here wfmash is applied in approximate mapping mode to create all-to-all mappings of all input sequences. The base-pair-level alignment problem can then be split into several problems of similar size by the split approx mappings in chunks step.

Output files
  • wfmash_map/
    • <INPUT_FASTA>.paf: PAF file containing approximate mappings of all pairwise alignments.
    • <INPUT_FASTA>.community.[0-9]{1,}.paf: Community PAF file containing approximate mappings of all pairwise alignments of the specific community. Only appears when --communities is provided.

split approx mappings in chunks

split approx mappings in chunks is a python script that takes the approximate mappings, weighs each mapping by computing its length * (1 - estimated identity), then creates N new files where the mapping sets have a similar sum of weights.

Output files
  • wfmash_map/
    • <INPUT_FASTA>.paf.chunk_[0-9]{1,}.paf: PAF file containing the approximate mappings of a specific chunk.
    • <INPUT_FASTA>.community.[0-9]{1,}.paf.chunk_[0-9]{1,}.paf: Community PAF file containing the approximate mappings of a specific chunk of the specific community. Only appears when --communities is provided.
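
The weighting and balancing described above can be sketched as follows. This is an illustration, not the pipeline's script: the function name split_into_chunks is hypothetical, the greedy "heaviest mapping to the lightest chunk" strategy is an assumption about how similar weight sums could be reached, and extracting the length and estimated identity from each PAF record is left out.

```python
import heapq


def split_into_chunks(mappings, n_chunks):
    """Distribute mappings over n_chunks with similar total weight.

    `mappings` is a list of (record, length, identity) tuples, with
    identity given as a fraction between 0 and 1.
    """
    # Weight of a mapping: length * (1 - estimated identity).
    weighted = [
        (length * (1.0 - identity), record)
        for record, length, identity in mappings
    ]
    # Assumption: place heavy mappings first, each into the chunk that
    # currently has the smallest weight sum.
    weighted.sort(key=lambda item: item[0], reverse=True)
    heap = [(0.0, chunk_id) for chunk_id in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    for weight, record in weighted:
        total, chunk_id = heapq.heappop(heap)
        chunks[chunk_id].append(record)
        heapq.heappush(heap, (total + weight, chunk_id))
    return chunks


example = [
    ("m1", 800, 0.5),    # weight 400
    ("m2", 400, 0.5),    # weight 200
    ("m3", 800, 0.75),   # weight 200
    ("m4", 100, 0.0),    # weight 100
]
chunks = split_into_chunks(example, n_chunks=2)
```

Long, divergent mappings get the largest weights, so they are spread across chunks first and the later base-level alignment jobs take roughly comparable effort.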

wfmash align

Here wfmash is applied in base-pair-level alignment mode to refine the approximate all-to-all mappings of all input sequences.

Output files
  • wfmash_map/
    • <INPUT_FASTA>.chunk_[0-9]{1,}.paf: PAF file containing base level alignments of a specific chunk.
    • <INPUT_FASTA>.community.[0-9]{1,}.chunk_[0-9]{1,}.paf: Community PAF file containing base level alignments of a specific chunk of the specific community. Only appears when --communities is provided.

seqwish

seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments.

Output files
  • seqwish/
    • <INPUT_FASTA>.seqwish.gfa: Raw pangenome graph induced from the all-versus-all alignments.
    • <INPUT_FASTA>.community.[0-9]{1,}.seqwish.gfa: Community GFA file containing the raw pangenome graph induced from the all-versus-all alignments of the specific community. Only appears when --communities is provided.

smoothxg

smoothxg finds blocks of paths that are collinear within a variation graph. It applies partial order alignment to each block, yielding an acyclic variation graph. Then, to yield a “smoothed” graph, it walks the original paths to lace these subgraphs together. The resulting graph only contains cyclic or inverting structures larger than the chosen block size, and is otherwise manifold linear.

Output files
  • smoothxg/
    • <INPUT_FASTA>.smoothxg.gfa: Smoothed pangenome graph in GFA format.
    • <INPUT_FASTA>.community.[0-9]{1,}.smoothxg.gfa: Community GFA file containing the smoothed pangenome graph of the specific community. Only appears when --communities is provided.

gfaffix

gfaffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure.

Output files
  • gfaffix/
    • <INPUT_FASTA>.gfaffix.gfa: Non-node-redundant pangenome graph in GFA format.
    • <INPUT_FASTA>.gfaffix.txt: Graph shared affixes in TSV format.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.gfa: Community GFA file containing the non-node-redundant pangenome graph of the specific community. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.txt: Community TSV file containing the graph shared affixes in TSV format of the specific community. Only appears when --communities is provided.

odgi

odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses. In this pipeline, a huge variety of odgi’s subcommands are used to process the built graphs. As a rule of thumb, all files in ODGI format end with .og.

odgi build

odgi build constructs a dynamic succinct variation graph in ODGI format from a GFAv1 file.

Output files
  • odgi_build/
    • <INPUT_FASTA>.*.og: Graph in ODGI format.
    • <INPUT_FASTA>.community.[0-9]{1,}.*.og: Community ODGI file of the specific community. Only appears when --communities is provided.

odgi stats

odgi stats describes various metrics of a variation graph.

Output files
  • odgi_stats/
    • <INPUT_FASTA>.*.og.stats.yaml: YAML file with graph metrics.
    • <INPUT_FASTA>.community.[0-9]{1,}.*.og.stats.yaml: Community YAML file with graph metrics of the specific community. Only appears when --communities is provided.

odgi sort

odgi sort sorts a succinct variation graph. It offers a diverse palette of sorting algorithms to determine the node order.

In the pipeline itself, this is the last tool invoked before the final graph in ODGI format is written on disk.

Output files
  • FINAL_ODGI/
    • <INPUT_FASTA>.*.Ygs.og: Sorted variation graph in ODGI format.
    • <INPUT_FASTA>.community.[0-9]{1,}.*.Ygs.og: Community ODGI file with a sorted variation graph of the specific community. Only appears when --communities is provided.

The order of the sortings:

  • Y: PG-SGD
  • g: grooming
  • s: topological sort
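
The letters before the .og extension therefore spell out the sorting steps in the order they were applied. A small sketch of decoding them, assuming only the three letters listed above (the function name describe_sort_order and the example file name are hypothetical):

```python
SORT_STEPS = {
    "Y": "PG-SGD",
    "g": "grooming",
    "s": "topological sort",
}


def describe_sort_order(filename: str) -> list[str]:
    """Return the sorting steps encoded just before the .og extension."""
    parts = filename.split(".")
    if len(parts) < 3 or parts[-1] != "og":
        raise ValueError("expected a name ending in .<steps>.og")
    # Each letter stands for one step; unknown letters raise KeyError.
    return [SORT_STEPS[letter] for letter in parts[-2]]


steps = describe_sort_order("input.fa.gz.gfaffix.unchop.Ygs.og")
```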

odgi unchop

odgi unchop merges unitigs into single nodes.

Output files
  • odgi_unchop/
    • <INPUT_FASTA>.*.unchop.og: Unchopped variation graph in ODGI format.
    • <INPUT_FASTA>.community.[0-9]{1,}.*.unchop.og: Community ODGI file with an unchopped variation graph of the specific community. Only appears when --communities is provided.

odgi view

odgi view can convert a graph in ODGI format to GFAv1.

Output files
  • FINAL_GFA/
    • <INPUT_FASTA>.*.view.og: Final graph in GFAv1 format.
    • <INPUT_FASTA>.community.[0-9]{1,}.*.view.og: Final community GFAv1 file of the specific community. Only appears when --communities is provided.

odgi viz

odgi viz visualizes a variation graph in 1D.

Output files
  • odgi_viz/
    • <INPUT_FASTA>.gfaffix.viz_*_multiqc.png: 1D visualization of a genome variation graph in PNG format ready to be put into a MultiQC report.
    • <INPUT_FASTA>.squeeze.viz_*_multiqc.png: 1D visualization of all communities combined in one graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.viz_*_multiqc.png: 1D visualization of a community genome variation graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.

odgi layout

odgi layout uses the PG-SGD algorithm to calculate a 2D layout of a variation graph. The layout in TSV format and a corresponding GFAv1 graph from folder FINAL_GFA can be loaded into waragraph for an interactive visualization.

Output files
  • odgi_layout/
    • <INPUT_FASTA>.gfaffix.tsv: 2D layout in TSV format.
    • <INPUT_FASTA>.gfaffix.lay: 2D layout in binary LAY format.
    • <INPUT_FASTA>.squeeze.tsv: 2D layout in TSV format of all communities combined in one graph. Only appears when --communities is provided.
    • <INPUT_FASTA>.squeeze.lay: 2D layout in binary LAY format of all communities combined in one graph. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.tsv: 2D layout in TSV format of a community genome variation graph. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.lay: 2D layout in binary LAY format of a community genome variation graph. Only appears when --communities is provided.

odgi draw

odgi draw takes a 2D graph layout in binary LAY format and a corresponding variation graph in GFAv1 or ODGI format and renders a static 2D visualization.

Output files
  • odgi_draw/
    • <INPUT_FASTA>.gfaffix.draw_multiqc.png: 2D visualization of a genome variation graph in PNG format ready to be put into a MultiQC report.
    • <INPUT_FASTA>.gfaffix.png: Low resolution 2D visualization of a genome variation graph in PNG format.
    • <INPUT_FASTA>.squeeze.draw_multiqc.png: 2D visualization of all communities combined in one graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.
    • <INPUT_FASTA>.squeeze.png: Low resolution 2D visualization of all communities combined in one graph in PNG format. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.draw_multiqc.png: 2D visualization of a community genome variation graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.
    • <INPUT_FASTA>.community.[0-9]{1,}.gfaffix.png: Low resolution 2D visualization of a community genome variation graph in PNG format. Only appears when --communities is provided.

odgi squeeze

odgi squeeze puts multiple variation graphs in one file.

Output files
  • FINAL_ODGI/
    • <INPUT_FASTA>.squeeze.og: All graphs of all communities combined in one graph in ODGI format. Only appears when --communities is provided.

vg

vg is the variation graph toolkit for data structures, interchange formats, alignment, genotyping, and variant calling methods of genome variation graphs.

vg deconstruct

vg deconstruct outputs VCF records for snarls present in a graph relative to one or several chosen reference path(s).

Output files
  • vg_deconstruct/
    • <INPUT_FASTA>.gfaffix.*.vcf: Variants in VCF format of the graph.
    • <INPUT_FASTA>.gfaffix.*.vcf.stats: Statistics of the variants in VCF format of the graph.
    • <INPUT_FASTA>.gfaffix.*.decomposed.vcf: Decomposed variants in VCF format of the graph.
    • <INPUT_FASTA>.gfaffix.*.decomposed.vcf.stats: Statistics of the decomposed variants in VCF format of the graph.
    • <INPUT_FASTA>.squeeze.*.vcf: Variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
    • <INPUT_FASTA>.squeeze.*.vcf.stats: Statistics of the variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
    • <INPUT_FASTA>.squeeze.*.decomposed.vcf: Decomposed variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
    • <INPUT_FASTA>.squeeze.*.decomposed.vcf.stats: Statistics of the decomposed variants in VCF format of the graph containing all communities. Only appears when --communities is provided.

MultiQC

The ODGI table section of the MultiQC report may list two entries per input: one named after the sample and one labelled seqwish. The seqwish entry is the graph produced by seqwish. The entry that carries only the sample name is the final graph.

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
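
As a starting point for troubleshooting, the trace file can be summarised with a few lines of Python. The sketch below assumes that execution_trace.txt is tab-separated with a header row containing a status column, as in Nextflow's default trace fields; the function name count_task_status and the process names in the example are made up.

```python
import csv
import io
from collections import Counter


def count_task_status(trace_text: str) -> Counter:
    """Count tasks per status in a tab-separated Nextflow trace file."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    # Assumption: the trace has a "status" column (e.g. COMPLETED, FAILED).
    return Counter(row["status"] for row in reader)


example_trace = (
    "task_id\tname\tstatus\n"
    "1\tSTEP_A (sample)\tCOMPLETED\n"
    "2\tSTEP_B (sample)\tFAILED\n"
    "3\tSTEP_C (sample)\tCOMPLETED\n"
)
summary = count_task_status(example_trace)
```

For a real run, the text would be read from pipeline_info/execution_trace.txt instead of the inline example.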