nf-core/pangenome
Renders a collection of sequences into a pangenome graph. https://doi.org/10.1093/bioinformatics/btae609.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Input Check
- Community Detection
- wfmash
- seqwish
- smoothxg
- gfaffix
- odgi
- vg
- MultiQC - Aggregate report describing results and QC from the whole pipeline -> final report(s)!
- Pipeline information - Report metrics generated during the workflow execution
Input Check
bgzip
bgzip compresses a file into a series of small ‘BGZF’ blocks in a similar manner to, and compatible with, gzip. This allows indexes to be built against the compressed file and used to retrieve portions of the data without having to decompress the entire file.
Output files
tabix_bgzip/
<INPUT_FASTA>.gz
: The bgzip-compressed input FASTA file.
<INPUT_FASTA>.community.[0-9]{1,}.fa.gz
: A bgzip-compressed community FASTA file. Only appears when --communities is provided.
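Because every BGZF block is itself a complete gzip member, a plain gzip reader can decompress a bgzip file as one stream, while a single block can also be decompressed in isolation. The following is a minimal sketch of that idea using only Python's standard library. It concatenates independently compressed gzip members as a simplified stand-in for real BGZF blocks; the record names are made up for illustration.

```python
import gzip

# Two independently compressed chunks, standing in for BGZF blocks.
# Real bgzip output additionally stores the block size in a gzip extra
# field, which is what makes indexed random access possible.
blocks = [
    gzip.compress(b">seq1\nACGT\n"),
    gzip.compress(b">seq2\nTTGA\n"),
]
multi_member = b"".join(blocks)

# A plain gzip reader decodes the concatenation as a single stream.
fasta_text = gzip.decompress(multi_member).decode()

# Each block can also be decompressed on its own, without touching the rest.
second_record = gzip.decompress(blocks[1]).decode()
```

In practice the compressed FASTA files in tabix_bgzip/ can therefore be read with any gzip-compatible tool, while index-aware tools use the block structure to jump directly to a region.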
samtools faidx
samtools faidx indexes or queries regions from a FASTA file.
Output files
samtools_faidx/
<INPUT_FASTA>.fai
: FASTA index of the file provided via --input.
<INPUT_FASTA>.gzi
: Compressed FASTA index of the file provided via --input.
<INPUT_FASTA>.community.[0-9]{1,}.fa.gz.fai
: FASTA index of a community FASTA file. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.fa.gz.gzi
: Compressed FASTA index of a community FASTA file. Only appears when --communities is provided.
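A .fai index is a tab-separated text file with one line per sequence: name, length in bases, byte offset of the first base, bases per line, and bytes per line. As a rough sketch of how such an index could be inspected downstream, the helper below parses that layout; `parse_fai`, `FaiRecord`, and the example sequence names are hypothetical and not part of the pipeline.

```python
from typing import Dict, NamedTuple


class FaiRecord(NamedTuple):
    length: int      # sequence length in bases
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per FASTA line
    line_width: int  # bytes per FASTA line, including the newline


def parse_fai(text: str) -> Dict[str, FaiRecord]:
    """Parse the contents of a .fai file into a name -> record mapping."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, offset, line_bases, line_width = line.split("\t")[:5]
        records[name] = FaiRecord(
            int(length), int(offset), int(line_bases), int(line_width)
        )
    return records


# Hypothetical index contents for two sequences.
example = "sampleA#1#chr1\t1000\t16\t60\t61\nsampleB#1#chr1\t950\t1049\t60\t61\n"
index = parse_fai(example)
total_bases = sum(record.length for record in index.values())
```

Summing the length column like this is a quick way to check how much sequence went into the graph construction.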
Community Detection
paf2net
paf2net is a Python script that projects wfmash’s PAF mappings (the implied overlap and containment graph) into an edge list, a list of edge weights, and an ‘id to sequence name’ map.
Output files
paf2net/
<INPUT_FASTA>.paf.vertices.id2name.txt
: TXT file with a mapping of vertex identifiers to sequence names. Only appears when --communities is provided.
<INPUT_FASTA>.paf.edges.weights.txt
: TXT file with the weights of the edges. Only appears when --communities is provided.
<INPUT_FASTA>.paf.edges.list.txt
: TXT file listing all edge connections. Only appears when --communities is provided.
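To make the three outputs concrete, here is a minimal sketch of such a projection. It is not the paf2net script itself: the function name `paf_to_network` is hypothetical, and using the alignment block length (PAF column 11) as the edge weight is an assumption made purely for illustration. Only the standard PAF column layout (query name in column 1, target name in column 6) is relied upon.

```python
def paf_to_network(paf_lines):
    """Project PAF records into an edge list, edge weights and an id -> name map."""
    name_to_id = {}
    edges, weights = [], []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        query, target = fields[0], fields[5]
        block_length = int(fields[10])  # assumed weight, see lead-in
        for name in (query, target):
            # Assign vertex ids in order of first appearance.
            name_to_id.setdefault(name, len(name_to_id))
        edges.append((name_to_id[query], name_to_id[target]))
        weights.append(block_length)
    id_to_name = {vertex: name for name, vertex in name_to_id.items()}
    return edges, weights, id_to_name


# Two made-up mappings: seqA vs seqB and seqA vs seqC.
paf = [
    "seqA\t1000\t0\t900\t+\tseqB\t1100\t50\t950\t850\t900\t60",
    "seqA\t1000\t100\t600\t-\tseqC\t800\t0\t500\t450\t500\t60",
]
edges, weights, id_to_name = paf_to_network(paf)
```

The edge list and weights are parallel lists, which mirrors the separate edges.list.txt and edges.weights.txt files described above.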
net2communities
net2communities is a Python script that detects communities by applying the Leiden algorithm (Traag et al., Scientific Reports 2019).
Output files
net2communities/
<INPUT_FASTA>.community.[0-9]{1,}.txt
: TXT file with the sequence names of the community. Only appears when --communities is provided.
extract communities
extract communities is a locally modified module of samtools faidx. The original module only allowed for the indexing of FASTA files, but not for the extraction of single sequences.
Output files
extract_communities/
<INPUT_FASTA>.community.[0-9]{1,}.txt.fa
: A community FASTA file. Only appears when --communities is provided.
wfmash
wfmash is an aligner for pangenomes based on sparse homology mapping and wavefront inception. In this pipeline it is used in various processes and ways.
Output files
wfmash/
<INPUT_FASTA>.paf
: PAF file containing CIGAR strings of all pairwise alignments.
<INPUT_FASTA>.community.[0-9]{1,}.paf
: Community PAF file containing CIGAR strings of all pairwise alignments of the specific community. Only appears when --communities is provided.
wfmash map community
Here wfmash was applied in approximate mapping mode in order to create all-all alignments of all input sequences for their subsequent community discovery.
Output files
wfmash_map_community/
<INPUT_FASTA>.paf
: PAF file containing approximate mappings of all pairwise alignments.
wfmash map
Here wfmash was applied in approximate mapping mode in order to create all-all alignments of all input sequences. The base-pair-level alignment problem can then be split into several problems of roughly equal size with split approx mappings in chunks.
Output files
wfmash_map/
<INPUT_FASTA>.paf
: PAF file containing approximate mappings of all pairwise alignments.
<INPUT_FASTA>.community.[0-9]{1,}.paf
: Community PAF file containing approximate mappings of all pairwise alignments of the specific community. Only appears when --communities is provided.
split approx mappings in chunks
split approx mappings in chunks is a Python script that takes the approximate mappings, weighs each mapping by computing its length * (1 - estimated identity), and then creates N new files whose mapping sets have a similar sum of weights.
Output files
wfmash_map/
<INPUT_FASTA>.paf.chunk_[0-9]{1,}.paf
: PAF file containing base level alignments of a specific chunk.
<INPUT_FASTA>.community.[0-9]{1,}.paf.chunk_[0-9]{1,}.paf
: Community PAF file containing base level alignments of a specific chunk of the specific community. Only appears when --communities is provided.
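The weighting and balancing idea described above can be sketched as follows. This is not the pipeline's script: it uses a simple greedy strategy (always give the next-heaviest mapping to the currently lightest chunk), which is an assumption about one reasonable way to obtain similar weight sums. The function names and the (name, length, identity) tuples are hypothetical, with identity expressed as a fraction between 0 and 1.

```python
import heapq


def mapping_weight(length, identity):
    """Weight of one mapping: its length times (1 - estimated identity)."""
    return length * (1.0 - identity)


def split_into_chunks(mappings, n_chunks):
    """Distribute (name, length, identity) mappings over n_chunks lists
    so that the per-chunk weight sums end up similar."""
    chunks = [[] for _ in range(n_chunks)]
    # Min-heap of (current weight sum, chunk index).
    heap = [(0.0, index) for index in range(n_chunks)]
    by_weight = sorted(
        mappings, key=lambda m: mapping_weight(m[1], m[2]), reverse=True
    )
    for mapping in by_weight:
        total, index = heapq.heappop(heap)  # lightest chunk so far
        chunks[index].append(mapping)
        new_total = total + mapping_weight(mapping[1], mapping[2])
        heapq.heappush(heap, (new_total, index))
    return chunks


# Made-up mappings with weights 50, 50, 100 and 20.
mappings = [
    ("a", 100, 0.5),
    ("b", 200, 0.75),
    ("c", 400, 0.75),
    ("d", 40, 0.5),
]
chunks = split_into_chunks(mappings, 2)
chunk_sums = [sum(mapping_weight(m[1], m[2]) for m in chunk) for chunk in chunks]
```

Long, divergent mappings get the largest weights because they are the most expensive to align at base level, so balancing the weight sums roughly balances the runtime of the subsequent alignment jobs.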
wfmash align
Here wfmash was applied in base pair level alignment mode in order to refine the approximate all-all alignments of all input sequences.
Output files
wfmash_map/
<INPUT_FASTA>.chunk_[0-9]{1,}.paf
: PAF file containing base level alignments of a specific chunk.
<INPUT_FASTA>.community.[0-9]{1,}.chunk_[0-9]{1,}.paf
: Community PAF file containing base level alignments of a specific chunk of the specific community. Only appears when --communities is provided.
seqwish
seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments.
Output files
seqwish/
<INPUT_FASTA>.seqwish.gfa
: Raw pangenome graph induced from the all-versus-all alignments.
<INPUT_FASTA>.community.[0-9]{1,}.seqwish.gfa
: Community GFA file containing the raw pangenome graph induced from the all-versus-all alignments of the specific community. Only appears when --communities is provided.
smoothxg
smoothxg finds blocks of paths that are collinear within a variation graph. It applies partial order alignment to each block, yielding an acyclic variation graph. Then, to yield a “smoothed” graph, it walks the original paths to lace these subgraphs together. The resulting graph only contains cyclic or inverting structures larger than the chosen block size, and is otherwise manifold linear.
Output files
smoothxg/
<INPUT_FASTA>.smoothxg.gfa
: Smoothed pangenome graph in GFA format.
<INPUT_FASTA>.community.[0-9]{1,}.smoothxg.gfa
: Community GFA file containing the smoothed pangenome graph of the specific community. Only appears when --communities is provided.
gfaffix
gfaffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure.
Output files
gfaffix/
<INPUT_FASTA>.gfaffix.gfa
: Non-node-redundant pangenome graph in GFA format.
<INPUT_FASTA>.gfaffix.txt
: Graph shared affixes in TSV format.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.gfa
: Community GFA file containing the non-node-redundant pangenome graph of the specific community. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.txt
: Community TSV file containing the graph shared affixes of the specific community. Only appears when --communities is provided.
odgi
odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses. In this pipeline, a wide variety of odgi’s subcommands are used to process the built graphs. As a rule of thumb, all files in ODGI format end with .og.
odgi build
odgi build constructs a dynamic succinct variation graph in ODGI format from a GFAv1.
Output files
odgi_build/
<INPUT_FASTA>.*.og
: Graph in ODGI format.
<INPUT_FASTA>.community.[0-9]{1,}.*.og
: Community ODGI file of the specific community. Only appears when --communities is provided.
odgi stats
odgi stats describes various metrics of a variation graph.
Output files
odgi_stats/
<INPUT_FASTA>.*.og.stats.yaml
: YAML file with graph metrics.
<INPUT_FASTA>.community.[0-9]{1,}.*.og.stats.yaml
: Community YAML file with graph metrics of the specific community. Only appears when --communities is provided.
odgi sort
odgi sort sorts a succinct variation graph. It offers a diverse palette of sorting algorithms to determine the node order.
In the pipeline itself, this is the last tool invoked before the final graph in ODGI format is written on disk.
Output files
FINAL_ODGI/
<INPUT_FASTA>.*.Ygs.og
: Sorted variation graph in ODGI format.
<INPUT_FASTA>.community.[0-9]{1,}.*.Ygs.og
: Community ODGI file with a sorted variation graph of the specific community. Only appears when --communities is provided.
The order of the sortings:
Y
: PG-SGD
g
: grooming
s
: topological sort
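The sort code in the file name can be read as a small recipe: each letter names one sorting step, applied from left to right. The snippet below is an illustrative sketch only. It assumes that the sort code is the dot-separated field directly before the .og extension; the helper names and the example file name are hypothetical.

```python
# Letter -> sorting step, as listed above.
SORT_STEPS = {
    "Y": "PG-SGD",
    "g": "grooming",
    "s": "topological sort",
}


def sort_code_from_filename(filename):
    """Return the field right before '.og', e.g. 'Ygs' (assumed layout)."""
    parts = filename.split(".")
    if len(parts) < 3 or parts[-1] != "og":
        raise ValueError(f"not a sorted ODGI file name: {filename}")
    return parts[-2]


def describe_sort_pipeline(code):
    """Translate a sort code into the ordered list of sorting steps."""
    return [SORT_STEPS[letter] for letter in code]


# Hypothetical file name following the <INPUT_FASTA>.*.Ygs.og pattern.
code = sort_code_from_filename("input.fa.gz.gfaffix.unchop.Ygs.og")
steps = describe_sort_pipeline(code)
```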
odgi unchop
odgi unchop merges unitigs into single nodes.
Output files
odgi_unchop/
<INPUT_FASTA>.*.unchop.og
: Unchopped variation graph in ODGI format.
<INPUT_FASTA>.community.[0-9]{1,}.*.unchop.og
: Community ODGI file with an unchopped variation graph of the specific community. Only appears when --communities is provided.
odgi view
odgi view can convert a graph in ODGI format to GFAv1.
Output files
FINAL_GFA/
<INPUT_FASTA>.*.view.og
: Final graph in GFAv1 format.
<INPUT_FASTA>.community.[0-9]{1,}.*.view.og
: Final community GFAv1 file of the specific community. Only appears when --communities is provided.
odgi viz
odgi viz visualizes a variation graph in 1D.
Output files
odgi_viz/
<INPUT_FASTA>.gfaffix.viz_*_multiqc.png
: 1D visualization of a genome variation graph in PNG format ready to be put into a MultiQC report.
<INPUT_FASTA>.squeeze.viz_*_multiqc.png
: 1D visualization of all communities combined in one graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.viz_*_multiqc.png
: 1D visualization of a community genome variation graph ready to be put into a MultiQC report. Only appears when --communities is provided.
odgi layout
odgi layout uses the PG-SGD algorithm to calculate a 2D layout of a variation graph. The layout in TSV format and a corresponding GFAv1 graph from the folder FINAL_GFA can be loaded into waragraph for an interactive visualization.
Output files
odgi_layout/
<INPUT_FASTA>.gfaffix.tsv
: 2D layout in TSV format.
<INPUT_FASTA>.gfaffix.lay
: 2D layout in binary LAY format.
<INPUT_FASTA>.squeeze.tsv
: 2D layout in TSV format of all communities combined in one graph. Only appears when --communities is provided.
<INPUT_FASTA>.squeeze.lay
: 2D layout in binary LAY format of all communities combined in one graph. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.tsv
: 2D layout in TSV format of a community genome variation graph. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.lay
: 2D layout in binary LAY format of a community genome variation graph. Only appears when --communities is provided.
odgi draw
odgi draw takes a 2D graph layout in binary LAY format and a corresponding variation graph in GFAv1 or ODGI format and renders a static 2D visualization.
Output files
odgi_draw/
<INPUT_FASTA>.gfaffix.draw_multiqc.png
: 2D visualization of a genome variation graph in PNG format ready to be put into a MultiQC report.
<INPUT_FASTA>.gfaffix.png
: Low resolution 2D visualization of a genome variation graph in PNG format.
<INPUT_FASTA>.squeeze.draw_multiqc.png
: 2D visualization of all communities combined in one graph in PNG format ready to be put into a MultiQC report. Only appears when --communities is provided.
<INPUT_FASTA>.squeeze.png
: Low resolution 2D visualization of all communities combined in one graph in PNG format. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.draw_multiqc.png
: 2D visualization of a community genome variation graph ready to be put into a MultiQC report. Only appears when --communities is provided.
<INPUT_FASTA>.community.[0-9]{1,}.gfaffix.png
: Low resolution 2D visualization of a community genome variation graph. Only appears when --communities is provided.
odgi squeeze
odgi squeeze puts multiple variation graphs in one file.
Output files
FINAL_ODGI/
<INPUT_FASTA>.squeeze.og
: All graphs of all communities combined in one graph in ODGI format. Only appears when --communities is provided.
vg
vg is the variation graph toolkit for data structures, interchange formats, alignment, genotyping, and variant calling methods of genome variation graphs.
vg deconstruct
vg deconstruct outputs VCF records for snarls present in a graph relative to one or several chosen reference path(s).
Output files
vg_deconstruct/
<INPUT_FASTA>.gfafix.*.vcf
: Variants in VCF format of the graph.
<INPUT_FASTA>.gfafix.*.vcf.stats
: Statistics of the variants in VCF format of the graph.
<INPUT_FASTA>.gfafix.*.decomposed.vcf
: Decomposed variants in VCF format of the graph.
<INPUT_FASTA>.gfafix.*.decomposed.vcf.stats
: Statistics of the decomposed variants in VCF format of the graph.
<INPUT_FASTA>.squeeze.*.vcf
: Variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
<INPUT_FASTA>.squeeze.*.vcf.stats
: Statistics of the variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
<INPUT_FASTA>.squeeze.*.decomposed.vcf
: Decomposed variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
<INPUT_FASTA>.squeeze.*.decomposed.vcf.stats
: Statistics of the decomposed variants in VCF format of the graph containing all communities. Only appears when --communities is provided.
MultiQC
In the ODGI table section of the MultiQC report, two entries may appear: the actual sample name and seqwish.
The seqwish sample is the graph that was produced by seqwish.
The named sample, whose name contains only the sample name, is the final graph, produced by the following chain:
- seqwish -> smoothxg -> gfaffix -> odgi build -> odgi unchop -> odgi sort
Output files
multiqc/
multiqc_report.html
: A standalone HTML file that can be viewed in your web browser.
multiqc_data/
: Directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/
: Directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.