nf-core/diseasemodulediscovery
A pipeline for network-based disease module identification.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
File naming
Many output files are named based on the combination of seed file, network file, disease module discovery method, and drug prioritization method used:
<seeds>
: The name of the input seed file without the file ending.
<network>
: The name of the input network file without the file ending.
<amim>
: The name of the disease module discovery method (active module identification method, AMIM).
<drug_algorithm>
: The name of the drug prioritization algorithm.
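As an illustration of this naming scheme, the following minimal sketch assembles a module file name from its components. The helper names and the example file names are hypothetical, and stripping only the final file ending is an assumption about how "without the file ending" is applied.

```python
def stem(file_name: str) -> str:
    """Drop the final file ending, e.g. 'seeds.tsv' -> 'seeds' (assumed behaviour)."""
    return file_name.rsplit(".", 1)[0] if "." in file_name else file_name


def module_file_name(seeds: str, network: str, amim: str, extension: str) -> str:
    """Join the naming components with dots: <seeds>.<network>.<amim>.<extension>."""
    return ".".join([seeds, network, amim, extension])


# Hypothetical input files, used only to show the resulting pattern.
name = module_file_name(stem("asthma_seeds.tsv"), stem("ppi.csv"), "diamond", "gt")
print(name)  # asthma_seeds.ppi.diamond.gt
```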
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Input preparation
- Disease module inference
- Disease module evaluation
- Drug prioritization
- Other
- Reporting
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
Input preparation
Prepare network
The graph-tool library is used to parse the input network(s) into the .gt format, the internal representation used for networks within the pipeline. Additionally, it is used to generate networks in the specific formats required by the various disease module inference methods. This step also gathers summary statistics for the MultiQC report, including the number of nodes and edges, the network diameter, the number of connected components, the size of the largest connected component, the count of self-loops (nodes with edges to themselves), and the number of duplicate edges (multiple edges connecting the same two nodes).
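To make some of these statistics concrete, here is a minimal pure-Python sketch that computes them from an undirected edge list. It is not the pipeline's graph-tool code; the function name and the exact counting conventions (for example, counting each extra copy of an edge as one duplicate) are assumptions, and the diameter is omitted for brevity.

```python
from collections import Counter, defaultdict


def network_summary(edges):
    """Summarise an undirected edge list given as (source, target) pairs."""
    nodes = {node for edge in edges for node in edge}
    self_loops = sum(1 for u, v in edges if u == v)

    # Treat (u, v) and (v, u) as the same undirected edge.
    edge_counts = Counter(tuple(sorted(edge)) for edge in edges)
    duplicate_edges = sum(count - 1 for count in edge_counts.values())

    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Depth-first search to collect the size of each connected component.
    seen, component_sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        component_sizes.append(size)

    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "self_loops": self_loops,
        "duplicate_edges": duplicate_edges,
        "components": len(component_sizes),
        "largest_component": max(component_sizes, default=0),
    }


summary = network_summary(
    [("A", "B"), ("B", "A"), ("B", "C"), ("D", "D"), ("E", "F")]
)
print(summary)
```

Because the sketch starts from an edge list, nodes without any edge are not represented; a graph object that stores nodes explicitly would also count those.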
Output files
input/networks/
<network>.gt
: Parsed input network in .gt format.
<network>.domino.sif
: Input network in the format required for DOMINO. Only created if the method is used.
<network>.diamond.csv
: Input network in the format required for DIAMOnD. Only created if the method is used.
<network>.robust.tsv
: Input network in the format required for ROBUST or ROBUST (bias aware). Only created if one of these methods is used.
<network>.rwr.csv
: Input network in the format required for RWR. Only created if the method is used.
mqc_summaries/
input_network_mqc.tsv
: Network summary statistics for the MultiQC report.
Check seed file
The format of the input seed file(s) is validated, and any seed nodes not present in the corresponding input network are removed. The filtered seed file(s) are then used in subsequent pipeline steps, and a summary of retained and discarded seed nodes is included in the MultiQC report.
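The filtering step can be sketched as follows. This is a minimal illustration, not the pipeline's implementation; the function name and the example identifiers are hypothetical.

```python
def filter_seeds(seeds, network_nodes):
    """Split seeds into those present in the network and those that are not."""
    network_nodes = set(network_nodes)
    retained = [seed for seed in seeds if seed in network_nodes]
    removed = [seed for seed in seeds if seed not in network_nodes]
    return retained, removed


retained, removed = filter_seeds(
    seeds=["P1", "P2", "P9"],
    network_nodes=["P1", "P2", "P3", "P4"],
)
print(retained)  # ['P1', 'P2']
print(removed)   # ['P9']
```

The two lists correspond to the filtered seed file and the file of removed seed nodes described below.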
Output files
input/seeds/
<seeds>.<network>.tsv
: Filtered seed file containing only nodes present in the corresponding network.
<seeds>.<network>.removed.tsv
: Seed nodes that were removed because they are not present in the corresponding network.
mqc_summaries/
input_network_mqc.tsv
: Summary about retained and dropped seed nodes for the MultiQC report.
Disease module inference
The inferred disease modules are exported in multiple formats, including .gt, .graphml, and node and edge lists in .tsv. If a method returns only a node list rather than a full network, the connecting edges are extracted from the input network. Module nodes are annotated with their seed status (is_seed), their subnetwork participation degree (spd), and a component identifier (component_id) to indicate which connected component they belong to. Additionally, tool-specific node properties are added, which are explained in the sections below.
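The edge extraction for methods that return only a node list can be sketched as taking the induced subgraph of the input network. The snippet below is a simplified illustration under that assumption; the function names are hypothetical, and only the is_seed annotation is shown.

```python
def induced_edges(module_nodes, network_edges):
    """Keep the network edges whose two endpoints both lie in the module."""
    module_nodes = set(module_nodes)
    return [
        (u, v) for u, v in network_edges if u in module_nodes and v in module_nodes
    ]


def annotate_seed_status(module_nodes, seeds):
    """Attach an is_seed flag to every module node."""
    seeds = set(seeds)
    return {node: {"is_seed": node in seeds} for node in module_nodes}


network_edges = [("S1", "A"), ("A", "S2"), ("S2", "X"), ("B", "C")]
module_nodes = ["S1", "A", "S2"]

module_edges = induced_edges(module_nodes, network_edges)
node_annotations = annotate_seed_status(module_nodes, seeds=["S1", "S2"])
print(module_edges)      # [('S1', 'A'), ('A', 'S2')]
print(node_annotations)
```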
Only seeds
In addition to the inferred disease modules, the pipeline provides a dummy module inference method that returns only the seed nodes and the edges connecting them in the input network. This serves as a baseline, enabling comparisons between modules containing additional nodes and the core set of seed nodes.
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.no_tool.{gt,graphml,nodes.tsv,edges.tsv}
: Module containing only the seed nodes in different formats.
DOMINO
DOMINO starts by partitioning the input network into disjoint slices using Louvain clustering. Slices that are enriched for seed nodes, as determined by a hypergeometric test, are selected for further analysis. The selected slices are refined by solving the Prize Collecting Steiner Tree (PCST) problem, and subsequently subdivided into putative modules containing no more than 10 nodes each. Each resulting module is then tested again for seed enrichment using a hypergeometric test. DOMINO can produce multiple non-overlapping modules for a single input seed set. The pipeline reports all modules in a single output file, with individual modules distinguished by the node property submodule.
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.domino.{gt,graphml,nodes.tsv,edges.tsv}
: DOMINO module in different formats.
DIAMOnD
DIAMOnD iteratively expands the initial set of seed nodes by adding one node at a time. At each step, the algorithm selects the node with the highest connectivity significance to the current seed set, as determined by a hypergeometric test. This process continues until a predefined number of nodes have been incorporated. DIAMOnD returns only the nodes added to the module, so the pipeline appends the original seed nodes to the module at the end. Each node is annotated with the order in which it was added (rank) and the corresponding hypergeometric test p-value (p_hyper). Both values are 0 for seed nodes.
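A single expansion step can be sketched with a hypergeometric tail probability. This is a simplified illustration rather than the DIAMOnD implementation: the function names and the toy network are hypothetical, and refinements of the published algorithm (such as seed weighting) are left out.

```python
from math import comb


def connectivity_p_value(n_nodes, n_seeds, degree, seed_links):
    """P(X >= seed_links) for a node with `degree` links in a network of
    `n_nodes` nodes, of which `n_seeds` are seeds (hypergeometric tail)."""
    upper = min(degree, n_seeds)
    favourable = sum(
        comb(n_seeds, k) * comb(n_nodes - n_seeds, degree - k)
        for k in range(seed_links, upper + 1)
    )
    return favourable / comb(n_nodes, degree)


def most_significant_candidate(adjacency, seeds):
    """Return the non-seed neighbour of the seed set with the smallest p-value."""
    seeds = set(seeds)
    candidates = {n for s in seeds for n in adjacency[s]} - seeds
    p_values = {
        node: connectivity_p_value(
            n_nodes=len(adjacency),
            n_seeds=len(seeds),
            degree=len(adjacency[node]),
            seed_links=len(adjacency[node] & seeds),
        )
        for node in candidates
    }
    best = min(p_values, key=lambda node: (p_values[node], node))
    return best, p_values


# Toy network: node A is linked to both seeds and to nothing else.
adjacency = {
    "S1": {"A", "B"}, "S2": {"A", "C"}, "A": {"S1", "S2"},
    "B": {"S1", "D", "E"}, "C": {"S2"}, "D": {"B"}, "E": {"B"},
}
best, p_values = most_significant_candidate(adjacency, seeds=["S1", "S2"])
print(best)  # A
```

Repeating this step, with the selected node added to the seed set each time, yields the ranking that is stored in the rank property.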
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.diamond.{gt,graphml,nodes.tsv,edges.tsv}
: DIAMOnD module in different formats.
ROBUST
ROBUST repeatedly connects seed nodes by solving the Prize Collecting Steiner Tree (PCST) problem. In each iteration, nodes that were included in previous solutions are penalized, lowering their chance of being selected again. The final disease module consists of nodes that appear in a sufficient number of solutions, enhancing robustness. ROBUST annotates module nodes with a connected component ID (connected_components_id), seed status (isSeed), the number of solutions the node participated in (nrOfOccurrences), the fraction of all solutions the node appeared in (significance), and the list of trees the node was part of (trees).
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.robust.{gt,graphml,nodes.tsv,edges.tsv}
: ROBUST module in different formats.
ROBUST (bias aware)
ROBUST (bias aware) follows the same strategy as ROBUST but increases edge costs for nodes that are frequently used as baits in PPI detection experiments. This penalization helps to mitigate study bias present in current PPI networks. The added node annotations are the same as for ROBUST.
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.robust_bias_aware.{gt,graphml,nodes.tsv,edges.tsv}
: ROBUST (bias aware) module in different formats.
RWR
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.rwr.{gt,graphml,nodes.tsv,edges.tsv}
: RWR module in different formats.
1st Neighbors
1st Neighbors includes all network nodes that are directly connected to at least one seed node.
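A minimal sketch of this method on an adjacency dictionary is shown below. The function name is hypothetical, and keeping the seed nodes themselves in the module is an assumption.

```python
def first_neighbors(adjacency, seeds):
    """Return the seeds plus every node directly connected to at least one seed."""
    module = set(seeds)
    for seed in seeds:
        # Seeds missing from the network contribute no neighbours.
        module.update(adjacency.get(seed, ()))
    return module


adjacency = {"S1": {"A", "B"}, "A": {"S1", "C"}, "B": {"S1"}, "C": {"A"}}
module = first_neighbors(adjacency, seeds=["S1"])
print(sorted(module))  # ['A', 'B', 'S1']
```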
Output files
modules/{gt,graphml,tsv_nodes,tsv_edges}/
<seeds>.<network>.firstneighbor.{gt,graphml,nodes.tsv,edges.tsv}
: 1st Neighbors module in different formats.
Disease module evaluation
g:Profiler
g:Profiler is used via the R package gprofiler2 to perform over-representation analysis (ORA), i.e., to find gene sets or pathways that are over-represented among the module nodes. For this, the module nodes are used as foreground, and all nodes of the corresponding input network as background.
Output files
Output file documentation is based on the nf-core module gprofiler2_gost.
evaluation/gprofiler/<seeds>.<network>.<amim>/
<seeds>.<network>.<amim>.gprofiler2.all_enriched_pathways.tsv
: Table listing all enriched pathways that were found.
<seeds>.<network>.<amim>.gprofiler2.gostplot.html
: Interactive Manhattan plot of all enriched pathways.
<seeds>.<network>.<amim>.gprofiler2.gostplot.png
: Manhattan plot of all enriched pathways.
<seeds>.<network>.<amim>.gprofiler2.gost_results.rds
: R object containing the results of the gost query.
<seeds>.<network>.<amim>.gprofiler2.<source>.sub_enriched_pathways.tsv
: Table listing enriched pathways that were found from one particular source.
<seeds>.<network>.<amim>.gprofiler2.<source>.sub_enriched_pathways.png
: Bar plot showing the fraction of genes that were found enriched in each pathway.
*ENSG_filtered.gmt
: GMT file that was provided as input or that was downloaded from g:Profiler if no input GMT file was given; filtered for the selected data sources.
R_sessionInfo.log
: Log file containing information about the R session that was run for this module.
DIGEST
DIGEST is a tool designed to evaluate the functional coherence of disease modules. It assumes that genes within a module should participate in related biological processes, as indicated by annotations from Gene Ontology (GO) — including Biological Process (GO.BP), Cellular Component (GO.CC), and Molecular Function (GO.MF) — as well as from KEGG pathways.
DIGEST is executed in two modes:
- Reference-free mode (subnetwork) – evaluates the functional coherence among all genes in the module.
- Reference-based mode (subnetwork-set) – compares the coherence of the original seed genes with the genes added during module expansion.
Both modes use Jaccard similarity to measure functional overlap and generate 1,000 random modules from the input network(s) to perform permutation-based significance testing. The resulting empirical p-values are summarized in the MultiQC report.
Output files
- evaluation/digest/{reference-free,reference-based}/<seeds>.<network>.<amim>/
<seeds>.<network>.<amim>_JI-based_p-value.png
: Scatter plot visualizing the empirical functional coherence p-values.
<seeds>.<network>.<amim>_p-value_validation.csv
: Table with empirical functional coherence p-values for each gene set/pathway source.
<seeds>.<network>.<amim>_input_validation.csv
: Table with the functional coherence scores.
<seeds>.<network>.<amim>_result.json
: Full results as JSON file.
<seeds>.<network>.<amim>_<source>_annotation_distribution.png
: Histogram showing the distribution of the number of associated gene sets/pathways for each query node.
<seeds>.<network>.<amim>_<source>_sankey.png
: Sankey plot showing the top 10 most frequent gene sets/pathways linked to the query nodes.
<seeds>.<network>.<amim>_JI-based_<source>_distribution.png
: Histogram of the distribution of the functional coherence score based on randomized data. The functional coherence score of the input is marked by a red vertical line.
<seeds>.<network>.<amim>_mappability.png
: Bar plot showing the fraction of query nodes that were mappable to the different gene set/pathway sources.
- mqc_summaries/
digest_{reference-free,reference-based}_mqc.tsv
: Summary of the empirical functional coherence p-values for the MultiQC report.
Network topology
The graph-tool library is used to compute summary statistics describing the topology of the disease modules. These include the number of nodes and edges, the count of the included seed nodes, the diameter, the number of connected components, the size of the largest component, the number of isolated nodes (nodes without edges), and the maximum shortest-path distance from any added node to its nearest seed node. These statistics are summarized in the General Statistics section of the MultiQC report.
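The last of these measures, the maximum distance from an added node to its nearest seed, can be sketched with a breadth-first search that starts from all seeds at once. This is an illustration only: the function name is hypothetical, distances are measured within the module (an assumption), and added nodes that cannot reach any seed are ignored.

```python
from collections import deque


def max_distance_to_seed(module_adjacency, seeds):
    """Largest shortest-path distance from an added node to its nearest seed."""
    seeds = set(seeds)
    # Multi-source BFS: every seed starts at distance 0.
    distance = {seed: 0 for seed in seeds if seed in module_adjacency}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        for neighbor in module_adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    added_nodes = [node for node in module_adjacency if node not in seeds]
    reachable = [distance[node] for node in added_nodes if node in distance]
    return max(reachable, default=0)


module_adjacency = {
    "S1": {"A"}, "A": {"S1", "B"}, "B": {"A"},
    "S2": {"C"}, "C": {"S2"},
}
print(max_distance_to_seed(module_adjacency, seeds=["S1", "S2"]))  # 2
```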
Output files
- mqc_summaries/
topology_mqc.tsv
: Network topology measures of the disease modules for the MultiQC report.
Overlap
The pipeline calculates pairwise overlaps between the node sets of all modules to assess similarities between different seed sets, networks, or methods. For each pair of modules, it reports both the number of shared nodes (|A ∩ B|) and the Jaccard similarity (|A ∩ B| / |A ∪ B|) of their node sets. To specifically assess similarities among the added nodes, the same measures are also computed on the sets A \ S and B \ S, where S denotes the set of seed nodes. The overlaps are visualized as heatmaps in the MultiQC report.
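The four overlap measures for one pair of modules can be sketched as follows. The function names and the dictionary keys are hypothetical; treating the Jaccard similarity of two empty sets as 0 is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(module_a, module_b, seeds):
    """Shared-node counts and Jaccard similarities, with and without seeds."""
    a, b, s = set(module_a), set(module_b), set(seeds)
    return {
        "shared_nodes": len(a & b),
        "jaccard": jaccard(a, b),
        "shared_nodes_no_seeds": len((a - s) & (b - s)),
        "jaccard_no_seeds": jaccard(a - s, b - s),
    }


result = pairwise_overlap(
    module_a={"S1", "S2", "X", "Y"},
    module_b={"S1", "S2", "Y", "Z"},
    seeds={"S1", "S2"},
)
print(result)
```

In this example the two modules share three nodes (Jaccard 0.6), but only one added node (Jaccard 1/3), which shows how shared seeds can inflate the apparent similarity.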
Output files
- mqc_summaries/
jaccard_similarity_matrix_mqc.tsv
: Pairwise Jaccard similarities between the node sets of the disease modules for the MultiQC report.
jaccard_similarity_no_seeds_matrix_mqc.tsv
: Pairwise Jaccard similarities between the node sets of the disease modules for the MultiQC report, excluding the seed nodes from the calculation.
shared_nodes_matrix_mqc.tsv
: Pairwise counts of shared nodes between the node sets of the disease modules for the MultiQC report.
shared_nodes_no_seeds_matrix_mqc.tsv
: Pairwise counts of shared nodes between the node sets of the disease modules for the MultiQC report, excluding the seed nodes from the calculation.
Seed permutation
If the --run_seed_permutation parameter is set, the pipeline runs a leave-one-out analysis to check how robust a module discovery method is against small changes in the seed set and to calculate a rediscovery rate. Starting with the original seed set, the pipeline creates new versions of the set by leaving out one seed at a time. For each of these perturbed seed sets, a new disease module is inferred using the same method.
Robustness
Each perturbed module is compared to the original module to see how similar they are, using the Jaccard index (|A ∩ B| / |A ∪ B|) of the node sets. The higher the Jaccard similarities, the more robust the module is to small input perturbations.
The corresponding distribution, as well as its mean value, is part of the MultiQC report.
Rediscovery rate
This procedure also allows calculation of a seed rediscovery rate — the likelihood that a left-out seed is added back into the resulting module. This metric reflects how well the method can recover disease-associated genes or proteins that were not provided in the input. On the other hand, if the rediscovery rate is consistently low across different methods, it may indicate that the left-out seed has weak or uncertain relevance.
Because larger modules are more likely to include an omitted seed by chance, the normalized rediscovery rate is adjusted for module size, i.e., divided by the number of nodes in the original module. This makes the rediscovery measure fairer and easier to compare across different modules.
Both normalized and raw rediscovery rates are summarized in the General Statistics section of the MultiQC report.
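The leave-one-out procedure and the three summary values can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and a first-neighbour expansion stands in for the actual module discovery method.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leave_one_out(seeds, infer_module):
    """Re-infer the module with each seed left out once and summarise the results."""
    original = set(infer_module(seeds))
    rows = []
    for left_out in seeds:
        perturbed_seeds = [seed for seed in seeds if seed != left_out]
        perturbed = set(infer_module(perturbed_seeds))
        rows.append({
            "left_out": left_out,
            "jaccard": jaccard(original, perturbed),
            "rediscovered": left_out in perturbed,
        })
    raw_rate = sum(row["rediscovered"] for row in rows) / len(rows)
    return {
        "mean_jaccard": sum(row["jaccard"] for row in rows) / len(rows),
        "rediscovery_rate": raw_rate,
        # Normalized as described above: divided by the original module size.
        "normalized_rediscovery_rate": raw_rate / len(original),
        "details": rows,
    }


# Stand-in method (assumption): seeds plus their direct neighbours.
adjacency = {
    "S1": {"A", "S2"}, "A": {"S1", "S2"}, "S2": {"A", "B", "S1"},
    "B": {"S2", "S3"}, "S3": {"B"},
}

def toy_method(seeds):
    return set(seeds).union(*(adjacency[seed] for seed in seeds))

summary = leave_one_out(["S1", "S2", "S3"], toy_method)
print(summary["mean_jaccard"], summary["rediscovery_rate"])
```

Here S1 and S2 are recovered through each other, while S3 is not, so the raw rediscovery rate is 2/3 and the normalized rate is 2/3 divided by the five nodes of the original module.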
Output files
- evaluation/seed_permutation/<seeds>.<network>.<amim>/
<seeds>.<network>.<amim>.seed_permutation_evaluation_summary.tsv
: Leave-one-out analysis results aggregated across all iterations. Includes the mean Jaccard index, raw rediscovery rate, and normalized rediscovery rate.
<seeds>.<network>.<amim>.seed_permutation_evaluation_detailed.tsv
: Leave-one-out analysis results on the level of individual iterations. Includes the mean Jaccard index, raw rediscovery rate, and normalized rediscovery rate.
- evaluation/seed_permutation/
<seeds>.<network>.robustness.{png,pdf}
: Heatmap visualizing the robustness (indicated by the Jaccard index) of different AMIMs on the level of individual seed nodes. Rows are sorted by the row sum, columns are sorted by the column sum.
<seeds>.<network>.robustness.tsv
: Table reporting the robustness (indicated by the Jaccard index) of different AMIMs on the level of individual seed nodes.
<seeds>.<network>.seed_rediscovery.{png,pdf}
: Heatmap visualizing whether different AMIMs were able to recover individual seeds. Rows are sorted by the row sum, columns are sorted by the column sum.
<seeds>.<network>.seed_rediscovery.tsv
: Table reporting whether different AMIMs were able to recover individual seeds.
- mqc_summaries/
seed_permutation_mqc.tsv
: Summaries of the mean Jaccard index, raw rediscovery rate, and normalized rediscovery rate for the MultiQC report.
seed_permutation_jaccard_mqc.yaml
: Jaccard index distributions for the MultiQC report.
Network permutation
Use the --run_network_permutation option to repeatedly rewire the edges of the input network while preserving each node’s degree. The pipeline then reruns the module identification methods on these permuted networks.
The network rewiring is performed using the graph-tool function random_rewire with "constrained-configuration" as the model and 100 full sweeps over all edges.
If the resulting modules are similar to those from the original network (indicated through a high Jaccard index), this indicates that the methods rely mainly on node degree rather than on the specific edge connections.
The corresponding distribution, as well as its mean value, is part of the MultiQC report.
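To illustrate what degree-preserving rewiring means, the following pure-Python sketch performs repeated double-edge swaps. It is a stand-in for the idea, not the graph-tool random_rewire routine used by the pipeline; the function names are hypothetical, and the input is assumed to be a simple graph without self-loops or duplicate edges.

```python
import random
from collections import Counter


def degree_preserving_rewire(edges, n_sweeps=100, seed=0):
    """Shuffle edges by double-edge swaps: a-b, c-d becomes a-d, c-b."""
    rng = random.Random(seed)
    edges = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edges}
    # One sweep attempts as many swaps as there are edges.
    for _ in range(n_sweeps * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or change nothing
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def degrees(edges):
    """Count how many edge endpoints fall on each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


original_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
rewired_edges = degree_preserving_rewire(original_edges)
print(degrees(original_edges) == degrees(rewired_edges))  # True
```

Every accepted swap removes one edge from each of the four nodes involved and adds one back, so all node degrees stay the same while the specific connections change.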
Output files
- mqc_summaries/
network_permutation_mqc.tsv
: Summaries of the mean Jaccard index for the MultiQC report.
network_permutation_jaccard_mqc.yaml
: Jaccard index distributions for the MultiQC report.
- input/permuted_networks/<network>/
<network>.*.gt
: The rewired networks. Can be reused for repeated analyses.
Drug prioritization
Drugst.One
The Drugst.One Python package identifies potential drug candidates targeting nodes in the disease modules. To prioritize these compounds, different algorithms are available:
- Degree centrality – ranks compounds by the number of module nodes they target.
- Harmonic centrality – considers the average shortest path from each compound to all module nodes.
- TrustRank – uses network propagation to rank compounds based on their relevance within the network.
More details on these algorithms are available in the supplementary material of the Drugst.One publication.
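The simplest of these rankings, degree centrality, can be sketched without any network library. This is an illustration of the idea only, not the Drugst.One implementation; the function name, the drug identifiers, and the drug-target mapping are hypothetical.

```python
def rank_drugs_by_degree(drug_targets, module_nodes):
    """Rank drugs by the number of module nodes they target (highest first)."""
    module_nodes = set(module_nodes)
    scores = {
        drug: len(set(targets) & module_nodes)
        for drug, targets in drug_targets.items()
    }
    # Drop drugs without any target in the module; break ties alphabetically.
    hits = [(drug, score) for drug, score in scores.items() if score > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical drug-target mapping for illustration.
drug_targets = {
    "drug_a": {"P1", "P2", "P9"},
    "drug_b": {"P2"},
    "drug_c": {"P7"},
}
ranking = rank_drugs_by_degree(drug_targets, module_nodes={"P1", "P2", "P3"})
print(ranking)  # [('drug_a', 2), ('drug_b', 1)]
```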
Output files
- drug_prioritization/drugstone/
<seeds>.<network>.<amim>.<drug_algorithm>.drug_predictions.tsv
: Table containing disease module node annotations merged with the drug-prioritization results. The first columns correspond to the existing disease module node annotations. Additional columns indicate the prioritized compounds through their DrugBank ID (drug_id), a prioritization score (score) that depends on the algorithm used, and the name of the compound (drug_name). If a single module node is targeted by multiple compounds,
<seeds>.<network>.<amim>.<drug_algorithm>.csv
: Table containing the raw Drugst.One request results.
Other
Drugst.One export
The MultiQC report lists export links for each disease module to visualize and manipulate them directly through the Drugst.One web interface.
Output files
- mqc_summaries/
drugstone_link_mqc.tsv
: Table with export links for the MultiQC report.
Network visualization
Visual network representations of the inferred modules — both with and without assigned drugs — are generated using the graph-tool package and are available in PNG, SVG, and PDF formats. Additionally, interactive HTML visualizations are produced using the pyvis package. Coloring distinguishes seed nodes from the added module nodes.
Output files
- results/modules_visualized/{html,pdf,svg,png}/
<seeds>.<network>.<amim>.{html,pdf,svg,png}
: Network visualizations of the disease modules in different formats.
- results/modules_visualized_with_drugs/{html,pdf,svg,png}/
<seeds>.<network>.<amim>.{html,pdf,svg,png}
: Network visualizations of the disease modules in different formats, including drug nodes. Only generated if drug prioritization was performed.
Annotation
The disease modules are annotated with supplementary biological information queried from the NeDRex database. The annotated modules are saved in BioPax (Biological Pathway Exchange), a standard language and format for representing biological pathway knowledge. The format of the files is validated using the BioPax Validator.
The resulting files have the following structure:
Entities
- Proteins with UnificationXref and ProteinReference
- Genes with UnificationXref
- SmallMolecules as drugs with UnificationXref and SmallMoleculeReference
Relationships
- Protein encoded by gene as RelationshipXref
- Gene associated with disorder as RelationshipXref
- Drug targets protein as RelationshipXref
- Drug has side effects as RelationshipXref
- Protein is in a cellular component as RelationshipXref
Interactions
- Protein interactions as MolecularInteraction
If the network uses UniProt IDs, only those proteins are used and mapped to their encoding genes. If Entrez IDs are given, all proteins encoded by those genes are considered.
Output files
- modules/biopax/
<seeds>.<network>.<amim>.owl
: BioPax file for each module.
- reports/
biopax-validator-report.html
: HTML file with the BioPax Validator results that can be viewed in your web browser.
Reporting
MultiQC
Output files
multiqc/
multiqc_report.html
: A standalone HTML file that can be viewed in your web browser.
multiqc_data/
: Directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/
: Directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.