Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the report, which summarizes results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Query identification

Output files
  • seqinfo/
    • *_id.txt: File containing the UniProt identifier of the query or the closest BLAST hit.
    • *_taxid.txt: File containing NCBI taxon ID of the query/closest hit.
    • *_exact.txt: File containing information on whether the query was found in the database (true), or the output is the top BLAST hit (false).

Query information necessary for the later steps is obtained here. If a sequence was passed, it is identified using OMA. A UniProt identifier is obtained, along with an indication of whether it was an exact or closest match. For either query type, an NCBI taxon ID is obtained using the OMA API.
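
A minimal sketch of what this step extracts, assuming a parsed OMA-style API response. The JSON field names below ("canonicalid", "species", "taxon_id") are illustrative assumptions, not the actual OMA schema; consult the OMA REST API documentation for the real layout.

```python
# Sketch: pulling the query information out of an OMA-style API response.
# NOTE: the field names here are assumptions for illustration only.

def extract_query_info(response: dict) -> tuple[str, int]:
    """Return (uniprot_id, taxon_id) from a parsed protein response."""
    uniprot_id = response["canonicalid"]
    taxon_id = response["species"]["taxon_id"]
    return uniprot_id, taxon_id

# Mocked response for a human BicD2 query:
mock = {"canonicalid": "BICD2_HUMAN", "species": {"taxon_id": 9606}}
print(extract_query_info(mock))  # ('BICD2_HUMAN', 9606)
```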

Ortholog fetching

Output files
  • orthologs/
    • [dbname]/
      • *_[dbname]_group.csv: A CSV file with the hits from the database. It has an additional column necessary for later merging.

Ortholog predictions are fetched from the databases. Each database can be queried online or from a local copy, depending on which access modes it supports. The databases currently supported are:

  • OMA (online and local)
  • PANTHER (online and local)
  • OrthoInspector (online)
  • EggNOG (local)

Ortholog scoring

Output files
  • orthologs/
    • merge_csv/
      • *.csv: A merged CSV file with predictions from all the databases.
    • score_table/
      • *_score_table.csv: A merged CSV with a score column added. The score is the number of databases supporting the prediction.

In this step, the predictions are combined into a single table and assigned a score, which is used for later filtering. The score is the number of sources supporting the prediction.
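
A minimal sketch of the scoring logic, assuming each database yields a set of predicted ortholog IDs. All names here are illustrative, not the pipeline's own code.

```python
# Sketch of the scoring step: each database contributes a set of predicted
# ortholog IDs, and a prediction's score is the number of databases that
# support it.
from collections import Counter

def score_predictions(predictions: dict[str, set[str]]) -> dict[str, int]:
    """Map each predicted ortholog ID to its number of supporting databases."""
    counts: Counter = Counter()
    for ids in predictions.values():
        counts.update(ids)
    return dict(counts)

preds = {
    "oma": {"A", "B", "C"},
    "panther": {"B", "C"},
    "eggnog": {"C"},
}
print(score_predictions(preds)["C"])  # 3
```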

Ortholog filtering

Output files
  • orthologs/
    • filter_hits/
      • *_minscore_*.txt: Lists of predictions passing different score thresholds, from 1 to the number of sources. For example, BicD2_minscore_2.txt would include orthologs of BicD2 supported by at least 2 sources.
      • *_centroid.txt: A list of predictions from the source with the highest agreement with other sources.
      • *_filtered_hits.txt: The final list of orthologs, chosen based on user-defined criteria.

In this step, the predictions are split into lists by minimum score, one for each level of support. Additionally, the source with the highest total agreement with the other sources is identified.

The final list of orthologs is determined in one of two ways. If --use_centroid is set, the highest-agreement source will be used. Otherwise, orthologs with a score higher than --min_score are used.
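
Both selection modes can be sketched as follows. The use of the summed pairwise Jaccard index as the agreement metric is an assumption for illustration; the pipeline's actual metric may differ.

```python
# Sketch of the filtering step. Min-score filtering keeps predictions whose
# score meets a threshold; the "centroid" is the source whose predictions
# agree most with the other sources (assumed metric: summed pairwise Jaccard).
from itertools import combinations

def filter_min_score(scores: dict[str, int], min_score: int) -> set[str]:
    """IDs whose score (number of supporting sources) is at least min_score."""
    return {oid for oid, s in scores.items() if s >= min_score}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def centroid_source(predictions: dict[str, set[str]]) -> str:
    """Source with the highest total pairwise agreement with the others."""
    agreement = dict.fromkeys(predictions, 0.0)
    for x, y in combinations(predictions, 2):
        j = jaccard(predictions[x], predictions[y])
        agreement[x] += j
        agreement[y] += j
    return max(agreement, key=agreement.get)

preds = {"oma": {"A", "B", "C"}, "panther": {"B", "C"}, "eggnog": {"C"}}
print(sorted(filter_min_score({"A": 1, "B": 2, "C": 3}, 2)))  # ['B', 'C']
print(centroid_source(preds))  # panther
```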

Ortholog plotting

Output files
  • orthologs/
    • plots/
      • *_supports.png: A bar plot representing the number of predictions from each source and the support of the predictions.
      • *_venn.png: A Venn diagram representing the intersections between databases.
      • *_jaccard.png: A tile plot representing the Jaccard index (pairwise agreement) between databases.

Plots summarizing the predictions and the agreement between the sources are generated using ggplot2.

Ortholog statistics

Output files
  • orthologs/
    • stats/
      • *_stats.yml: A YAML file containing ortholog statistics.
    • hits/
      • *_hits.yml: A YAML file containing hit counts per database.

The following statistics of the predictions are calculated:

  • percentage of consensus - the fraction of predictions supported by all the sources
  • percentage of privates - the fraction of predictions supported by only one source
  • goodness - the ratio of the actual sum of scores to the theoretical maximum (i.e. the number of databases times the number of predictions).
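
These statistics can be sketched directly from the scores described above (score = number of supporting databases). The function and key names are illustrative.

```python
# Sketch of the ortholog statistics, assuming scores as produced by the
# scoring step. Key names are illustrative, not the pipeline's YAML keys.

def ortholog_stats(scores: dict[str, int], n_databases: int) -> dict[str, float]:
    n = len(scores)
    return {
        # fraction of predictions supported by all sources
        "percent_consensus": sum(s == n_databases for s in scores.values()) / n,
        # fraction of predictions supported by only one source
        "percent_privates": sum(s == 1 for s in scores.values()) / n,
        # actual score sum over the theoretical maximum
        "goodness": sum(scores.values()) / (n_databases * n),
    }

stats = ortholog_stats({"A": 1, "B": 2, "C": 3}, n_databases=3)
print(round(stats["goodness"], 3))  # 0.667
```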

Sequence fetching

Output files
  • sequences/
    • *_orthologs.fa: A FASTA file containing all ortholog sequences that could be found.
    • *_seq_hits.txt: The list of all orthologs whose sequence was found.
    • *_seq_misses.txt: The list of all orthologs whose sequence was not found.

If downstream analysis is performed, the protein sequences of all orthologs are fetched in FASTA format. The primary source of sequences is OMA, owing to its fast API. IDs not found in OMA are queried against UniProt. Anything not found in UniProt is considered a miss.
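
The fallback cascade can be sketched as follows. The fetcher functions here are stand-ins; the real pipeline queries the OMA and UniProt web services.

```python
# Sketch of the sequence-fetching fallback: try each source in order,
# recording hits and misses. Fetchers are mocks, for illustration only.
from typing import Callable, Optional

def fetch_sequences(ids: list[str],
                    fetchers: list[Callable[[str], Optional[str]]]):
    """Return (hits, misses): hits maps ID -> sequence, misses lists the rest."""
    hits, misses = {}, []
    for oid in ids:
        for fetch in fetchers:
            seq = fetch(oid)
            if seq is not None:
                hits[oid] = seq
                break
        else:
            misses.append(oid)
    return hits, misses

# Mock fetchers standing in for the OMA and UniProt lookups:
oma_lookup = {"A": "MKT"}.get
uniprot_lookup = {"B": "MQL"}.get
print(fetch_sequences(["A", "B", "C"], [oma_lookup, uniprot_lookup]))
# ({'A': 'MKT', 'B': 'MQL'}, ['C'])
```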

Structure fetching

Output files
  • sequences/
    • *.pdb: PDB files with structures of the orthologs, obtained from AlphaFoldDB.
    • *_af_versions.txt: Versions of the AlphaFold structures.
    • *_str_hits.txt: The list of all orthologs whose structure was found.
    • *_str_misses.txt: The list of all orthologs whose structure was not found.

If --use_structures is set, the structures of the orthologs are obtained from AlphaFoldDB for use in the alignment. For a discussion of the suitability of AlphaFold structures for MSA, see Baltzis et al. 2022.

MSA

Output files
  • alignment/
    • *.aln: A multiple sequence alignment of the orthologs in Clustal format.

Multiple sequence alignment is performed using T-COFFEE. 3D-COFFEE mode is used if --use_structures is set. Otherwise, default mode is used.

Tree reconstruction

Output files
  • trees/
    • iqtree/
      • *.treefile: The IQ-TREE phylogeny in Newick format.
      • *.ufboot: Bootstrap trees, if generated.
    • fastme/
      • *.nwk: The FastME phylogeny in Newick format.
      • *.bootstrap: The bootstrap trees, if generated.
    • plots/
      • *_iqtree_tree.png: The IQ-TREE phylogeny as an image.
      • *_fastme_tree.png: The FastME phylogeny as an image.

The phylogeny can be constructed using maximum likelihood (IQ-TREE) or minimum evolution (FastME).

Report generation

Output files
  • *_dist/
    • *.html: The report in HTML format.
    • run.sh: A script to correctly open the report.
    • Other files necessary for the report.
  • multiqc/
    • multiqc_report.html: A MultiQC report containing a summary of all samples.

The report is generated per sample in the form of a React application. It must be served from localhost to work correctly; this can be done manually or with the provided run script.

A single MultiQC report is also generated. It contains a comparison of hit count and statistics for each sample, as well as a list of software versions used in the run.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These allow you to troubleshoot errors with the running of the pipeline, and also provide information such as launch commands, run times and resource usage.