Introduction

This document describes the output produced by the pipeline. All paths are relative to the top-level results directory specified with --outdir.

Pipeline overview

nf-core/mspepid processes mass spectrometry data through the following steps:

  1. Database preparation - optional entrapment database creation and decoy sequence generation
  2. Spectra preparation - decompression and vendor-format conversion to mzML
  3. Spectra identification - database search with Comet and/or Sage
  4. Rescoring - FDR-based PSM rescoring with Percolator and/or MS2Rescore

Database preparation

Decoy generation

Decoy protein sequences are automatically appended to the target database unless --skip_decoy_generation is set. Decoys are generated by the OpenMS DecoyDatabase tool, using sequence reversal as the default strategy.
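
To illustrate what sequence reversal means here, the following is a minimal Python sketch that appends one reversed decoy per target entry to a FASTA string. It is not how the pipeline does it (the pipeline calls the OpenMS DecoyDatabase tool); the function name reverse_decoys and the "DECOY_" header prefix are assumptions made for this example.

```python
def reverse_decoys(fasta_text, prefix="DECOY_"):
    """Return the FASTA text with one reversed decoy appended per target entry.

    The prefix is an assumption for illustration; check the decoy tag that
    your search engine configuration actually expects.
    """
    entries = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))

    # Targets first, then the reversed decoys appended at the end.
    out = [f">{h}\n{s}" for h, s in entries]
    out += [f">{prefix}{h}\n{s[::-1]}" for h, s in entries]
    return "\n".join(out) + "\n"
```

Multi-line sequences are joined before reversal, so the decoy is the reverse of the whole protein rather than of each wrapped line.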

Entrapment database

When --entrapment_fold is greater than 0, an entrapment database is created by FDRBench prior to decoy generation. Entrapment sequences (shuffled target proteins) are used for independent FDR estimation during benchmarking.


Spectra identification

For each enabled search engine, PSM results are written under a subdirectory named after the search engine (e.g. comet/ or sage/). All results are converted to a common format with psm-utils for downstream processing.

Comet

Output files
  • comet/
    • *.mzid - PSMs in mzIdentML standard format (default output of Comet)
    • *.comet.params - the final input parameters of the Comet run.
    • *.psmutils.tsv - PSMs in psm-utils TSV format, used as input to rescoring steps.
    • *.pin - Percolator input file (PIN format) containing search engine features.
    • Additional outputs can be activated; see the Comet module description.

Comet is a widely used open-source database search engine for tandem mass spectra.

Sage

Output files
  • sage/
    • *.results.sage.tsv - PSMs in Sage’s default TSV output format.
    • *.results.json - the final input parameters of the Sage run.
    • *.psmutils.tsv - PSMs in psm-utils TSV format, used as input to rescoring steps.
    • *.pin - Percolator input file (PIN format) containing search engine features.

Sage is a fast, Rust-based database search engine that scales efficiently to large datasets and large protein databases.
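
Both search engines write a *.pin file. As a rough sketch of how such a file could be inspected, the Python below parses PIN text and counts target and decoy PSMs. It assumes the usual Percolator tab-delimited layout: a header row, a Label column holding 1 for targets and -1 for decoys, and a trailing Proteins column that may spill over into several tab-separated fields. The function names and the example feature column are hypothetical.

```python
def read_pin(pin_text):
    """Parse Percolator PIN text into a list of dicts (one per PSM)."""
    lines = [ln for ln in pin_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    n_fixed = len(header) - 1  # every column except the trailing protein column
    rows = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        # Some PIN files carry an optional "DefaultDirection" row; skip it.
        if fields[0] == "DefaultDirection":
            continue
        row = dict(zip(header[:n_fixed], fields[:n_fixed]))
        # Everything after the fixed columns is a protein accession.
        row[header[-1]] = fields[n_fixed:]
        rows.append(row)
    return rows


def count_labels(rows):
    """Return (number of target PSMs, number of decoy PSMs)."""
    targets = sum(1 for r in rows if r["Label"] == "1")
    decoys = sum(1 for r in rows if r["Label"] == "-1")
    return targets, decoys
```

A quick check of the target-to-decoy ratio in a PIN file can help spot a database that was searched without decoys before rescoring is attempted.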


Rescoring

PSMs from each search engine are rescored independently. Output directories are nested under the search engine directory.
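
The nested layout can be navigated programmatically. The sketch below collects the rescored PSM files for one search engine and one rescoring tool, using the directory names (comet, sage, percolator, ms2rescore) and file suffixes given in this document. The function name rescoring_outputs is hypothetical, and outdir stands for whatever was passed to --outdir.

```python
from pathlib import Path

# Directory names as listed in this document.
ENGINES = ("comet", "sage")
RESCORERS = ("percolator", "ms2rescore")


def rescoring_outputs(outdir, engine, rescorer):
    """Return the target and decoy PSM files under <outdir>/<engine>/<rescorer>/."""
    if engine not in ENGINES:
        raise ValueError(f"unknown search engine: {engine}")
    if rescorer not in RESCORERS:
        raise ValueError(f"unknown rescoring tool: {rescorer}")
    base = Path(outdir) / engine / rescorer
    return {
        "targets": sorted(base.glob("*.target.psms")),
        "decoys": sorted(base.glob("*.decoy.psms")),
    }
```

If a rescoring step was not enabled, its directory will not exist and both lists come back empty.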

Percolator

Output files
  • <searchengine>/percolator/
    • *.target.psms - Target PSMs scored and filtered by Percolator.
    • *.decoy.psms - Decoy PSMs (used for FDR calibration).
    • Additional outputs can be activated; see the Percolator module description.

Percolator applies semi-supervised machine learning (using an SVM) to re-rank PSMs based on search engine scores and auxiliary features, then estimates FDR using a target-decoy competition approach.
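
The target-decoy idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Percolator's actual estimator: at each score threshold the FDR is approximated as the number of decoys divided by the number of targets scoring at least that well, and the q-value of a PSM is the smallest FDR at which it would still be accepted. The function name tdc_qvalues and the input layout are assumptions for this example.

```python
def tdc_qvalues(psms):
    """Compute simplified target-decoy q-values.

    psms: iterable of (score, is_decoy) pairs, where a higher score is better.
    Returns a list of (score, is_decoy, q_value) sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: minimum FDR at this rank or any lower-scoring rank,
    # which makes the values monotone along the ranking.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]
```

Filtering the result for non-decoy entries with a q-value at or below, say, 0.01 gives the set of target PSMs accepted at 1% estimated FDR under this simplified model.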

MS2Rescore / Tims2Rescore

Output files
  • <searchengine>/ms2rescore/
    • *.target.psms - Target PSMs rescored after MS2PIP feature augmentation.
    • *.decoy.psms - Decoy PSMs.
    • *.pin - The original PSMs after identification, with additional features created by MS2PIP.

MS2Rescore generates additional rescoring features by comparing observed fragment ion spectra to spectra predicted by MS2PIP. The augmented feature set is then passed to Percolator for final scoring. This typically increases the number of PSMs identified at a given FDR threshold, particularly for challenging samples such as immunopeptidomes or non-tryptic digests. If TIMS data is used as input, Tims2Rescore is applied automatically (this is the default behaviour of newer MS2Rescore implementations).

The MS2PIP fragmentation model is controlled by --ms2rescore_model (default: HCD).


Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.
    • Software versions used: nf_core_mspepid_software_versions.yml.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
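
As a small troubleshooting aid, the sketch below tallies task outcomes from the text of execution_trace.txt. It assumes the trace is tab-separated with a header row that includes a status column, which is the usual Nextflow layout but depends on how tracing is configured. The function name count_task_status is hypothetical.

```python
import csv
import io
from collections import Counter


def count_task_status(trace_text):
    """Count tasks per status value in Nextflow trace text.

    Assumes a tab-separated table with a header row containing "status".
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return Counter(row["status"] for row in reader)
```

For a real run, the text could be read with Path("pipeline_info/execution_trace.txt").read_text() relative to the results directory; any status other than COMPLETED or CACHED is a reasonable place to start looking for errors.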