Proteomics label-free quantification (LFQ) analysis pipeline
This document describes the output produced by the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- (optional) Conversion of spectra data to indexedMzML: using ThermoRawFileParser for Thermo Raw files, or OpenMS’ FileConverter if only the index is missing
- (optional) Decoy database generation for the provided DB (fasta) with OpenMS
- Database search with MSGF+ and/or Comet through OpenMS adapters
- Re-mapping potentially identified peptides to the input database for consistency and error-checking (using OpenMS’ PeptideIndexer)
- PSM rescoring using PSMFeatureExtractor and Percolator or a PeptideProphet-like distribution fitting approach in OpenMS
- If multiple search engines were chosen, the results are combined with OpenMS’ ConsensusID
- If multiple search engines were chosen, a combined FDR is calculated
- Single run PSM/Peptide-level FDR filtering
- If localization of modifications was requested, Luciphor2 is applied via the OpenMS adapter
- Protein inference and label-free quantification based on spectral counting or MS1 feature detection, alignment and integration with OpenMS’ ProteomicsLFQ. This step also applies an additional experiment-wide FDR filter at the protein level (and, if requested, at the peptide/PSM level).
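The target-decoy idea behind the FDR filtering steps above can be sketched roughly as follows. This is only an illustration and not the OpenMS implementation: the `Psm` record, the function names and the 0.01 threshold are hypothetical, the sketch assumes that a higher score means a better match, and it uses the simple decoys/targets estimate without any special handling of tied scores.

```python
from typing import List, NamedTuple, Tuple


class Psm(NamedTuple):
    """Hypothetical minimal PSM record; not an OpenMS data structure."""
    peptide: str
    score: float      # assumption: higher score = better match
    is_decoy: bool


def q_values(psms: List[Psm]) -> List[Tuple[Psm, float]]:
    """Return (psm, q_value) pairs sorted by descending score.

    The FDR at each rank is estimated as (#decoys / #targets) among all
    PSMs ranked at or above it. The q-value is the minimum FDR found at
    this rank or at any worse-scoring rank.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    fdrs = []
    decoys = targets = 0
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Make the values monotonic with a running minimum from the bottom up.
    running_min = float("inf")
    qvals = [0.0] * len(ranked)
    for i in range(len(ranked) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvals[i] = running_min
    return list(zip(ranked, qvals))


def filter_targets(psms: List[Psm], threshold: float = 0.01) -> List[Psm]:
    """Keep target PSMs whose q-value is at or below the threshold."""
    return [p for p, q in q_values(psms) if not p.is_decoy and q <= threshold]
```

Calling `filter_targets(psms, 0.01)` on a list of such records keeps only the target hits that pass a 1% cutoff. The pipeline itself applies this kind of filter per run and again experiment-wide, using the OpenMS tools listed above.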
A rough visualization follows:
By default, output is written to the $NXF_WORKSPACE/results folder. It consists of the following folders (follow the links for a more detailed description):
- logs (extended log files for all steps)
- pipeline_info (general Nextflow information)
- ptxqc (quality control)
Nextflow pipeline info
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports let you troubleshoot pipeline errors and provide further information such as launch commands, run times and resource usage.
- Reports generated by Nextflow:
- Reports generated by the pipeline:
- Documentation for interpretation of results in HTML format:
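As an illustration of how such reports can help with troubleshooting, the sketch below scans a Nextflow trace report for tasks that did not finish successfully. It assumes the trace is a tab-separated table with a header row containing `name`, `status` and `exit` columns, as in Nextflow's default trace layout. The function name is hypothetical, and the trace file's name and location depend on your configuration.

```python
import csv
from typing import Dict, Iterable, List


def unsuccessful_tasks(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return trace rows whose status is neither COMPLETED nor CACHED.

    `lines` is any iterable of text lines, e.g. an open file handle on a
    tab-separated Nextflow trace report (assumed layout, see above).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader
            if row.get("status") not in ("COMPLETED", "CACHED")]


if __name__ == "__main__":
    # Tiny inline example; the real trace file would be opened instead.
    sample = [
        "task_id\tname\tstatus\texit\n",
        "1\tsearch (1)\tCOMPLETED\t0\n",
        "2\trescoring (1)\tFAILED\t1\n",
    ]
    for task in unsuccessful_tasks(sample):
        print(task["name"], task["status"], task["exit"])
```

The extended log files in the logs folder are the natural next place to look for any task reported here.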
Intermediate output for the PSM/peptide-level filtered identifications per raw/mzML file in OpenMS’ internal idXML format.
ProteomicsLFQ main output
The proteomics_lfq folder contains the output of the pipeline without any statistical post-processing.
It is available in three different formats:
A consensusXML file as the closest representation of the internal data structures generated by OpenMS. Helpful for debugging and downstream processing with OpenMS tools.
MSstats-ready quantity table
A simple TSV file ready to be read by the OpenMStoMSstatsFormat function of the MSstats R package. It should hold the same quantities as the consensusXML, rearranged into a “long” table format with the additional information about the experimental design that MSstats uses.
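For a quick look at such a long-format table outside of R, the following sketch sums the intensities per protein and run. The column names `ProteinName`, `Run` and `Intensity` are assumptions based on the usual MSstats input layout, and the delimiter is a parameter because it may differ between versions. The plain sum is only meant for inspection; it is not the summarization that MSstats performs.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def protein_by_run(lines: Iterable[str],
                   delimiter: str = "\t") -> Dict[str, Dict[str, float]]:
    """Pivot a long quantity table into {protein: {run: summed intensity}}.

    Assumed columns: ProteinName, Run, Intensity. Rows with an empty or
    "NA" intensity are skipped.
    """
    table = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(lines, delimiter=delimiter):
        value = row["Intensity"]
        if value in ("", "NA"):
            continue
        table[row["ProteinName"]][row["Run"]] += float(value)
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {protein: dict(runs) for protein, runs in table.items()}
```

Passing an open file handle on the table (for example `protein_by_run(open(path))`, with `path` pointing at your own output file) yields one entry per protein, which makes missing runs easy to spot.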
The msstats folder contains MSstats’ post-processed quantities (e.g. after imputation and outlier removal) and statistical measures of significance for the tested contrasts of the given experimental design. It also includes basic plots of these results.
These results are only available if the experimental design contains more than one condition.
The mzTab from the proteomics_lfq folder, with its quantities replaced by the normalized and imputed quantities from MSstats. It may contain fewer quantities, since MSstats filters out proteins with too many missing values.
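To see which proteins were dropped by this filtering, the protein sections of the two mzTab files can be compared. The sketch below relies only on the general mzTab layout (tab-separated lines, a `PRH` header row naming the columns and `PRT` rows holding the proteins). The function name is hypothetical and the file paths are left to the reader.

```python
from typing import Iterable, Set


def protein_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect the protein accessions from the PRT rows of an mzTab file."""
    accession_col = None
    accessions = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "PRH":
            # The header row tells us where the accession column is.
            accession_col = fields.index("accession")
        elif fields[0] == "PRT" and accession_col is not None:
            accessions.add(fields[accession_col])
    return accessions
```

With both files open, `protein_accessions(original) - protein_accessions(processed)` gives the set of accessions that are present in the proteomics_lfq mzTab but missing from the MSstats-processed one.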
See MSstats vignette.
See MSstats vignette for groupComparisonPlots (Heatmap, VolcanoPlot and ComparisonPlot (per protein)).
If activated, the ptxqc folder will contain the report of the PTXQC R package based on the mzTab output of ProteomicsLFQ.
See PTXQC vignette. The report itself already describes the calculated and visualized QC metrics in detail.
PTXQC yaml config
The default YAML config that defines the structure of the QC report. If you need to restructure the report, edit this file and re-run PTXQC manually.