Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

General

Output files
  • *.mzTab
  • *.tsv

The mzTab output file follows the HUPO-PSI format and combines all information of the sample-condition group extracted from a database search throughout the pipeline. A detailed explanation of the respective entries is given in the HUPO-PSI mzTab specification. mzTab files are compatible with the PRIDE Archive proteomics data repository and can be uploaded as search files.

mzTab files contain many columns. A few of the most important ones are highlighted here:

PEP sequence accession best_search_engine_score[1] retention_time charge mass_to_charge peptide_abundance_study_variable[1]

Most importantly, in this format the Comet XCorr of each peptide identification is annotated in the best_search_engine_score[1] column, and the peptide quantities are annotated in the peptide_abundance_study_variable columns. If --skip_quantification is specified, best_search_engine_score[1] holds the Percolator q-value instead.
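As a rough illustration of how these columns can be read programmatically, the sketch below collects the peptide rows of an mzTab file. It relies on the mzTab convention that the peptide section has a header line starting with PEH and data lines starting with PEP, all tab-separated. The function name and the example values (sequence, accession, numbers) are made up for illustration and are not taken from pipeline output.

```python
import csv
import io


def read_mztab_peptides(handle):
    """Yield one dict per PEP row, keyed by the columns of the PEH header."""
    header = None
    # QUOTE_NONE: treat quote characters as ordinary text, split on tabs only.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0] == "PEH":
            header = row[1:]
        elif row[0] == "PEP" and header is not None:
            yield dict(zip(header, row[1:]))


# Made-up miniature mzTab content, for illustration only.
example = (
    "MTD\tmzTab-version\t1.0.0\n"
    "PEH\tsequence\taccession\tbest_search_engine_score[1]\tretention_time"
    "\tcharge\tmass_to_charge\tpeptide_abundance_study_variable[1]\n"
    "PEP\tSIINFEKL\tP01012\t2.5\t1200.5\t2\t482.27\t150000.0\n"
)

peptides = list(read_mztab_peptides(io.StringIO(example)))
for pep in peptides:
    # XCorr by default, Percolator q-value with --skip_quantification.
    print(pep["sequence"], pep["best_search_engine_score[1]"])
```

For a real file, the same function could be called with an open file handle, for example `open(path, newline="")`.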

The TSV output file is an alternative output of OpenMS comprising similar information to the mzTab output. A brief explanation of the structure is listed below. See the documentation of the format or the PSI documentation for more information about the annotated scores and the format.

MAP contains information about the different mzML files that were provided initially

#MAP id filename label size

RUN contains information about the search that was performed on each run

#RUN run_id score_type score_direction date_time search_engine_version parameters

PROTEIN contains information about the protein ids corresponding to the peptides that were detected (No protein inference was performed)

#PROTEIN score rank accession protein_description coverage sequence

UNASSIGNEDPEPTIDE contains information about PSMs that were identified but could not be assigned to a precursor feature on MS level 1 for quantification

#UNASSIGNEDPEPTIDE rt mz score rank sequence charge aa_before aa_after score_type search_identifier accessions FFId_category feature_id file_origin map_index spectrum_reference COMET:IonFrac COMET:deltCn COMET:deltLCn COMET:lnExpect COMET:lnNumSP COMET:lnRankSP MS:1001491 MS:1001492 MS:1001493 MS:1002252 MS:1002253 MS:1002254 MS:1002255 MS:1002256 MS:1002257 MS:1002258 MS:1002259 num_matched_peptides protein_references target_decoy

CONSENSUS contains information about precursor features that were identified in multiple runs (e.g. run 1-3 in this case)

#CONSENSUS rt_cf mz_cf intensity_cf charge_cf width_cf quality_cf rt_0 mz_0 intensity_0 charge_0 width_0 rt_1 mz_1 intensity_1 charge_1 width_1 rt_2 mz_2 intensity_2 charge_2 width_2 rt_3 mz_3 intensity_3 charge_3 width_3

PEPTIDE contains information about peptide hits that were identified and correspond to the consensus features described above

#PEPTIDE rt mz score rank sequence charge aa_before aa_after score_type search_identifier accessions FFId_category fea
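The section layout described above can be split into one table per section with a few lines of Python. This is a minimal sketch under two assumptions that should be checked against a real file: fields are tab-separated, and each data row starts with the section name without the leading "#" (so a "#MAP" header line is followed by rows starting with "MAP"). The function name and the example file content are hypothetical.

```python
import io
from collections import defaultdict


def read_sectioned_tsv(handle):
    """Split a sectioned TSV into {section: [row dicts]} using its '#' header lines."""
    columns = {}
    rows = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        tag = fields[0]
        if not tag:
            continue
        if tag.startswith("#"):
            # Header line, e.g. "#MAP id filename label size".
            columns[tag[1:]] = fields[1:]
        elif tag in columns:
            # Data line of a section whose header has already been seen.
            rows[tag].append(dict(zip(columns[tag], fields[1:])))
    return columns, dict(rows)


# Made-up miniature content, for illustration only.
example = (
    "#MAP\tid\tfilename\tlabel\tsize\n"
    "MAP\t0\trun1.mzML\t\t120\n"
    "MAP\t1\trun2.mzML\t\t98\n"
)

columns, sections = read_sectioned_tsv(io.StringIO(example))
for entry in sections["MAP"]:
    print(entry["id"], entry["filename"], entry["size"])
```

Rows whose leading tag has no preceding header line are skipped rather than guessed at, which keeps the sketch conservative if the file contains additional comment or section types.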

Intermediate results

This folder contains the intermediate results from various steps of the MHCquant pipeline (e.g. (un)filtered PSMs, aligned mzMLs, features).

Output files
  • intermediate_results/

    • alignment: Contains the trafoXML files of each run that document the retention time shift after alignment in quantification mode.

    • comet: Contains pin files generated by Comet after the database search.

    • percolator

      • {Sample}_{Condition}_psm.idXML: File holding extra features that will be used by Percolator. Created by PSMFeatureExtractor.
      • {Sample}_{Condition}_pout.idXML: Unfiltered Percolator output.
      • {Sample}_{Condition}_pout_filtered.idXML: FDR-filtered Percolator output.
    • features: Holds information on quantified features in featureXML files, produced by FeatureFinderIdentification in quantification mode.

  • ion_annotations

    • {Sample}_{Condition}_all_peaks.tsv: Contains metadata of all measured ions of peptides reported after peptide identification.

    • {Sample}_{Condition}_matching_ions.tsv: Contains ion annotations and additional metadata of peptides reported after peptide identification.

  • refined_fdr (Only if --refine_fdr_on_predicted_subset is specified)

    • *merged_psm_perc_filtered.mzTab: Exports the Percolator results, filtered by q-value, as mzTab.

    • *_all_ids_merged.mzTab: Exports all PSM results as mzTab.

    • *perc_subset.idXML: The outcome of a second OpenMS PercolatorAdapter run.

    • *pred_filtered.idXML: Contains the PSMs filtered by the prediction results on the reduced search space (MHCflurry output).

    • {ID}_-_{filename}_filtered: An output file of OPENMS_IDFILTER_REFINED.

VCF

Reference fasta

Output files
  • *_vcf.fasta: Created for the respective sample if --include_proteins_from_vcf is specified.

This is the FASTA database, including mutated proteins, that is used for the database search.

Neoepitopes

These CSV files list, respectively, all theoretically possible neoepitope sequences derived from the variants specified in the VCF, and the neoepitopes that are found during the mass spectrometry search, independent of binding predictions.

Found neoepitopes

Output files
  • class_1_bindings/

    • *found_neoepitopes_class1.csv: Generated when --include_proteins_from_vcf and --predict_class_1 are specified
  • class_2_bindings/

    • *found_neoepitopes_class2.csv: Generated when --include_proteins_from_vcf and --predict_class_2 are specified

This CSV lists all neoepitopes that are found during the mass spectrometry search, independent of binding predictions. The format is as follows:

peptide sequence geneID

vcf_neoepitopes

Output files
  • class_1_bindings/

    • *vcf_neoepitopes_class1.csv: Generated when --include_proteins_from_vcf and --predict_class_1 are specified
  • class_2_bindings/

    • *vcf_neoepitopes_class2.csv: Generated when --include_proteins_from_vcf and --predict_class_2 are specified

This CSV file contains all theoretically possible neoepitope sequences from the variants that were specified in the VCF. The format is shown below:

Sequence Antigen ID Variants

Class prediction

Class (1|2) bindings

Output files
  • class_1_bindings/

    • *predicted_peptides_class_1.csv: If --predict_class_1 is specified, then this CSV is generated
  • class_2_bindings/

    • *predicted_peptides_class_2.csv: If --predict_class_2 is specified, then this CSV is generated

This folder contains the binding predictions of all detected class 1 or 2 peptides and all theoretically possible neoepitope sequences. The prediction outputs are comma-separated tables (CSV) for each allele, listing each peptide sequence and its corresponding predicted affinity scores:

peptide allele prediction prediction_low prediction_high prediction_percentile
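A table with these columns can be filtered with the standard csv module. The sketch below keeps peptides whose prediction_percentile is at or below a cutoff. It assumes that a lower percentile means stronger predicted binding, and the cutoff of 2.0 is an arbitrary illustrative value, not a pipeline default. The function name, peptides, allele and numbers are made up.

```python
import csv
import io


def strong_binders(handle, max_percentile=2.0):
    """Return rows whose prediction_percentile is <= max_percentile.

    Assumption: lower percentile = stronger predicted binding.
    """
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if float(row["prediction_percentile"]) <= max_percentile
    ]


# Made-up miniature prediction table, for illustration only.
example = (
    "peptide,allele,prediction,prediction_low,prediction_high,prediction_percentile\n"
    "SIINFEKL,A*02:01,35.2,20.1,60.3,0.4\n"
    "AAAAAAAAA,A*02:01,21000.0,15000.0,30000.0,45.0\n"
)

strong = strong_binders(io.StringIO(example))
for row in strong:
    print(row["peptide"], row["allele"], row["prediction_percentile"])
```

The rows are returned as dictionaries of strings, so numeric columns need an explicit float() conversion before any further arithmetic.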

MultiQC

Output files
  • multiqc/

    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.

    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.

    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/

    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.html.
    • Reports generated by the pipeline: software_versions.yml.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
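When troubleshooting, the execution trace can be scanned for tasks that did not finish cleanly. The sketch below assumes the trace file is tab-separated with a header row containing at least the name, status and exit columns, and that successful tasks carry the status COMPLETED or CACHED. These assumptions match Nextflow's default trace layout as far as we know, but should be verified against your own execution_trace.txt. The function name and the process names in the example are hypothetical.

```python
import csv
import io


def unfinished_tasks(handle):
    """Return (name, status, exit) for tasks not marked COMPLETED or CACHED."""
    ok_states = {"COMPLETED", "CACHED"}
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["name"], row["status"], row["exit"])
        for row in reader
        if row["status"] not in ok_states
    ]


# Made-up miniature trace content, for illustration only.
example = (
    "task_id\thash\tname\tstatus\texit\n"
    "1\tab/123456\tSTEP_A (sample1)\tCOMPLETED\t0\n"
    "2\tcd/789abc\tSTEP_B (sample1)\tFAILED\t1\n"
)

problems = unfinished_tasks(io.StringIO(example))
for name, status, exit_code in problems:
    print(f"{name}: {status} (exit {exit_code})")
```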