Introduction

This document describes the output produced by the pipeline. The versions of all tools used in the pipeline are summarized in a MultiQC report, which is generated at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Variant prediction

Epytope is used to parse variants annotated by [SnpEff](http://pcingola.github.io/SnpEff/) or VEP. Based on this information, epytope generates all possible mutated peptides within the length boundaries set by --min_peptide_length_class[I|II] and --max_peptide_length_class[I|II]. Essentially the same peptide generation is applied to proteins when .fasta files are specified in the samplesheet.

Example: Suppose you have the missense mutation p.Cys138Tyr in ENSP00000235347 and you set --min_peptide_length_class[I|II] = --max_peptide_length_class[I|II] = 9. A subset of the table that epytope generates looks like this:

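| Mutated   | Wildtype  | Metadata |
| --------- | --------- | -------- |
| SKRQTVEDY | SKRQTVEDC |          |
| KRQTVEDYP | KRQTVEDCP |          |
| RQTVEDYPR | RQTVEDCPR |          |
| QTVEDYPRM | QTVEDCPRM |          |
| TVEDYPRMG | TVEDCPRMG |          |
| VEDYPRMGE | VEDCPRMGE |          |
| EDYPRMGEH | EDCPRMGEH |          |
| DYPRMGEHQ | DCPRMGEHQ |          |
| YPRMGEHQP | CPRMGEHQP |          |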
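
The peptide generation illustrated above can be sketched as a sliding window over the protein sequence. This is a minimal illustration, not epytope's implementation: the function name `mutated_peptides` is hypothetical, and the 17-residue `context` string is only the local sequence window that can be reconstructed from the example rows, not the full protein.

```python
def mutated_peptides(wildtype, pos, alt, length):
    """Yield (mutated, wildtype) peptide pairs of `length` residues that overlap `pos`.

    `pos` is the 0-based index of the substituted residue within `wildtype`,
    and `alt` is the residue introduced by the missense mutation.
    """
    mutated = wildtype[:pos] + alt + wildtype[pos + 1:]
    # Only windows that contain the mutated position are of interest.
    first = max(0, pos - length + 1)
    last = min(pos, len(wildtype) - length)
    for start in range(first, last + 1):
        yield mutated[start:start + length], wildtype[start:start + length]


# Local sequence window around Cys138, reconstructed from the example table.
context = "SKRQTVEDCPRMGEHQP"
pairs = list(mutated_peptides(context, pos=8, alt="Y", length=9))

for mutated, wildtype in pairs:
    print(mutated, wildtype, sep="\t")
```

With a peptide length of 9, a single substitution yields at most 9 overlapping peptides, which matches the nine rows of the example.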

Tables are written per chromosome in TSV format.

Output directory: epytope/[sample]_chr[1-22|X|Y].tsv

These generated mutated peptides are then passed to the MHC binding prediction subworkflow, where they are scored against the sample’s individual MHC typing.

Optionally, you can obtain the full mutated and wildtype protein sequences (here, of ENSP00000235347) by providing --fasta_output. This can be especially useful as an input database for mass spectrometry-based pipelines such as nf-core/mhcquant.

Output directory: epytope/[sample].fasta

Epitope prediction

Depending on the predictor(s) specified in --tools, each tool's individual binding prediction files are written to the respective directories. The input peptides for the MHC binding subworkflow are split into chunks to enable scalability. The chunk size is controlled by --peptides_split_minchunksize and --peptides_split_maxchunks.
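
To make the interplay of the two chunking parameters concrete, here is a minimal sketch of one plausible reading: produce at most `max_chunks` chunks, but never make a chunk smaller than `min_chunk_size` (except possibly the last one). The function `chunk_peptides` and its exact arithmetic are assumptions for illustration and are not taken from the pipeline's code.

```python
import math


def chunk_peptides(peptides, min_chunk_size, max_chunks):
    """Split a list of peptides into consecutive chunks.

    Assumed behaviour: the chunk size is the larger of `min_chunk_size`
    and the size needed to stay within `max_chunks` chunks.
    """
    if not peptides:
        return []
    size = max(min_chunk_size, math.ceil(len(peptides) / max_chunks))
    return [peptides[i:i + size] for i in range(0, len(peptides), size)]


# Hypothetical inputs: placeholder peptide names and example parameter values.
many = [f"peptide{i}" for i in range(1000)]
few = [f"peptide{i}" for i in range(250)]

large_chunks = chunk_peptides(many, min_chunk_size=100, max_chunks=5)
small_chunks = chunk_peptides(few, min_chunk_size=100, max_chunks=5)
print([len(c) for c in large_chunks])
print([len(c) for c in small_chunks])
```

Under this reading, large inputs are limited by the maximum number of chunks, while small inputs are limited by the minimum chunk size.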

Tool output directories:

  • mhcflurry/[sample]_chunk_[0-9]_predicted_mhcflurry.csv
  • mhcnuggets/[sample]_chunk_[0-9]_predicted_mhcnuggets.csv
  • mhcnuggetsii/[sample]_chunk_[0-9]_predicted_mhcnuggetsii.csv
  • netmhcpan/[sample]_chunk_[0-9]_predicted_netmhcpan.xls
  • netmhciipan/[sample]_chunk_[0-9]_predicted_netmhciipan.xls

These predictor-specific output files are harmonized, and the chunks are merged per sample according to the sample information in your samplesheet.

Output directory: predictions/[sample].tsv.

Output files always contain the columns --peptide_col_name (default: ‘sequence’), allele, BA, rank, binder, and predictor. All further metadata columns are carried over into the output files.

An example prediction result looks like this in TSV format:

| metadata | sequence    | allele      | BA     | rank   | binder | predictor  |
| -------- | ----------- | ----------- | ------ | ------ | ------ | ---------- |
| peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.416  | 0.1215 | True   | netmhcpan  |
| peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.3873 | 0.0007 | False  | mhcnuggets |
| peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.6072 | 0.0465 | True   | mhcflurry  |
| peptide2 | VTAVIRSRRY  | HLA-A*68:01 | 0.3189 | 0.7457 | True   | netmhcpan  |
| peptide2 | VTAVIRSRRY  |             |        |        |        |            |
| peptide2 | VTAVIRSRRY  | HLA-A*68:01 | 0.3455 | 2.5875 | False  | mhcflurry  |
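
As a minimal sketch of how such a file could be consumed downstream, the snippet below reads tab-separated predictions with Python's standard `csv` module and keeps the rows flagged as binders. It assumes that the binder column is serialized as the strings `True` and `False`, as in the example, and it uses an inline copy of a few example rows; in practice you would open predictions/[sample].tsv instead. The helper name `binders` is hypothetical.

```python
import csv
import io

# Inline copy of some example rows; the fourth row has empty prediction fields.
EXAMPLE_TSV = (
    "metadata\tsequence\tallele\tBA\trank\tbinder\tpredictor\n"
    "peptide1\tRLDSHLHTHVY\tHLA-A*01:01\t0.416\t0.1215\tTrue\tnetmhcpan\n"
    "peptide1\tRLDSHLHTHVY\tHLA-A*01:01\t0.3873\t0.0007\tFalse\tmhcnuggets\n"
    "peptide2\tVTAVIRSRRY\t\t\t\t\t\n"
    "peptide2\tVTAVIRSRRY\tHLA-A*68:01\t0.3455\t2.5875\tFalse\tmhcflurry\n"
)


def binders(handle):
    """Return all rows whose binder column equals the string 'True'."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Rows with an empty binder field (unsupported allele or length) are skipped.
    return [row for row in reader if row["binder"] == "True"]


hits = binders(io.StringIO(EXAMPLE_TSV))
for row in hits:
    print(row["sequence"], row["allele"], row["predictor"], row["rank"])
```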

The prediction results are given as allele-specific Binding Affinity (BA) and percentile ranks (rank) per peptide. The computation of these values depends on the applied prediction method. Binding Affinity represents the predicted strength of the interaction between a peptide and an MHC molecule. It is derived from the predicted IC50 value (in nanomolar, nM) and normalized to a scale between 0 and 1 using the formula:

$$BA = 1 - \frac{\log_{10}(\text{aff})}{\log_{10}(50000)}$$

where aff is the predicted IC50 binding affinity in nM. Lower IC50 values indicate stronger binding; peptides with IC50 values below 500 nM are commonly considered binders.
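
The normalization can be checked numerically with a short sketch. The function names `ba_from_ic50` and `ic50_from_ba` are hypothetical; the second simply inverts the formula to recover an IC50 value from a normalized score.

```python
import math

MAX_IC50 = 50000.0  # normalization constant from the formula, in nM


def ba_from_ic50(aff):
    """Normalize an IC50 value (nM) to the BA score used in the output files."""
    return 1.0 - math.log10(aff) / math.log10(MAX_IC50)


def ic50_from_ba(ba):
    """Invert the normalization: recover the IC50 value (nM) from a BA score."""
    return MAX_IC50 ** (1.0 - ba)


# An IC50 of 1 nM maps to BA = 1, and 50000 nM maps to BA = 0.
print(ba_from_ic50(1.0), ba_from_ic50(50000.0))
# The 500 nM threshold expressed on the BA scale.
print(round(ba_from_ic50(500.0), 3))
```

On this scale, higher BA values mean stronger predicted binding, and an IC50 of 500 nM corresponds to a BA of roughly 0.43.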

Percentile rank (rank) indicates the relative binding strength of a peptide compared to a large set of random natural peptides. This measure is not affected by the inherent bias of certain MHC molecules towards higher or lower mean predicted affinities. Strong binders are defined as having rank < 0.5 and weak binders as having rank < 2. For example, a peptide with a rank of 0.1 is among the top 0.1% of binders. Because it accounts for the variability in binding thresholds, this approach ensures a more consistent selection across different MHC alleles. It is therefore advised to select candidate binders based on rank rather than binding affinity, and the binder column is consequently defined based on the rank. An exception is MHCnuggets, whose percentile rank computation is considered experimental: for this predictor, the binder column is defined based on BA, and it is advised to use the BA column for selecting binders.
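
The rank thresholds described above can be expressed as a small classification helper. This is a sketch only: the function name `classify_by_rank` and the label strings are hypothetical, and the BA-based exception for MHCnuggets is deliberately not covered.

```python
STRONG_RANK = 0.5  # strong binders: rank < 0.5
WEAK_RANK = 2.0    # weak binders: rank < 2


def classify_by_rank(rank):
    """Classify a peptide by its percentile rank (lower means stronger binding)."""
    if rank < STRONG_RANK:
        return "strong"
    if rank < WEAK_RANK:
        return "weak"
    return "non-binder"


# Rank values taken from the example prediction table.
for rank in (0.1215, 0.7457, 2.5875):
    print(rank, classify_by_rank(rank))
```

In the example table, the rows with ranks 0.1215 and 0.7457 are flagged as binders and the row with rank 2.5875 is not, which is consistent with a cutoff at rank < 2.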

Note

Output files can contain empty fields, which indicate that one of the provided predictors does not support the provided allele and/or peptide length. A curated list of supported alleles can be found under assets/supported_alleles.json. The number of peptides that could not be predicted due to unsupported alleles or peptide lengths is documented in the MultiQC report. See Usage for predictor boundaries.

Optionally, you can provide --wide_format_output to obtain your results in wide format.

An example of the wide format looks like this:

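| metadata | sequence    | allele      | netmhcpan_BA | netmhcpan_rank | netmhcpan_binder | mhcnuggets_BA | mhcnuggets_rank | mhcnuggets_binder | mhcflurry_BA | mhcflurry_rank | mhcflurry_binder |
| -------- | ----------- | ----------- | ------------ | -------------- | ---------------- | ------------- | --------------- | ----------------- | ------------ | -------------- | ---------------- |
| peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.416        | 0.1215         | True             | 0.3873        | 0.0007          | False             | 0.6072       | 0.0465         | True             |
| peptide2 | VTAVIRSRRY  | HLA-A*68:01 | 0.3189       | 0.7457         | True             |               |                 |                   | 0.3455       | 2.5875         | False            |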
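
The relationship between the long and the wide layout can be sketched as a pivot over the predictor column. This is an illustration with the standard library only, not the pipeline's implementation; the function `to_wide` is hypothetical, and it assumes that each long-format row carries its allele.

```python
from collections import defaultdict

FIELDS = ("BA", "rank", "binder")


def to_wide(rows, predictors):
    """Pivot long-format rows into one record per (metadata, sequence, allele)."""
    values = defaultdict(dict)
    for row in rows:
        key = (row["metadata"], row["sequence"], row["allele"])
        for field in FIELDS:
            values[key][f"{row['predictor']}_{field}"] = row[field]
    table = []
    for (metadata, sequence, allele), cells in values.items():
        record = {"metadata": metadata, "sequence": sequence, "allele": allele}
        for predictor in predictors:
            for field in FIELDS:
                column = f"{predictor}_{field}"
                # Missing predictions stay empty, as in the output files.
                record[column] = cells.get(column, "")
        table.append(record)
    return table


def make_row(metadata, sequence, allele, ba, rank, binder, predictor):
    return {"metadata": metadata, "sequence": sequence, "allele": allele,
            "BA": ba, "rank": rank, "binder": binder, "predictor": predictor}


# Long-format rows copied from the example prediction table.
long_rows = [
    make_row("peptide1", "RLDSHLHTHVY", "HLA-A*01:01", "0.416", "0.1215", "True", "netmhcpan"),
    make_row("peptide1", "RLDSHLHTHVY", "HLA-A*01:01", "0.3873", "0.0007", "False", "mhcnuggets"),
    make_row("peptide1", "RLDSHLHTHVY", "HLA-A*01:01", "0.6072", "0.0465", "True", "mhcflurry"),
    make_row("peptide2", "VTAVIRSRRY", "HLA-A*68:01", "0.3189", "0.7457", "True", "netmhcpan"),
    make_row("peptide2", "VTAVIRSRRY", "HLA-A*68:01", "0.3455", "2.5875", "False", "mhcflurry"),
]

wide = to_wide(long_rows, ["netmhcpan", "mhcnuggets", "mhcflurry"])
for record in wide:
    print("\t".join(record.values()))
```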

MultiQC

Binding prediction results are summarized into tables, such as the number of binders/non-binders. Binding prediction score distributions are also highlighted to give the user an appropriate overview of the binding prediction results.

Output directory: multiqc/

  • multiqc_data/
    • Underlying data to generate MultiQC plots
  • multiqc_plots/
    • Plots in pdf, png, and svg format that are part of the MultiQC report
  • multiqc_report.html
    • The main MultiQC report, comprising statistics and distributions of the binding prediction results.

For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.html.
    • Reports generated by the pipeline: software_versions.yml.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors in a pipeline run and provide other information such as launch commands, run times, and resource usage.