nf-core/epitopeprediction
A bioinformatics best-practice analysis pipeline for epitope prediction and annotation
Introduction
This document describes the output produced by the pipeline. The versions of all tools used in the pipeline are summarized in a MultiQC report, which is generated at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Variant prediction
Epytope is used to parse annotated variants (by [SnpEff](http://pcingola.github.io/SnpEff/) or VEP). Based on this information, epytope generates all possible mutated peptides within the length boundaries set by `--min_peptide_length_class[I|II]` and `--max_peptide_length_class[I|II]`. Essentially the same peptide generation from proteins is applied when specifying `.fasta` files in the samplesheet.
Example: Suppose you have the missense mutation `p.Cys138Tyr` in `ENSP00000235347` and you set `min_peptide_length_class[I|II] = max_peptide_length_class[I|II] = 9`. A subset of the table epytope generates looks like this:
Mutated | Wildtype | Metadata |
---|---|---|
SKRQTVEDY | SKRQTVEDC | … |
KRQTVEDYP | KRQTVEDCP | … |
RQTVEDYPR | RQTVEDCPR | … |
QTVEDYPRM | QTVEDCPRM | … |
TVEDYPRMG | TVEDCPRMG | … |
VEDYPRMGE | VEDCPRMGE | … |
EDYPRMGEH | EDCPRMGEH | … |
DYPRMGEHQ | DCPRMGEHQ | … |
YPRMGEHQP | CPRMGEHQP | … |
Tables are written per chromosome in `tsv` format.
Output directory: epytope/[sample]_chr[1-22|X|Y].tsv
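The peptide generation in the example above can be illustrated with a short Python sketch. This is not epytope's implementation; the function name `peptides_covering_mutation` and its parameters are hypothetical, and the 17-residue sequence context is reconstructed from the wildtype column of the example table. The sketch slides a window of fixed length over the sequence and keeps every window that contains the mutated position.

```python
def peptides_covering_mutation(wildtype, position, new_residue, length):
    """Return (mutated, wildtype) peptide pairs of `length` that span `position`.

    `position` is a 0-based index into `wildtype` (hypothetical helper,
    not part of epytope).
    """
    mutated = wildtype[:position] + new_residue + wildtype[position + 1:]
    # Earliest and latest window starts that still contain the mutated residue
    first_start = max(0, position - length + 1)
    last_start = min(position, len(wildtype) - length)
    return [
        (mutated[start:start + length], wildtype[start:start + length])
        for start in range(first_start, last_start + 1)
    ]


# Sequence context around the mutated cysteine, taken from the example table
context = "SKRQTVEDCPRMGEHQP"
pairs = peptides_covering_mutation(context, position=8, new_residue="Y", length=9)
for mutated_peptide, wildtype_peptide in pairs:
    print(mutated_peptide, wildtype_peptide)
```

For a 9-mer window this yields nine peptide pairs, one for each possible offset of the mutated residue within the peptide.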
These generated mutated peptides are then passed to the MHC binding prediction subworkflow, where they are scored against the sample’s individual MHC typing.
Optionally, you can obtain the full mutated and wildtype protein sequence of `ENSP00000235347` by providing `--fasta_output`. This can be especially useful as an input database for mass spectrometry-based pipelines such as nf-core/mhcquant.
Output directory: epytope/[sample].fasta
Epitopeprediction
Depending on the predictor(s) specified in `--tools`, each tool's individual binding prediction files are written to the respective directories. The input peptides for the MHC binding subworkflow are split into chunks to enable scalability.
The chunk size is controlled by `--peptides_split_minchunksize` and `--peptides_split_maxchunks`.
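How the two parameters might interact can be sketched as follows. The exact rule the pipeline applies is not described in this document, so the logic below is an assumption: the chunk size is taken as the larger of the minimum chunk size and the size needed to stay within the maximum number of chunks. The function name `split_into_chunks` and the placeholder peptide names are hypothetical.

```python
import math


def split_into_chunks(peptides, min_chunk_size, max_chunks):
    """Split `peptides` into consecutive chunks (assumed rule, see lead-in)."""
    if not peptides:
        return []
    # Never go below the minimum chunk size, never exceed the maximum chunk count
    chunk_size = max(min_chunk_size, math.ceil(len(peptides) / max_chunks))
    return [peptides[i:i + chunk_size] for i in range(0, len(peptides), chunk_size)]


peptides = [f"PEPTIDE{i}" for i in range(10)]  # placeholder peptide names
chunks = split_into_chunks(peptides, min_chunk_size=3, max_chunks=2)
print([len(chunk) for chunk in chunks])
```

Under this assumption, only the last chunk can be smaller than the minimum chunk size, because it holds whatever peptides remain.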
Tools output directory:
mhcflurry/[sample]_chunk_[0-9]_predicted_mhcflurry.csv
mhcnuggets/[sample]_chunk_[0-9]_predicted_mhcnuggets.csv
mhcnuggetsii/[sample]_chunk_[0-9]_predicted_mhcnuggetsii.csv
netmhcpan/[sample]_chunk_[0-9]_predicted_netmhcpan.xls
netmhciipan/[sample]_chunk_[0-9]_predicted_netmhciipan.xls
These predictor-specific output files are harmonized, and chunks are merged based on the `sample` information of your samplesheet.
Output directory: predictions/[sample].tsv
Output files always contain the column named by `--peptide_col_name` (default: `sequence`) and the columns `allele`, `BA`, `rank`, `binder`, and `predictor`. All further metadata columns are carried over into the output files.
An example prediction result looks like this in TSV format:
metadata | sequence | allele | BA | rank | binder | predictor |
---|---|---|---|---|---|---|
peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.416 | 0.1215 | True | netmhcpan |
peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.3873 | 0.0007 | False | mhcnuggets |
peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.6072 | 0.0465 | True | mhcflurry |
peptide2 | VTAVIRSRRY | HLA-A*68:01 | 0.3189 | 0.7457 | True | netmhcpan |
peptide2 | VTAVIRSRRY | | | | | |
peptide2 | VTAVIRSRRY | HLA-A*68:01 | 0.3455 | 2.5875 | False | mhcflurry |
The prediction results are given as allele-specific binding affinity (`BA`) and percentile rank (`rank`) per peptide. The computation of these values depends on the applied prediction method. Binding affinity represents the predicted strength of the interaction between a peptide and an MHC molecule. It is derived from the predicted IC50 value (in nanomolar, nM) and normalized to a scale between 0 and 1 using the formula:

$$\text{BA} = 1 - \frac{\log(\text{aff})}{\log(50000)}$$

where aff is the predicted IC50 binding affinity. Lower IC50 values indicate stronger binding and therefore yield higher BA values, with peptides having IC50 values below 500 nM typically considered strong binders.
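The normalization can be tried out with a minimal Python sketch. It assumes the log-transformation commonly used by NetMHC-style predictors, BA = 1 - log(aff) / log(50000), where 50000 nM is the assumed upper end of the IC50 scale. The function names `ic50_to_ba` and `ba_to_ic50` are hypothetical and not part of the pipeline.

```python
import math

MAX_IC50_NM = 50000.0  # assumed upper end of the IC50 scale (nM)


def ic50_to_ba(ic50_nm):
    """Map an IC50 value in nM to a normalized binding affinity score."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 1.0 - math.log(ic50_nm) / math.log(MAX_IC50_NM)


def ba_to_ic50(ba):
    """Invert the normalization to recover the IC50 value in nM."""
    return MAX_IC50_NM ** (1.0 - ba)


print(round(ic50_to_ba(500.0), 4))   # score corresponding to the 500 nM threshold
print(round(ba_to_ic50(0.416), 1))   # IC50 corresponding to a BA of 0.416
```

Under this assumption, the 500 nM threshold corresponds to a BA of roughly 0.426, so scores above that value indicate IC50 values below 500 nM.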
Percentile rank (`rank`) indicates the relative binding strength of a peptide compared to a large set of random natural peptides. This measure is not affected by the inherent bias of certain MHC molecules towards higher or lower mean predicted affinities. Strong binders are defined as having rank < 0.5, and weak binders as having rank < 2. For example, a peptide with a rank of 0.1 is among the top 0.1% of best binders. This approach ensures a more consistent selection across different MHC alleles, as it accounts for variability in binding thresholds. It is advised to select candidate binders based on rank rather than binding affinity. Consequently, the `binder` column is defined based on the rank. The exception is MHCnuggets: its percentile rank computation is considered experimental, so for this predictor the `binder` column is defined based on the `BA` column, and it is advised to select binders by `BA` as well.
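The rank thresholds can be expressed as a small Python sketch. The function name `classify_by_rank` and the returned labels are hypothetical; only the thresholds 0.5 and 2 come from the text. The sketch does not cover the MHCnuggets exception, because the BA threshold used there is not stated in this document.

```python
def classify_by_rank(rank, strong_threshold=0.5, weak_threshold=2.0):
    """Classify a peptide by percentile rank (lower rank means stronger binding)."""
    if rank < strong_threshold:
        return "strong"
    if rank < weak_threshold:
        return "weak"
    return "non-binder"


# Ranks taken from the netmhcpan and mhcflurry rows of the example table
for rank in (0.1215, 0.7457, 2.5875):
    print(rank, classify_by_rank(rank))
```

In the example table, the rows with ranks 0.1215 and 0.7457 are flagged as binders and the row with rank 2.5875 is not, which is consistent with a cut-off at rank < 2.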
Output files can contain empty cells, which indicate that one of the provided predictors does not support the provided allele and/or peptide length. A curated list of supported alleles can be found under `assets/supported_alleles.json`. The number of peptides that could not be predicted due to unsupported alleles or peptide lengths is documented in the MultiQC report. See Usage for predictor boundaries.
Optionally, you can provide `--wide_format_output` to obtain your results in wide format.
An example of the wide format looks like this:
metadata | sequence | allele | netmhcpan_BA | netmhcpan_rank | netmhcpan_binder | mhcnuggets_BA | mhcnuggets_rank | mhcnuggets_binder | mhcflurry_BA | mhcflurry_rank | mhcflurry_binder |
---|---|---|---|---|---|---|---|---|---|---|---|
peptide1 | RLDSHLHTHVY | HLA-A*01:01 | 0.416 | 0.1215 | True | 0.3873 | 0.0007 | False | 0.6072 | 0.0465 | True |
peptide2 | VTAVIRSRRY | HLA-A*68:01 | 0.3189 | 0.7457 | True | | | | 0.3455 | 2.5875 | False |
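The relationship between the long and the wide format can be sketched in Python using only the standard library. This is an illustration, not the pipeline's own code: the function name `long_to_wide` is hypothetical, and the input rows are the `peptide1` rows from the long-format example. Each predictor's `BA`, `rank`, and `binder` values are moved into columns prefixed with the predictor name.

```python
import csv
import io

# peptide1 rows from the long-format example, as tab-separated text
LONG_TSV = """metadata\tsequence\tallele\tBA\trank\tbinder\tpredictor
peptide1\tRLDSHLHTHVY\tHLA-A*01:01\t0.416\t0.1215\tTrue\tnetmhcpan
peptide1\tRLDSHLHTHVY\tHLA-A*01:01\t0.3873\t0.0007\tFalse\tmhcnuggets
peptide1\tRLDSHLHTHVY\tHLA-A*01:01\t0.6072\t0.0465\tTrue\tmhcflurry
"""


def long_to_wide(rows, key_cols=("metadata", "sequence", "allele"),
                 value_cols=("BA", "rank", "binder")):
    """Collapse one row per predictor into one row per peptide and allele."""
    wide = {}
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        entry = wide.setdefault(key, dict(zip(key_cols, key)))
        for col in value_cols:
            # e.g. "netmhcpan_BA", "mhcflurry_rank"
            entry[f"{row['predictor']}_{col}"] = row[col]
    return list(wide.values())


rows = list(csv.DictReader(io.StringIO(LONG_TSV), delimiter="\t"))
wide_rows = long_to_wide(rows)
print(wide_rows[0])
```

A predictor that produced no result for a peptide has no row to contribute, which corresponds to the empty cells in the wide table.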
MultiQC
Binding prediction results are summarized into tables, such as the number of binders/non-binders. Binding prediction score distributions are also highlighted to give the user an appropriate overview of the binding prediction results.
Output directory: multiqc/
- `multiqc_data/`
  - Underlying data to generate MultiQC plots
- `multiqc_plots/`
  - Plots in `pdf`, `png`, and `svg` format that are part of the MultiQC report
- `multiqc_report.html`
  - The main MultiQC report, comprising statistics and distributions of the binding prediction results.
For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.html`.
  - Reports generated by the pipeline: `software_versions.yml`.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.