13. Multiple Instance Learning-based MHC-I Epitope Prediction


Benjamin Schubert

Our group works on developing and applying methods from bioinformatics, machine learning, and combinatorial optimization to gain a deeper understanding of the immune system’s involvement in disease and to aid in the design of new immunotherapies.

What is the data science project’s research question? The aim of this project is to extend an MHC-I epitope prediction model developed in our group and to evaluate quantitatively how well current MHC binding prediction algorithms identify the correct MHC allele that a peptide binds to.

What data will be worked on? 

The project will make use of publicly available datasets, such as those released
by the developers of netMHCpan [1], the Immune Epitope Database (IEDB) [2], and
the HLA Ligand Atlas [3].

[1] http://www.cbs.dtu.dk/suppl/immunology/NAR_NetMHCpan_NetMHCIIpan/
[2] https://www.iedb.org/
[3] https://hla-ligand-atlas.org/welcome

What tasks will this project involve? The first task is to generate a new synthetic dataset from the resources listed above, creating a realistic ground truth against which to evaluate the algorithms. The second task is to obtain the predictions of existing MHC binding prediction tools on the synthetic dataset. Next, these predictions should be used to rank the algorithms by accuracy and other performance metrics. The final task is to train a novel multiple-instance learning deep neural network, already developed by the hosting research group, on this new dataset and to evaluate it in the same way as the other algorithms.
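
As a rough illustration of the first three tasks, the sketch below hides each known peptide-allele pair among decoy alleles and then measures how often a predictor ranks the true allele first. It is only one possible setup, not the group's actual protocol: the function names, the bag size, and the toy peptides and allele labels are all hypothetical, and `predict` stands in for any existing binding prediction tool wrapped as a Python callable.

```python
import random


def make_bags(single_allele_pairs, alleles, bag_size=6, seed=0):
    """Hide each known (peptide, allele) pair among randomly drawn decoy alleles.

    Assumes `alleles` contains at least `bag_size` distinct entries.
    """
    rng = random.Random(seed)
    bags = []
    for peptide, true_allele in single_allele_pairs:
        decoys = [a for a in alleles if a != true_allele]
        candidates = rng.sample(decoys, bag_size - 1) + [true_allele]
        rng.shuffle(candidates)
        bags.append((peptide, candidates, true_allele))
    return bags


def allele_accuracy(bags, predict):
    """Fraction of bags in which the top-scoring candidate is the true allele."""
    hits = 0
    for peptide, candidates, true_allele in bags:
        best = max(candidates, key=lambda allele: predict(peptide, allele))
        hits += best == true_allele
    return hits / len(bags)


def rank_tools(bags, tools):
    """Sort tools (name -> predict callable) by allele accuracy, best first."""
    scores = {name: allele_accuracy(bags, fn) for name, fn in tools.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up data purely to exercise the functions.
pairs = [("PEPTIDEAA", "allele_A"), ("PEPTIDEBB", "allele_B")]
alleles = ["allele_A", "allele_B", "allele_C", "allele_D"]
truth = dict(pairs)
bags = make_bags(pairs, alleles, bag_size=3)
tools = {
    "oracle": lambda p, a: 1.0 if truth[p] == a else 0.0,
    "inverse": lambda p, a: 0.0 if truth[p] == a else 1.0,
}
ranking = rank_tools(bags, tools)
```

Top-1 accuracy is used here only because it is the simplest metric; other performance metrics could be substituted in `allele_accuracy` without changing the rest of the sketch.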

What makes this project interesting to work on? Predicting epitope presentation on surface MHC molecules is a crucial task in immunoinformatics pipelines that aim to develop epitope vaccines against cancer and infectious diseases or treatments for autoimmune diseases. However, the vast majority of training data is ambiguous, since it only links an epitope to several possible MHC molecules, of which only one or two actually present the epitope. Identifying these MHC molecules is important for tailoring treatments to specific patient populations, but so far no systematic computational evaluation exists. The goal of the project is therefore to shed light on this important question.
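
The ambiguity described above is what multiple-instance learning is designed for: each epitope comes with a "bag" of candidate MHC molecules and only a bag-level label. The sketch below shows max pooling, one common MIL aggregation, in plain Python. It is an assumption for illustration and not necessarily the aggregation used in the group's model; the function names and the example logits are made up.

```python
import math


def bag_probability(instance_logits):
    """Max-pooling MIL: the bag is as positive as its best-scoring instance."""
    return 1.0 / (1.0 + math.exp(-max(instance_logits)))


def bag_loss(instance_logits, bag_label):
    """Binary cross-entropy between the bag-level probability and the bag label."""
    p = bag_probability(instance_logits)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))


def most_likely_allele(alleles, instance_logits):
    """Attribute the epitope to the allele with the highest per-instance score."""
    best_index = max(range(len(alleles)), key=lambda i: instance_logits[i])
    return alleles[best_index]


# Hypothetical per-allele scores for one epitope and its candidate alleles.
alleles = ["allele_A", "allele_B", "allele_C"]
logits = [-2.0, 0.0, -1.0]
prob = bag_probability(logits)
loss = bag_loss(logits, 1)
presenter = most_likely_allele(alleles, logits)
```

With max pooling, only the best-scoring allele contributes to the bag-level loss, which is what lets a model trained on ambiguous data still name a single presenting allele at prediction time.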

What is the expected outcome? Contribution to research paper, Contribution to software development

What infrastructure, programs and tools will be used? Can they be used remotely? PyTorch, the HMGU GPU cluster, and public state-of-the-art MHC-I epitope prediction tools. All resources can be accessed remotely.

What skills are necessary for this project? Data analytics / statistics, Data mining / Machine learning, Deep learning

Is the data open source? Yes

Interested candidates should be at PhD level. Benjamin Schubert is looking for 1 visiting scientist, who will work on the project with Emilio Dorigatti as co-supervisor (emilio.dorigatti@helmholtz-muenchen.de).