MaveDB Amino Acid Substitution Prediction

Summary: Understanding the functional consequences of genetic variants is a cornerstone of modern genomics. Multiplexed Assays of Variant Effect (MAVEs) provide high-throughput measurements of how thousands of variants impact gene function, and MaveDB is a growing repository of these datasets. Even so, it remains expensive and time-consuming to experimentally test every variant. In this challenge, your task is to develop a machine learning model that can predict the functional outcomes of amino acid substitutions.

Method areas: Regression models, foundation models, embeddings, zero-shot prediction, protein language models.

Prerequisites: A solid grasp of Python and machine learning fundamentals is expected. Familiarity with PyTorch and Hugging Face is highly recommended.

Background

Proteins are the molecular machines that living cells rely on for various functions, including structural support, signaling, and metabolism. They are made up of chains of amino acid residues, which makes it possible to represent a protein as a sequence of symbols, where each symbol represents an amino acid, for example:

MPLYSVTVKWGKEKFEGVELNTDEPPMVFKAQLFALTGVQP

What happens to the protein’s function if one of the amino acid residues is substituted for another?

MPLYSVTVK(W)GKEKFEGVELNTDEPPMVFKAQLFALTGVQP

                           ↓

MPLYSVTVK(D)GKEKFEGVELNTDEPPMVFKAQLFALTGVQP

Here, the reference amino acid Tryptophan (W) at position 10 is substituted with the alternate amino acid Aspartic Acid (D). Does this change the function of the protein? And if so, how?
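For concreteness, a substitution like the one above can be applied to a sequence string with a few lines of Python. This is a minimal sketch (the function name and the 1-based position convention are our assumptions; always confirm the position convention against the data you download):

```python
def apply_substitution(sequence: str, position: int, ref: str, alt: str) -> str:
    """Apply a single amino acid substitution at a 1-based position.

    Raising on a reference mismatch is a useful sanity check when
    joining variant tables to their reference sequences.
    """
    found = sequence[position - 1]
    if found != ref:
        raise ValueError(f"Expected {ref} at position {position}, found {found}")
    return sequence[: position - 1] + alt + sequence[position:]

seq = "MPLYSVTVKWGKEKFEGVELNTDEPPMVFKAQLFALTGVQP"
print(apply_substitution(seq, 10, "W", "D"))
# MPLYSVTVKDGKEKFEGVELNTDEPPMVFKAQLFALTGVQP
```

The mismatch check matters in practice: an off-by-one in the position convention will surface immediately as a `ValueError` instead of silently corrupting your inputs.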

The Multiplexed Assays of Variant Effect (MAVEs) from MaveDB that we use in this challenge are experiments in which many such substitutions are tested, and their effects are measured in various ways. The measurements can range from something as general as cell growth to something as specific as the rate of transport of a particular molecule. Ultimately, each substitution in each MAVE gets a single numerical score representing the functional change, but note that the same number can mean different things in different experimental studies!
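Because scores are only comparable within a single experiment, one common preprocessing step is to standardize scores within each scoreset before training across experiments. The sketch below (our own suggestion, not part of the challenge specification; the accession strings are illustrative placeholders) uses only the standard library:

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_per_scoreset(records):
    """Standardize scores within each MaveDB scoreset, so a model trained
    across many experiments sees targets on a comparable scale.

    `records` is a list of (scoreset_accession, score) pairs; returns the
    z-scored values in the same order.
    """
    groups = defaultdict(list)
    for accession, score in records:
        groups[accession].append(score)
    stats = {}
    for accession, values in groups.items():
        sd = stdev(values) if len(values) > 1 else 0.0
        stats[accession] = (mean(values), sd or 1.0)  # guard against zero spread
    return [(score - stats[a][0]) / stats[a][1] for a, score in records]

records = [
    ("urn:mavedb:00000001-a-1", 0.2), ("urn:mavedb:00000001-a-1", 0.8),
    ("urn:mavedb:00000002-a-1", -3.0), ("urn:mavedb:00000002-a-1", 5.0),
]
print([round(z, 3) for z in zscore_per_scoreset(records)])
# [-0.707, 0.707, -0.707, 0.707]
```

Note how the two scoresets end up on the same scale even though their raw score ranges differ by an order of magnitude; whether z-scoring is the right normalization for a given assay is itself a modeling decision.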

Prediction task

The model you develop will need to predict the MAVE numerical score of a substitution given the reference sequence, the position of the substituted amino acid, and the new alternate amino acid. You are encouraged to use any additional external information you find helpful for your model.

Data will be available in September on Kaggle at https://www.kaggle.com/competitions/mave-db-amino-acid-substitution-prediction/

For each amino acid substitution in the training/test set, we provide:

  • The MaveDB scoreset accession. You can use this to look up auxiliary information about the protein studied and the measurement technique used, which can help encode experimental context as features for your model.
  • The accession of the reference amino acid sequence. You can use this to look up the full amino acid sequence to use as input for a language model.
  • The position of the substitution in the reference sequence.
  • The reference amino acid at the position.
  • The alternate amino acid at the position.
  • In the training set only: the numerical measurement of protein function from the experiment (as a single score).
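As a starting point before reaching for language-model embeddings, the tabular fields above can already be turned into a simple baseline feature vector. The encoding below is one possible sketch (the function and the alphabet ordering are our own choices, not part of the provided data format):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_substitution(position: int, ref: str, alt: str, seq_length: int) -> list:
    """Baseline features for one substitution: one-hot reference amino acid,
    one-hot alternate amino acid, and relative position in the sequence.
    Scoreset metadata or protein-LM embeddings can be concatenated later.
    """
    ref_onehot = [1.0 if aa == ref else 0.0 for aa in AMINO_ACIDS]
    alt_onehot = [1.0 if aa == alt else 0.0 for aa in AMINO_ACIDS]
    return ref_onehot + alt_onehot + [position / seq_length]

# The W→D substitution at position 10 of the 41-residue example sequence:
features = encode_substitution(10, "W", "D", 41)
print(len(features))  # 20 + 20 + 1 = 41 features
```

Such a featurization ignores sequence context entirely, which is exactly the gap that protein language model embeddings (see the resources below) are meant to fill.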

A note on using external data

Since both the training and test data are available in MaveDB, we ask participants to take care not to use any of the test set numerical measurements from MaveDB or the associated publications. Other than that, bringing in external data to enrich the features available to the model is encouraged.

Required

  • Version Control with GitHub Desktop: All hackathon attendees must know how to work in Git/GitHub. Click the link if you need a quick refresher (1-hour tutorial) before the hackathon kicks off in September!
  • Intro to Python: The go-to ecosystem for protein language models uses Python with PyTorch. You’re welcome to work in another language if you have an idea for a different approach and the consent of your team members.

Recommended

  • The Hugging Face Model Hub is a great resource for accessing pre-trained models.
  • Zero-shot classification: Using large pre-trained models on novel tasks. These resources are written with a focus on LLMs, but many of the same ideas apply to protein language models.
  • ESM Cambrian: a state-of-the-art protein language model.
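To make the zero-shot idea concrete: one widely used zero-shot variant effect score from a masked protein language model is the log-likelihood ratio of the alternate versus the reference amino acid at the substituted position. The sketch below uses a hand-written probability dictionary in place of real model output (the numbers are illustrative assumptions, not actual ESM predictions):

```python
import math

def llr_score(log_probs: dict, ref: str, alt: str) -> float:
    """Zero-shot variant effect score: log-likelihood ratio of the alternate
    vs. reference amino acid at the substituted position.

    `log_probs` stands in for the per-residue log-probabilities a masked
    protein language model would assign when the substituted position is
    masked; a negative score suggests the substitution is disfavored.
    """
    return log_probs[alt] - log_probs[ref]

# Hypothetical model output at the masked position: the model prefers W.
log_probs = {"W": math.log(0.6), "D": math.log(0.01)}
print(llr_score(log_probs, "W", "D"))  # negative → likely deleterious
```

With a real model, `log_probs` would come from masking the substituted position, running a forward pass, and taking the log-softmax over the amino acid vocabulary; the LLR itself is computed exactly as above.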

The training and test sets will be available on Kaggle after the launch date on Sept. 1, 2025: https://www.kaggle.com/competitions/mave-db-amino-acid-substitution-prediction/

If you have any questions about participating, please contact the challenge organizer: Yuriy Sverchkov (sverchkov@wisc.edu).