Using Deep Learning and Python to Cure Diseases
This is a follow-up piece. If you missed Part I, dig in here for the baseline:
Using Deep Learning to Invent Medicine
As you will recall from Part I, I am solo attempting one of the hottest Kaggle.com competitions right now. The game is predicting protein binding… will these molecules bind to their protein targets or not?
Drug companies are crowdsourcing different methods (mostly machine learning models) for digesting SMILES data, creating representations of that data, using those representations to build a prediction machine (a model), and leveraging that digital machine to find the best candidate compounds to develop in real life.
Using math to save lives.
Since the dataset is massive and unbalanced I need to:
work with a much smaller subset of the data to build my v1 pipeline
ensure my tiny sample is representative of the population across 295,000,000 rows
ensure my v1 pipeline can actually scale sufficiently to “see” enough of the training data to develop generalizability
If I build a super powerful machine at predicting a universe of data it will never encounter in real life… well, I am not going to win the competition that way.
If I try to shove every row of data into a very complicated data preprocessing pipeline, and then try to train a monster model atop this wicked pipeline, my Kaggle notebook environment will run out of runtime (12 hours), CPU allocation and RAM.
Operating in a resource constrained environment is where progress is born.
Let’s review the data we have to work with.
We have 295M rows of molecular data in a format called SMILES (Simplified Molecular Input Line Entry System) — SMILES is a chemical notation that allows modelers to represent a chemical structure in a way that can be used by the computer.
This is our starting point.
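To make that concrete, here is a minimal sketch of reading a single SMILES string with RDKit, the cheminformatics toolkit we lean on throughout (the benzene molecule here is purely illustrative):

```python
from rdkit import Chem

# "c1ccccc1" is the SMILES notation for benzene: six aromatic carbons in a ring.
mol = Chem.MolFromSmiles("c1ccccc1")

print(mol.GetNumAtoms())      # 6 heavy atoms (hydrogens are implicit)
print(Chem.MolToSmiles(mol))  # canonical SMILES back out: c1ccccc1
```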
We don’t have to simply accept this training data and work with it as is. Instead we are going to transform it into various representations first. Then we’re going to build special-purpose models to focus on the different representation types before ensembling a prediction machine together.
We may even augment the data by creating synthetic entries (not real, but simulated to fit the shape and other characteristics of the real data) by rotating, filtering and otherwise manipulating our training data. This can take a small dataset and suddenly make it large enough to work well with deep learning techniques.
A handy trick, but perhaps one we won't need today because we have the opposite problem: too much data! That said, the challenge prompt mentioned that the data is skewed, with a key compound underrepresented. Later on (there are still 8 days until the competition deadline as of this writing) I may experiment with identifying these underrepresented elements and magnifying their presence in the training data to potentially enhance predictive accuracy and generalizability.
Here are the transformations I have planned for the SMILES data.
Molecular Graphs
Molecular graphs represent molecules as graphs where atoms are nodes and bonds are edges. To convert SMILES strings into molecular graphs, we first use RDKit to parse the SMILES and generate an RDKit molecule object.
Each atom in the molecule is represented as a node, and each bond between atoms is represented as an edge. Features for each atom include its atomic number, degree, formal charge, hybridization state, number of hydrogen atoms, and aromaticity. Similarly, bond features include bond type, whether the bond is conjugated, and whether it is part of a ring. The resulting graph can be described using an adjacency matrix, representing the connections between atoms, and feature matrices for nodes and edges. This representation is particularly useful for models like Graph Neural Networks (GNNs), which can learn complex interactions between atoms and capture the overall molecular structure.
If you read Part I you know we use a particular type of GNN called a GAT, or Graph Attention Network to learn the graph data.
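Here is a minimal sketch of the smiles_to_graph conversion described above, assuming RDKit and PyTorch (a trimmed-down version of the idea, not my exact competition code):

```python
import torch
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string into node features, edge indices and edge features."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES

    # Node features: atomic number, degree, formal charge, hybridization, H count, aromaticity.
    node_feats = [
        [atom.GetAtomicNum(),
         atom.GetDegree(),
         atom.GetFormalCharge(),
         int(atom.GetHybridization()),
         atom.GetTotalNumHs(),
         int(atom.GetIsAromatic())]
        for atom in mol.GetAtoms()
    ]

    # Edges (stored in both directions) with bond features: type, conjugation, ring membership.
    edge_index, edge_feats = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        feats = [bond.GetBondTypeAsDouble(), int(bond.GetIsConjugated()), int(bond.IsInRing())]
        edge_index += [[i, j], [j, i]]
        edge_feats += [feats, feats]

    return (torch.tensor(node_feats, dtype=torch.float),
            torch.tensor(edge_index, dtype=torch.long).t(),  # shape (2, num_edges)
            torch.tensor(edge_feats, dtype=torch.float))
```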
Molecular Fingerprints
Molecular fingerprints are fixed-length binary vectors that encode the presence or absence of specific substructures within a molecule. To generate fingerprints, tools like RDKit can be used to transform a molecule into a bit vector using algorithms such as Morgan (circular) fingerprints.
Each bit in the vector represents whether a particular substructure or chemical environment is present in the molecule. These fingerprints are compact yet powerful representations that capture key structural features of molecules. They are widely used in cheminformatics for tasks like similarity searching, clustering, and structure-activity relationship modeling. In machine learning, molecular fingerprints serve as input features for traditional algorithms like Random Forests, SVMs, or neural networks, enabling the models to leverage the encoded structural information for predicting properties or activities of molecules.
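A minimal sketch of a smiles_to_fingerprint-style helper using RDKit's Morgan fingerprints (radius 2 and 2048 bits are common defaults, not tuned values):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048):
    """Morgan (circular) fingerprint as a fixed-length binary vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.int8)  # 0/1 entries, one per hashed substructure

fp = smiles_to_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, for illustration
print(fp.shape, fp.sum())                            # (2048,) and the count of "on" bits
```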
Our weapon of choice here is the GRU, or Gated Recurrent Unit, which carries a memory state across the bit vector as it reads it. I may opt for an LSTM before submitting the final notebook for review.
Molecular Descriptors
Molecular descriptors are numerical values that describe various chemical properties of molecules. These properties include molecular weight, hydrophobicity (logP), topological polar surface area (TPSA), number of hydrogen bond donors and acceptors, and many others. RDKit and other cheminformatics tools can compute these descriptors directly from molecular structures. Descriptors provide a way to quantify chemical characteristics that can influence a molecule’s behavior and interactions, making them valuable features for predictive modeling. By capturing diverse aspects of molecular properties, descriptors help in building models that can predict biological activities, solubility, toxicity, and other relevant outcomes. They offer a complementary approach to structural representations, enabling a more comprehensive understanding of molecular behavior.
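A minimal sketch of pulling a few of those descriptors straight out of RDKit (smiles_to_descriptors is a hypothetical helper name, and these are just the handful of properties named above):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def smiles_to_descriptors(smiles: str):
    """Compute a small set of numerical descriptors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "mol_weight": Descriptors.MolWt(mol),           # molecular weight
        "logp": Descriptors.MolLogP(mol),               # hydrophobicity
        "tpsa": Descriptors.TPSA(mol),                  # topological polar surface area
        "h_donors": Descriptors.NumHDonors(mol),        # hydrogen bond donors
        "h_acceptors": Descriptors.NumHAcceptors(mol),  # hydrogen bond acceptors
    }

print(smiles_to_descriptors("CCO"))  # ethanol, for illustration
```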
Not sure if I will have enough time to work descriptors into the competition. Going to start with the GAT/GRU architecture, and if my resources are sufficient I will circle back and layer in an SVM trained on the molecular descriptors, or perhaps XGBoost.
The real magic doesn’t come from cramming a million features into the model. Selecting the right ones, engineering them to enhance their value and then feeding it all into a well-designed model is what makes a project successful.
There are several strategies for ensembling these models to make predictions.
Here are some common approaches.
1) Late Fusion (Decision-Level Fusion)
In this approach, each model makes independent predictions based on its respective inputs, and then the predictions are combined at the end, typically through simple averaging or a weighted average if you have reason to trust one model more than the other. More complex fusion techniques can also be used, such as training a secondary model (such as a logistic regression or a small neural network) to combine these predictions optimally.
Example: If the GAT predicts a binding probability of 0.50 and the GRU predicts 0.48, you might average these to get a final prediction of 0.49.
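In code, late fusion really is that small. A minimal sketch, where the weights are placeholders you would tune on validation data:

```python
import numpy as np

gat_preds = np.array([0.50, 0.12, 0.91])  # per-molecule binding probabilities from the GAT
gru_preds = np.array([0.48, 0.20, 0.85])  # per-molecule binding probabilities from the GRU

simple_avg = (gat_preds + gru_preds) / 2          # [0.49, 0.16, 0.88]
weighted_avg = 0.6 * gat_preds + 0.4 * gru_preds  # trust the GAT slightly more
```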
2) Feature-Level Fusion
Here, instead of combining predictions, you combine the features learned by each model before making a prediction. This could involve concatenating the final layer outputs of both models and feeding them into one or more additional layers of a neural network to produce a final prediction. This method leverages the different perspectives each model has learned about the data.
Example: The output of the GAT and the GRU could be concatenated and then passed through a dense layer (or layers) to predict the final output.
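A minimal PyTorch sketch of that fusion head, assuming each branch has already produced a fixed-size embedding (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate GAT and GRU embeddings, then predict a binding probability."""
    def __init__(self, gat_dim: int = 128, gru_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(gat_dim + gru_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, gat_emb: torch.Tensor, gru_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([gat_emb, gru_emb], dim=-1)  # feature-level fusion
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)

head = FusionHead()
probs = head(torch.randn(4, 128), torch.randn(4, 128))  # a batch of 4 molecules
print(probs.shape)  # torch.Size([4])
```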
3) Hybrid Models
This approach involves integrating the architectures themselves at an earlier stage than output-level fusion. For example, the output of the GAT could be used as an input to the GRU, assuming that there is a sequential aspect to the graph output that the GRU can model effectively. This is more complex and requires careful architectural planning to ensure the data remains meaningful across model boundaries.
Example: Use the node embeddings generated by the GAT as sequential inputs to the GRU, perhaps treating embeddings from specific paths or subgraphs as sequences.
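A minimal sketch of that wiring, with placeholder tensors standing in for real GAT node embeddings:

```python
import torch
import torch.nn as nn

# Placeholder GAT output: one molecule with 20 atoms, each embedded in 64 dimensions.
node_embeddings = torch.randn(1, 20, 64)  # (batch, atoms in traversal order, gat_dim)

# The GRU reads the node embeddings as a sequence and summarizes them in its final hidden state.
gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
_, hidden = gru(node_embeddings)          # hidden: (1, batch, 64)
molecule_embedding = hidden.squeeze(0)    # (batch, 64), ready for a prediction head
```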
4) Boosting or Bagging
Using ensemble methods like boosting or bagging can help improve the robustness of your predictions. For instance, you can train multiple instances of each model under slightly different configurations or subsamples of the data, then average their predictions.
Example: Train several GATs and GRUs on different subsets of the data (bagging) or by focusing on different aspects of the error (boosting), and then combine their outputs.
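A minimal bagging sketch, with random numbers standing in for the per-model predictions (in practice each array would come from a GAT or GRU trained on its own bootstrap sample):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_models = 1_000_000, 5

# Bagging: each model instance gets its own bootstrap sample of the training row indices.
bags = [rng.choice(n_rows, size=n_rows // 10, replace=True) for _ in range(n_models)]

# After training one model per bag, average their test-set probabilities.
test_preds = np.stack([rng.random(100) for _ in range(n_models)])  # placeholder predictions
bagged_pred = test_preds.mean(axis=0)
```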
5) Stacking
Stacking involves training a new model to combine the predictions of the GAT and GRU. The first-level models (GAT and GRU) output predictions that are used as inputs to a second-level model, which learns to optimally combine these predictions. This second-level model could be something simple like a linear regressor, or more complex like another neural network.
Example: Use the predictions of both the GAT and GRU as features for a secondary predictor that outputs the final prediction.
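A minimal stacking sketch with scikit-learn, where the two first-level prediction arrays are placeholders (in practice they would be out-of-fold predictions from the trained GAT and GRU):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
gat_oof = rng.random(1000)             # placeholder out-of-fold GAT predictions
gru_oof = rng.random(1000)             # placeholder out-of-fold GRU predictions
y_val = rng.integers(0, 2, size=1000)  # placeholder 0/1 'binds' labels

# Second-level model: learns how much to trust each first-level model.
X_meta = np.column_stack([gat_oof, gru_oof])
meta_model = LogisticRegression().fit(X_meta, y_val)

# At inference time, stack the two models' test predictions the same way.
final_probs = meta_model.predict_proba(X_meta)[:, 1]
```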
Choosing the Right Strategy
The best choice depends on your specific dataset, the nature of the SMILES representations and the complexity of the molecular interactions with the protein targets.
Late fusion is often the simplest to implement and can be surprisingly effective.
Feature-level fusion and hybrid models might yield better performance but require more intricate modifications to the architecture and more computational resources.
As mentioned, I am starting simple and will layer complexity as resources allow. Time most importantly.
The (Current) Plan
I wrote at length about the first (and second) plan in Part I.
Right now I’m up to the 9th or 10th plan… The current leader in the competition has over 300 submissions!
Problem Statement:
Given a large dataset of small molecules represented by SMILES strings, the goal is to predict whether these molecules will bind to specific protein targets. This is a binary classification problem with an imbalanced dataset, as binders are relatively rare.
Dataset Understanding:
Training Data: ~98 million examples per protein target (3 targets total)
Validation Data: 200K examples per protein target
Test Data: 360K molecules per protein target (includes novel building blocks)
Features: SMILES strings for building blocks and full molecules, protein target name
Target: Binary label 'binds' (1 for binding, 0 for not binding)
Evaluation Metric:
Average Precision (AP) calculated per protein target and split group, then averaged for the final score.
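A minimal sketch of computing that score with scikit-learn, assuming a dataframe of predictions; 'binds' is the label column named above, while 'protein_name', 'split_group' and 'pred' are my own placeholder column names:

```python
import pandas as pd
from sklearn.metrics import average_precision_score

def leaderboard_score(df: pd.DataFrame) -> float:
    """Average Precision per (protein target, split group), then averaged for the final score."""
    per_group = [
        average_precision_score(group["binds"], group["pred"])
        for _, group in df.groupby(["protein_name", "split_group"])
    ]
    return sum(per_group) / len(per_group)
```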
Current Plan:
Start with Data Preparation:
Sampling: Use DuckDB to efficiently sample a manageable subset of the training data for initial development and experimentation (see the sketch after this list)
SMILES Processing:
Convert SMILES strings to molecular graphs
Generate molecular fingerprints
Feature Engineering:
Layer in additional features derived from graphs (e.g., node degrees, graph invariants) and fingerprints (e.g., bit counts, similarity measures)
Encode protein target names using one-hot encoding or embeddings.
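Here is a minimal sketch of the DuckDB sampling step referenced above, assuming the training data sits in a local parquet file (the file name and sample rate are illustrative):

```python
import duckdb

con = duckdb.connect()

# Random 1% sample pulled straight from parquet, without loading all ~295M rows into RAM.
sample_df = con.execute("""
    SELECT *
    FROM read_parquet('train.parquet')
    USING SAMPLE 1 PERCENT (bernoulli)
""").df()

# Binders are rare, so also grab every positive row and combine the two frames in pandas.
binders_df = con.execute("""
    SELECT * FROM read_parquet('train.parquet') WHERE binds = 1
""").df()
```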
Design the Model Architecture:
Graph Representation: Use a Graph Attention Network to learn representations from molecular graphs. GATs can capture complex relationships between atoms and bonds.
Fingerprint Representation: Use a Gated Recurrent Unit network to learn representations from molecular fingerprints. GRUs can model sequential relationships within fingerprints.
Ensemble: Combine the outputs of the GAT and GRU models
Output Layer: A single neuron with a sigmoid activation for binary classification (a skeletal version of the full architecture is sketched below).
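Here is that skeleton in PyTorch, assuming PyTorch Geometric for the GAT branch; the layer sizes are placeholders rather than my tuned values, and the node feature dimension matches the six-feature smiles_to_graph sketch above:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GatGruEnsemble(nn.Module):
    """GAT branch for molecular graphs, GRU branch for fingerprints, fused into one sigmoid output."""
    def __init__(self, node_dim: int = 6, hidden: int = 64):
        super().__init__()
        # Graph branch: two attention layers, then mean-pool node embeddings per molecule.
        self.gat1 = GATConv(node_dim, hidden, heads=4, concat=True)
        self.gat2 = GATConv(hidden * 4, hidden, heads=1, concat=False)
        # Fingerprint branch: read the 2048-bit vector one bit at a time.
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # Fusion head: concatenate both embeddings, single sigmoid neuron for 'binds'.
        self.head = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, edge_index, batch, fingerprints):
        g = torch.relu(self.gat1(x, edge_index))
        g = torch.relu(self.gat2(g, edge_index))
        g = global_mean_pool(g, batch)               # one embedding per molecule
        _, h = self.gru(fingerprints.unsqueeze(-1))  # final hidden state: (1, batch, hidden)
        fused = torch.cat([g, h.squeeze(0)], dim=-1)
        return torch.sigmoid(self.head(fused)).squeeze(-1)
```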
There are heavy data imbalances I need to mitigate in data preprocessing and training. Going to “use my tools” to avoid overfitting on the small subset of the training data I will be computationally able to use.
Cross-validation, regularization, early stopping and the rest will be utilized.
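Two of those tools in a minimal sketch: up-weighting the rare positive class in the loss, and early stopping on validation Average Precision (the weight, patience and validate() helper are placeholders):

```python
import torch
import torch.nn as nn

# Up-weight the rare positive class; the 200.0 is a placeholder for (negatives / positives)
# in the training sample. Note: BCEWithLogitsLoss expects raw logits, not sigmoid outputs.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([200.0]))

def validate() -> float:
    """Placeholder: the real pipeline returns validation Average Precision here."""
    return 0.0

# Early stopping: quit once validation AP stops improving for `patience` epochs.
best_ap, patience, bad_epochs = 0.0, 3, 0
for epoch in range(50):
    # ... one training epoch over the sampled data would go here ...
    val_ap = validate()
    if val_ap > best_ap:
        best_ap, bad_epochs = val_ap, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```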
Optimization Path Going Forward
Exciting news! During my drafting of this piece I finally got a working prediction pipeline!
The bad news: my entry is ranked roughly 1600th out of 1900.
The good news: optimization has not yet begun!
Both the GAT and GRU are extremely simple. Going to add layers.
I also plan on bolting on more features to the smiles_to_graph and smiles_to_fingerprint functions.
Off to work.
Enjoy Part III.
Part III is published here… Part IV is after the competition concludes!
Wish me luck!
👋 Thank you for reading Life in the Singularity.
I started this in May 2023, and AI has only accelerated since. Our audience includes Wall St Analysts, VCs, Big Tech Engineers and Fortune 500 Executives.
To help us continue our growth, would you please Like, Comment and Share this?
Thank you again!!!