Predicting Medicines Into Existence
This is the first in a four-part series.
Part II - Using Deep Learning and Python to Cure Diseases
Part III - Predicting Medicine into Existence
Part IV - Predicting Medicine into Existence (Part IV)
In previous articles I’ve discussed using AI to discover protein structures by folding them in computer simulations.
Now it’s time to show the power of AI to elevate the human condition.
We are going hands-on with a special edition of Life in the Singularity — last weekend I joined an international competition with this goal: use the power of machine learning to predict small molecule-protein interactions using the Big Encoded Library for Chemical Assessment (BELKA).
We’re using real data to predict potential medicines!
I’m a solo competitor who joined the 4-month competition with 18 days remaining.
I like a challenge, but this is a bit much even for me.
Speaking of, nearly 8,500 other gladiators have entered the arena. Some compete solo; most have teamed up. These folks include CS professors with hundreds of research papers to their credit and teams from Google, Microsoft and several other billion-dollar enterprises.
Importantly, there have already been 24,000 submissions. The other competitors are getting more “at-bats” and more feedback, and they are using that feedback to improve their:
data preprocessing
feature engineering
hyperparameter tuning
learning rate optimization
regularization
…and everything else related to deploying ML
The head start is valuable because there is a lot of ground to cover!
Small molecule drugs are fascinating chemicals that interact with cellular protein machinery, altering its functions in various ways. These drugs aim to inhibit the activity of specific protein targets believed to be involved in disease processes. The work of identifying these candidate molecules is both challenging and intriguing.
The classic approach involves physically synthesizing these molecules one by one. Researchers then expose them to the protein target of interest, meticulously testing to see if they interact. This method is laborious and consumes a significant amount of time and resources.
The US Food and Drug Administration has approved approximately 2,000 novel molecular entities throughout its history. The number sounds big until you realize it pales in comparison to the vastness of the potential pharmaceutical chemical space, which is estimated to contain around 10^60 possible compounds. To put this into perspective, this number is unimaginably large—it's a 1 followed by 60 zeros.
Consider this analogy to understand how vast a space we are talking about: if each possible drug-like molecule were the size of a grain of sand, representing the total number of potential molecules would require a volume of sand equivalent to more than 1 billion Earths.
One. Billion. Earths.
The overwhelming enormity of this chemical universe presents a staggering challenge. It is simply too vast to be explored through physical experimentation alone.
Society needs a solution because hidden within this immense chemical space are likely effective treatments for various human ailments. The potential to discover new, life-changing drugs is enormous, but the current methods of searching this space are inadequate. The difficulty lies in the sheer scale—searching through 10^60 compounds is like finding a microscopic needle in a cosmic haystack the size of the solar system.
How The Competition Works
To evaluate potential search methods in small molecule chemistry, competition host Leash Biosciences physically tested roughly 133 million small molecules for their ability to interact with one of three protein targets. This dataset, the Big Encoded Library for Chemical Assessment (BELKA), provides an excellent opportunity to develop predictive models that may advance drug discovery.
Datasets of this size are rare and very valuable, so pharma companies keep them internal. It’s incredible to work with data of this size and value.
The current best-curated public dataset of this kind is BindingDB, which contains about 2.8 million binding measurements, far smaller than BELKA.
This competition aims to revolutionize small molecule binding prediction by harnessing ML techniques. From the competition page:
Recent advances in ML approaches suggest it might be possible to search chemical space by inference using well-trained computational models rather than running laboratory experiments. Similar progress in other fields suggest using ML to search across vast spaces could be a generalizable approach applicable to many domains. We hope that by providing BELKA we will democratize aspects of computational drug discovery and assist the community in finding new lifesaving medicines.
Here, you’ll build predictive models to estimate the binding affinity of unknown chemical compounds to specified protein targets.
Your work will contribute to advances in small molecule chemistry used to accelerate drug discovery.
Incredible that we can use ML to elevate the human condition so directly.
The Plan
Since there are less than two weeks until I need to have my machine built (I need time left over to optimize), I must navigate this process quickly.
You always start by asking yourself: what is the objective of this model?
Then you go into the data:
Where does the data come from?
What volume, velocity, value, variability, veracity & variety can we expect?
Is this data clean or will it require preprocessing?
What do the exploratory data analysis results look like?
this drives what data you will want to augment…
what sort of models you will want to build…
which features you will engineer…
With these pieces you can begin to shape your plan.
My first plan was over-engineered to the Nth degree. Feast your eyes:
Seriously consider just how much faith I must have in my abilities to suggest this pipeline and architecture…
Yeah, let’s call it faith.
This was my original plan:
transform the SMILES data into 4 different representations
preprocess each representation
model each independently with different architectures
CNNs for the molecular fingerprints & fragments
TabNet for the molecular descriptors
MPNN (Message Passing Neural Network) for the molecular graph
overlay a cross-attention layer so the models inform each other’s progress
train a meta-model leveraging those cross-attended outputs
generate the predictions and submit them
I got tired just typing that, never mind a single person actually getting each of these components up, running, and talking to each other before the competition ends.
No, to get anything done in time I needed a much simpler layout.
Instead of building a castle on the first go I am going to build a cottage (a small working model) and then upgrade it on nights and weekends until the competition.
Back to the Drawing Board
Rather than having four models we’re going to simplify down to just two.
This isn’t just about reducing the number of models; we’re also going to dramatically scale back the complexity of the model architecture, using a Graph Attention Network alongside a Gated Recurrent Unit, which is a form of recurrent neural network.
Once we get predictions flowing out of the pipeline then I’ll start climbing the complexity and complication curves again in search of performance.
Basic Project Outline
Environment Setup (in Kaggle):
Ensure we have the necessary libraries installed:
RDKit: For SMILES manipulation, molecular graph generation, and fingerprint calculation.
Deep Learning Frameworks: TensorFlow and Keras
Graph Neural Network Libraries: PyTorch Geometric
Data Preparation:
Load the provided SMILES data into Pandas DataFrames (a data-loading and fingerprint sketch follows this outline).
Molecular Graph Generation (GAT Input):
Use RDKit to convert SMILES strings into molecular graphs.
Molecular Fingerprint Calculation (GRU Input):
Use RDKit to generate molecular fingerprints
Split data into training, validation, and test sets.
Baseline Model Development:
GAT Model:
Start with a simple Graph Attention Network (GAT).
Use the molecular graphs as input and predict the binding affinity.
GRU Model:
Begin with a basic Gated Recurrent Unit (GRU)
Input the molecular fingerprints and predict the binding affinity.
Model Training and Evaluation:
Train both models separately using the training set.
Evaluate performance on the validation set using average precision.
Monitor loss curves and other relevant metrics.
Model Combination:
Experiment with different methods for combining GAT and GRU predictions:
Simple averaging of predictions.
Weighted averaging based on model confidence.
Train a meta-model to combine predictions.
Evaluate the combined model's performance on the validation set.
Hyperparameter Optimization:
Fine-tune hyperparameters for both the GAT and GRU models (e.g., learning rate, number of layers, activation functions).
Explore different architectures for both models (e.g., deeper GATs, an LSTM to replace the GRU, etc...).
(Time Allowing Step) - Create Competitors:
Create 2nd model with experimental hyperparameters
Create 3rd model with novel augmented data
molecular fingerprints, molecular graphs, molecular descriptors
Create 4th model with 3 or more neural networks working together
Final Prediction and Submission:
Train the best model on the full training data.
Make predictions on the test set and format the submission file.
Submit the predictions to Kaggle.
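Before moving on, here is the data-loading and fingerprint sketch referenced in the outline above. It is a minimal sketch, not the final pipeline: the file path and column names (`train.parquet`, `molecule_smiles`, `binds`) are assumptions standing in for the actual competition files, and the 500K-row sample is only there to keep things comfortably in memory.

```python
# Minimal data-prep sketch. Assumptions: the file path and the column names
# "molecule_smiles" / "binds" are placeholders for the competition files.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split

# Load a manageable slice of the training data (the full file is far too
# large to hold in memory all at once).
df = pd.read_parquet("train.parquet", columns=["molecule_smiles", "binds"])
df = df.sample(n=500_000, random_state=42)

def smiles_to_fp(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Convert a SMILES string into a Morgan (ECFP-like) fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return np.zeros(n_bits, dtype=np.float32)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

# Fingerprint matrix (GRU input) and binary binding labels.
X = np.stack(df["molecule_smiles"].map(smiles_to_fp).to_numpy())
y = df["binds"].to_numpy(dtype=np.float32)

# Stratified split keeps the (heavily imbalanced) binder ratio consistent
# between training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)
```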
This plan is much more manageable. As you can see, I’ve allocated time to build alternative models and explore different configurations, and I hope I get to use it!
There are several layers of optimization we could engineer to increase our accuracy once I start generating predictions successfully, such as:
Graph Augmentation: Augment the molecular graph data (e.g., node/edge perturbations, subgraph sampling).
Attention Mechanisms: Incorporate attention mechanisms in the GAT to learn which parts of the molecule are most relevant for binding affinity.
Data Preprocessing: Experiment with different feature scaling or normalization techniques for the fingerprint data.
That’s when the cottage turns into a castle.
Of course, we aren’t really building cottages or castles. We’re building a machine that will accelerate the discovery of drugs, and spread healing at faster rates to more people.
We’re building a miracle machine.
Building A Miracle Machine
System Design Summary
The system is designed to predict molecular binding affinity, a critical step in drug discovery.
This new, simplified pipeline leverages a graph neural network and a recurrent neural network to harness both structural and chemical information from the input molecules. Specifically, we’re using specialized variants: a Graph Attention Network (GAT) and a Gated Recurrent Unit (GRU).
We’ll explore what extra power they provide in a moment.
First let’s walk through the simplified pipeline to generate predictions.
Data Preprocessing
SMILES Input: The pipeline begins with Simplified Molecular Input Line Entry System (SMILES) strings, a textual representation of chemical structures.
Dual Representation: SMILES strings are transformed into two distinct representations:
Molecular Graphs: Representing atoms as nodes and bonds as edges, capturing the molecule's structural topology.
Molecular Fingerprints: Fixed-size numerical vectors encoding chemical properties and substructures.
Padding/Masking: Graph and fingerprint representations are standardized to ensure consistent input sizes for the machine learning models.
Feature Engineering: Additional atom features (hybridization, valence, etc.) or pre-trained chemical embeddings can be incorporated to enrich the input data. A minimal graph-construction sketch follows this section.
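Here is the graph-construction sketch mentioned above: a hedged example of converting a SMILES string into a PyTorch Geometric `Data` object with a handful of simple atom features (atomic number, valence, hybridization, and so on). The exact feature set is illustrative, not a recommendation.

```python
# Sketch: SMILES -> molecular graph for the GAT (PyTorch Geometric).
# The atom features chosen here are examples only.
import torch
from rdkit import Chem
from torch_geometric.data import Data

def atom_features(atom: Chem.Atom) -> list:
    """A small, illustrative feature vector for one atom (graph node)."""
    return [
        atom.GetAtomicNum(),
        atom.GetTotalValence(),
        atom.GetDegree(),
        int(atom.GetHybridization()),
        int(atom.GetIsAromatic()),
        atom.GetFormalCharge(),
    ]

def smiles_to_graph(smiles: str, label: float) -> Data:
    """Build a PyG Data object: atoms become nodes, bonds become edges.
    Assumes the SMILES string is valid."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([atom_features(a) for a in mol.GetAtoms()],
                     dtype=torch.float)
    # Bonds are undirected, so add both directions for message passing.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    if edges:
        edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    else:
        edge_index = torch.empty((2, 0), dtype=torch.long)
    return Data(x=x, edge_index=edge_index,
                y=torch.tensor([label], dtype=torch.float))
```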
Model Architecture
Graph Attention Network (GAT): This model operates on the molecular graph representation. It utilizes attention mechanisms to weigh the importance of neighboring atoms when learning node embeddings. Multiple GAT layers can be stacked to capture complex structural patterns. As mentioned, I’m starting simple and will add complexity later; a sketch of the combined model follows this section.
Imagine you're trying to understand a social network. You want to know not just who's connected to whom, but also how strong those connections are. That's where Graph Attention Networks come in.
GATs are like little detectives that look at all the connections (or "edges") in a graph and decide which ones are the most important. They do this by assigning each edge a score, or "attention weight." The higher the weight, the more important the connection.
This is super useful when dealing with complex data like social networks, molecules, or even traffic patterns, because it helps us figure out which connections really matter.
Gated Recurrent Unit (GRU): Now imagine you're reading a story. As you read each sentence, you're building up an understanding of what's happening. You're also remembering key details from earlier in the story that are important for understanding what's happening now.
That's similar to how Gated Recurrent Units work. They're a type of artificial neural network that's really good at processing sequences of data, like sentences in a story or frames in a video.
GRUs have a special "memory" that allows them to remember important information from earlier in the sequence. They also have "gates" that control how much of that information they keep, and how much they let new information in.
This makes them great for tasks like language translation, speech recognition, and other forms of pattern recognition.
This model processes the molecular fingerprint, a time-series-like representation of chemical features. GRUs are adept at capturing sequential dependencies, making them well-suited for analyzing fingerprints.
Combined Model: The outputs of the GAT and GRU models are combined to create a unified representation. This can be achieved through concatenation, attention mechanisms, or transformer architectures.
Output Layer: The final layer is a dense layer with a sigmoid activation function to produce a probability prediction of whether the molecule will bind to a target protein.
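Below is a minimal sketch of the combined architecture described in this section, written with PyTorch and PyTorch Geometric: a two-layer GAT over the molecular graph, a GRU that reads the fingerprint as a sequence of fixed-size chunks, and a sigmoid head over the concatenated embeddings. The layer sizes, number of attention heads, and the chunked-fingerprint scheme are assumptions for illustration, not the final design.

```python
# Hedged sketch of the combined GAT + GRU binding predictor.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class BindingPredictor(nn.Module):
    def __init__(self, node_feat_dim=6, fp_chunk=64, hidden=64):
        super().__init__()
        # GAT branch: learns structural embeddings from the molecular graph.
        self.gat1 = GATConv(node_feat_dim, hidden, heads=4, concat=True)
        self.gat2 = GATConv(hidden * 4, hidden, heads=1, concat=False)
        # GRU branch: reads the fingerprint as a sequence of fp_chunk-bit chunks.
        self.fp_chunk = fp_chunk
        self.gru = nn.GRU(fp_chunk, hidden, batch_first=True)
        # Combined head: concatenated embeddings -> binding probability.
        self.head = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, edge_index, batch, fingerprints):
        # Structural embedding: one vector per molecule via mean pooling.
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        graph_emb = global_mean_pool(h, batch)
        # Chemical embedding: run the GRU over the chunked fingerprint.
        seq = fingerprints.view(fingerprints.size(0), -1, self.fp_chunk)
        _, last_hidden = self.gru(seq)
        fp_emb = last_hidden.squeeze(0)
        # Fuse both embeddings and predict the binding probability.
        return self.head(torch.cat([graph_emb, fp_emb], dim=1)).squeeze(-1)
```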
Training
Loss Function: Think of training a machine learning model like teaching a student. When the student gets an answer wrong, you tell them the correct answer and how far off they were. The "how far off" part is like the loss function.
In machine learning, a loss function is a way to measure how well (or poorly) a model is doing at making predictions. It calculates the difference between the model's predicted output and the actual, correct output. The smaller the difference, the better the model is performing.
Different tasks need different kinds of loss functions. If you're predicting house prices, you might use a loss function that penalizes big errors far more heavily than small ones. If you're classifying images of cats and dogs, you'd use one that rewards assigning high probability to the correct class. Our task is binary (a molecule either binds or it doesn't) and the output layer is a sigmoid, so binary cross-entropy is the natural fit; a minimal training-loop sketch follows this section.
Metrics: Average precision (the competition metric) and area under the receiver operating characteristic curve (AUC-ROC) are tracked to assess model performance.
Hyperparameter Tuning: Optimization techniques (random search, Bayesian optimization) help find the best model configurations. Imagine you're baking a cake. You have a recipe, but it doesn't tell you exactly how long to bake it, or at what temperature. These are like hyperparameters in machine learning – settings that control how the model learns.
Hyperparameter tuning is like experimenting to find the perfect baking time and temperature. You try different combinations and see which one produces the best cake (or in this case, the best model performance).
There are different ways to do hyperparameter tuning. You can try random combinations, or you can use more systematic approaches like grid search or Bayesian optimization. The goal is to find the hyperparameters that give you the best performance on your specific task.
Resource Optimization: Techniques like subsampling, feature selection, smaller models, and regularization are employed to fit within the 30GB RAM constraint.
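Here is the training-loop sketch mentioned above, tying the loss function and the competition metric together. It is a sketch only: it assumes the `BindingPredictor` from the previous example plus hypothetical `train_loader` and `val_loader` objects that yield batched graphs with a `fingerprints` attribute attached; none of these names come from the competition code.

```python
# Minimal training-loop sketch: binary cross-entropy loss, average precision
# (the competition metric) tracked on the validation set each epoch.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

model = BindingPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()  # binary cross-entropy: "how far off" each prediction is

for epoch in range(10):
    model.train()
    for batch in train_loader:  # assumed loader of batched graphs + fingerprints
        optimizer.zero_grad()
        preds = model(batch.x, batch.edge_index, batch.batch, batch.fingerprints)
        loss = criterion(preds, batch.y)
        loss.backward()      # backpropagation
        optimizer.step()     # parameter update

    # Validation: compute average precision over the whole validation set.
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            preds = model(batch.x, batch.edge_index, batch.batch, batch.fingerprints)
            all_preds.append(preds)
            all_labels.append(batch.y)
    ap = average_precision_score(torch.cat(all_labels).numpy(),
                                 torch.cat(all_preds).numpy())
    print(f"epoch {epoch}: loss={loss.item():.4f}  val AP={ap:.4f}")
```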
Pipeline Flow — How We Go From Input Data to Predictions:
SMILES Input: Read SMILES strings for molecules from competition data
Dual Representation: Convert SMILES into molecular graphs and fingerprints.
Padding/Masking: Standardize input sizes.
GAT Model: Process graph representations to learn structural embeddings.
GRU Model: Process fingerprint representations to learn chemical embeddings.
Combine Embeddings: Merge GAT and GRU outputs for a unified representation.
Output Layer: Predict binding probability.
Loss Calculation: Compare predictions to ground truth binding labels.
Backpropagation: Update model parameters based on the loss.
Repeat: Iterate through steps 1-9 for multiple epochs until model converges.
What I like about this system is the dual representation: we are leveraging both structural (graph) and chemical (fingerprint) information to build our prediction machine.
The GAT prioritizes the relevant parts of the molecule's structure, the GRU brings critical sequential data into the prediction, and techniques like subsampling, batch generation and model size reduction make the system operable in resource-constrained environments.
That said, there are dozens of different things we can augment, adjust, amplify (and remove) to achieve higher accuracy.
Here’s the plan to improve our Miracle Machine.
Optimization Roadmap
I need to get more of the “signal” into the training data without capturing features that don’t matter. My focus at first is getting a working pipeline of predictions flowing from this machine, and then layering more and more refinement into the data preprocessing phases of the journey.
Once the data preprocessing is close to perfect, my focus will switch to optimizing the training itself, through hyperparameters and design decisions.
Here’s a quick rundown:
Prioritize Data Efficiency
Subsampling: Since our dataset doesn't fit comfortably in memory, we can work faster with a representative subset. We will employ techniques like stratified sampling to ensure a balanced representation of the different classes (a subsampling sketch follows this list).
Feature Selection: Working to carefully select the most informative atom features and fingerprint types. This will reduce memory usage without sacrificing too much predictive power.
Efficient Data Structures: Experimenting with memory-efficient data structures (e.g., sparse matrices for adjacency matrices, compressed arrays for fingerprints) to minimize memory footprint.
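Here is the subsampling sketch referenced above: a hedged example of drawing the same fraction from each class so the rare positive binders aren't washed out, plus a note on sparse storage for the fingerprints. The `binds` column name and the 5% fraction are assumptions for illustration.

```python
# Hedged sketch of stratified subsampling: sample each class independently
# so the rare positive rows keep their share of the subset.
import pandas as pd
from scipy import sparse

def stratified_subsample(df: pd.DataFrame, label_col: str, frac: float,
                         seed: int = 42) -> pd.DataFrame:
    """Sample `frac` of every class, then shuffle the combined result."""
    parts = [grp.sample(frac=frac, random_state=seed)
             for _, grp in df.groupby(label_col)]
    return pd.concat(parts).sample(frac=1.0, random_state=seed)

# Example: keep 5% of the loaded rows while preserving the binder ratio.
# small_df = stratified_subsample(df, label_col="binds", frac=0.05)

# Fingerprint vectors are mostly zeros, so a compressed sparse row matrix
# holds the same information in a fraction of the memory.
# X_sparse = sparse.csr_matrix(X)
```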
Enhance Model Architecture
Smaller Models: Opt for shallower GAT and GRU networks with fewer parameters. Going to also try using smaller embedding sizes or reducing the number of attention heads in the GAT layers.
Regularization: Techniques like dropout and L2 regularization help prevent overfitting, which is crucial when working with limited data.
Incremental Training: Consider training the model in stages (e.g., train the GAT first, then the GRU, then the combined model) to reduce memory usage during training.
Improve Training Strategies
Batch Size: Experiment with smaller batch sizes so each training step fits comfortably in memory.
Gradient Accumulation: If even smaller batches don't fit, try gradient accumulation. This involves accumulating gradients over multiple smaller batches before updating the model's parameters.
Mixed Precision Training: Explore mixed precision training to reduce memory usage (by using 16-bit floating point numbers for some operations) while potentially speeding up training. A sketch combining gradient accumulation and mixed precision follows this list.
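Here is the sketch combining gradient accumulation and mixed precision mentioned above, using PyTorch's `torch.cuda.amp`. One assumption: the model's final sigmoid is dropped so it outputs raw logits, because `BCEWithLogitsLoss` is the autocast-safe way to compute binary cross-entropy under float16. The `model`, `optimizer`, and `train_loader` names carry over from the earlier sketches.

```python
# Hedged sketch: gradient accumulation + mixed precision training.
import torch
import torch.nn as nn

# Assumes the model now outputs raw logits (final Sigmoid removed);
# BCEWithLogitsLoss fuses sigmoid + binary cross-entropy safely under autocast.
criterion = nn.BCEWithLogitsLoss()
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # effective batch size = loader batch size * accum_steps
optimizer.zero_grad()

for step, batch in enumerate(train_loader):  # loader assumed from earlier sketches
    with torch.cuda.amp.autocast():          # forward pass in float16 where safe
        logits = model(batch.x, batch.edge_index, batch.batch, batch.fingerprints)
        loss = criterion(logits, batch.y) / accum_steps  # average over the window
    scaler.scale(loss).backward()            # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)               # unscale gradients and update weights
        scaler.update()
        optimizer.zero_grad()
```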
That is the plan — now it’s time for me to stop typing and get building!
Going to touch grass and write a follow-up piece next weekend.
Competition wraps up on July 7th.
Look out for Part II next week!
Part II is now published; read it here:
👋 Thank you for reading Life in the Singularity.
I started this in May 2023 and AI has only accelerated since. Our audience includes Wall St Analysts, VCs, Big Tech Engineers and Fortune 500 Executives.
To help us continue our growth, would you please Like, Comment and Share this?
Thank you again!!!