Predicting Medicine Into Existence - Part IV
I had so much fun with this competition.
The winner SHOCKED the world, rocketing from nearly 900th place to win the whole thing. With $50,000 up for grabs, that's a big shakeup.
The data we were all working with represented just 35% of the full set. Once the competition ended, our models were tested against the remaining data, and a private leaderboard revealed the true winners.
Here they are:
My score jumped up a bit too… but just 22 positions.
I ended up in 1,723rd place out of slightly more than 9,000 competitors.
While I didn’t win, I did learn and I had an incredible time:
just figuring out how to manipulate data that big
gaining enough understanding to convert SMILES to other representations
using AI tools to learn which molecular descriptors and other variables are relevant to the prediction problem
…and so many other lessons.
Like don’t overtrain the model to the data.
This team was in 2nd place, sometimes 1st, throughout the entire competition. At the very end they dropped 11 spots and out of the money.
They made 480 submissions. The person above them made a single entry.
My 18 submissions pale in comparison, but they did have a team of 3 Experts and a Master.
What A Ride
The game started when I realized I needed to transform the SMILES into other representations (fingerprints, graphs, descriptors… various visual and numerical expressions) in order to give machine learning enough “surface area” to learn useful patterns.
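Here's a rough sketch of what that transformation looks like with RDKit. The SMILES string and settings below are just for illustration, not my actual competition pipeline:

```python
# A minimal sketch of SMILES -> other representations, assuming RDKit is installed.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, standing in for a competition molecule
mol = Chem.MolFromSmiles(smiles)

# 1. Fingerprint: a fixed-length bit vector encoding local substructures
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# 2. Descriptors: scalar physicochemical properties
descriptors = {
    "mol_wt": Descriptors.MolWt(mol),
    "logp": Descriptors.MolLogP(mol),
}

# 3. Graph: atoms as nodes, bonds as edges (the shape a GNN wants)
nodes = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
```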
The game began to accelerate from there.
I tried:
different models
different combinations of models
different training methodologies
different methods for integrating the outputs of the models
Once I had a working baseline model the “game” became increasing my prediction accuracy.
That’s when I started to focus like a laser on feature selection and engineering.
Features Are Everything
“You are what you eat.” — this adage applies to machine learning models as well.
The data you feed the model dictates how it behaves… what it becomes.
You don't typically just drop raw data into a model and have it start digesting… data is first cleaned and then refined (or engineered) into features.
In ML, features are individual measurable properties or characteristics of a phenomenon being observed. They are the input variables used to train machine learning models and make predictions. Features are also referred to as "variables" or "attributes." They are what make the engine roar to life.
Consider a dataset about housing prices.
Features in this dataset could include the square footage of the house, the number of bedrooms and bathrooms, the age of the house, and the location. These features provide information that the machine learning model can use to learn the relationship between the features and the price of the house.
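A toy sketch of that idea, with made-up numbers and scikit-learn standing in for the model:

```python
# Hypothetical housing data: each row is one house's features, y is its price.
from sklearn.linear_model import LinearRegression

# Columns: [square_footage, bedrooms, bathrooms, age_years]
X = [
    [1500, 3, 2, 20],
    [2200, 4, 3, 5],
    [900, 2, 1, 40],
]
y = [250_000, 410_000, 150_000]  # sale prices (the target)

model = LinearRegression().fit(X, y)
print(model.predict([[1800, 3, 2, 10]]))  # predicted price for an unseen house
```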
Features vs. Data:
Features are the specific pieces of information extracted or derived from the raw data to be used as input for the machine learning model. They represent the aspects of the data that are considered relevant for the task at hand.
Data encompasses the entire collection of information, including raw, unprocessed observations. It can be in various formats, such as text, images, or numerical values. The data serves as the source from which features are extracted or engineered.
Think of data as the raw material, while features are the refined elements extracted from that raw material to make it usable for a specific purpose.
Let’s think about a real-life example that companies use every day.
If you have data about customers, including their age, gender, purchase history, and browsing behavior, you can extract features like:
Age: a numerical feature representing the customer's age
Gender: a categorical feature representing the customer's gender
Purchase frequency: a numerical feature derived from the purchase history, indicating how often the customer makes purchases
Product preferences: a categorical feature derived from the browsing behavior, indicating the types of products the customer is interested in
These features can then be used to train a machine learning model to predict customer churn, personalize product recommendations, or segment customers for targeted marketing campaigns.
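Here's a hypothetical sketch of that kind of feature engineering in pandas. The column names and values are invented for illustration:

```python
# Deriving a "purchase frequency" feature from raw purchase history.
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-01-20", "2024-06-15"]
    ),
})

# Months of observed history per customer (floored at one month)
span_months = (
    purchases.groupby("customer_id")["purchase_date"]
    .agg(lambda d: max((d.max() - d.min()).days / 30.0, 1.0))
)
counts = purchases.groupby("customer_id").size()

# The engineered feature: purchases per month
purchase_frequency = (counts / span_months).rename("purchases_per_month")
print(purchase_frequency)
```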
The prediction problem we are working on is infinitely more intricate.
There aren’t 2 or 3 moving pieces, there are thousands. Per molecule.
…and then each of these molecules operates in a network of nodes and edges with different interactions occurring (or not) based on the presence of nearby forces and masses. We had 295,000,000 rows of molecule interaction data to chew through.
Where is the signal in that noise?
Which Features Matter Here?
I am not remotely close to a domain expert on small molecule formation, protein binding affinity or anything biological.
Enter AI.
AI has been immensely helpful in educating me on the different areas of interest. I’ve been engaging with Gemini, GPT-4, NotebookLM and several other AI systems to gain a better fingertip feel for this space so I can flow more powerful features through the model development process.
My two best models used slightly different features, but the commonalities are below.
Let’s run through them — feel free to skim this next part.
Basic Physicochemical Properties
logP: Octanol-water partition coefficient, indicative of hydrophobicity. It's crucial for understanding how a molecule might interact with the hydrophobic parts of proteins.
hBondDonors and hBondAcceptors: These count hydrogen bond donors and acceptors in the molecule, respectively. Hydrogen bonding is fundamental for molecular recognition processes in biological systems.
tpsa (Topological Polar Surface Area): Reflects how many polar atoms are in a molecule, which affects solubility and permeability, and is important for interaction with polar sites on proteins.
Stereochemistry
NumChiralCenters: Chirality plays a significant role in drug activity as it can drastically affect the binding affinity and selectivity towards biological targets.
Double Bond Stereochemistry
NumDoubleBonds: Total number of double bonds, which can affect the rigidity or flexibility of the molecule.
NumCisDoubleBonds and NumTransDoubleBonds: The stereochemistry of double bonds can influence the shape of the molecule and therefore its biological activity and binding properties.
Ring Features
NumRings, NumAromaticRings, and NumAliphaticRings: Ring structures often play a key role in molecular stability and the ability of molecules to stack or fit into enzyme pockets or receptor sites.
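If you want to see how these translate into code, here's a rough RDKit sketch that computes the families above from a SMILES string (my actual feature sets differed slightly between models):

```python
# A rough sketch of the descriptor families above, assuming RDKit is installed.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, rdMolDescriptors

def featurize(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    return {
        # Basic physicochemical properties
        "logP": Descriptors.MolLogP(mol),
        "hBondDonors": Lipinski.NumHDonors(mol),
        "hBondAcceptors": Lipinski.NumHAcceptors(mol),
        "tpsa": Descriptors.TPSA(mol),
        # Stereochemistry
        "NumChiralCenters": len(Chem.FindMolChiralCenters(mol, includeUnassigned=True)),
        # Double bond stereochemistry
        "NumDoubleBonds": sum(
            1 for b in mol.GetBonds() if b.GetBondType() == Chem.BondType.DOUBLE
        ),
        "NumCisDoubleBonds": sum(
            1 for b in mol.GetBonds() if b.GetStereo() == Chem.BondStereo.STEREOZ
        ),
        "NumTransDoubleBonds": sum(
            1 for b in mol.GetBonds() if b.GetStereo() == Chem.BondStereo.STEREOE
        ),
        # Ring features
        "NumRings": rdMolDescriptors.CalcNumRings(mol),
        "NumAromaticRings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "NumAliphaticRings": rdMolDescriptors.CalcNumAliphaticRings(mol),
    }

print(featurize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a quick smoke test
```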
Time to Pay Attention Again
Biological Science stuff is over.
You made it!
Onto the Data Science.
The Last 24-Hours
Once I got prediction pipelines flowing I needed to work quickly.
As I mentioned in Part I, most folks had been working on this competition for 3 to 4 months… and I entered with less than 20 days to go. Speed was my only rule.
I got several architectures working with different combinations of model depths, hyperparameter configurations, and the big one: feature selection.
If I had weeks and weeks, I could submit multiple notebooks each day and use the scores to tell me which Features are most important.
When taken to extremes this strategy is called probing and it is frowned upon (sometimes it is against competition rules). In reality, it is always wise to use feedback from your environment to inform your path and tactics.
I ensembled a GAT and a GRU model together using a simple average (that’s not pictured) — pictured here are the architectures for the Graph Attention Network and Gated Recurrent Unit respectively.
If I had more time, I was going to try leaky_relu and maybe ELU next.
Earlier in the competition I went wild and tried a 6-layer-deep GAT, and it hurt performance. I think the activation function upgrades mentioned above might have helped, but I would have needed other tools (early stopping, regularization, etc.) to avoid the famed overfit.
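For the curious, early stopping looks something like this generic PyTorch sketch on a synthetic toy problem (not my actual training loop, just the mechanics):

```python
# Generic early-stopping sketch: stop when validation loss stops improving.
import torch
import torch.nn as nn

# Tiny synthetic regression task, purely to demonstrate the mechanics
X = torch.randn(256, 8)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(500):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val - 1e-5:
        best_val, bad_epochs = val_loss, 0
        # Keep a copy of the best weights seen so far
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation stopped improving: likely overfitting

model.load_state_dict(best_state)  # restore the best checkpoint
```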
I fed fingerprint data to the GRU and graph data to the GAT and then trained on the training data.
Once the models were trained it was a simple averaging of their predictions to yield the final outcome.
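In code, that final blend is about as simple as it sounds (placeholder arrays here, not real predictions):

```python
# Simple-average ensemble: each model produces one prediction per row,
# and the final output is their mean.
import numpy as np

gat_preds = np.array([0.12, 0.80, 0.33])  # placeholder predictions from the GAT
gru_preds = np.array([0.10, 0.74, 0.41])  # placeholder predictions from the GRU

final_preds = (gat_preds + gru_preds) / 2.0
print(final_preds)  # -> [0.11 0.77 0.37]
```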
What I’d Do Differently Next Time
Where to start?!
I’d get 7 to 10 dramatically different model architectures built and flowing predictions.
Then I’d set up a feature lab and experiment with swapping different feature sets through those models.
Then with about a month out I’d narrow down to the top-3 and get into the weeds. Data augmentation, hyperparameters, attention mechanisms, and of course even more feature engineering.
I loved every second of this challenge.
This was an incredible experience and I look forward to sharing updates on future competitions and endeavors in data science.
👋 Thank you for reading Life in the Singularity.
I started this in May 2023, and AI has only accelerated since. Our audience includes Wall St Analysts, VCs, Big Tech Engineers and Fortune 500 Executives.
To help us continue our growth, would you please Like, Comment and Share this?
Thank you again!!!