Life in the Singularity

The Open-Source AI That’s Beating Google and OpenAI at Science

Matt McDonagh
Aug 22, 2025

For years, the dream of artificial intelligence has been to do more than just write emails or generate images; it has been to accelerate human discovery and help solve our most profound scientific mysteries. Yet, the most powerful AI minds have remained proprietary, locked away in the corporate vaults of tech giants. Open-source models, the bedrock of democratic innovation, have been playing a frustrating game of catch-up, especially in the complex, high-stakes world of science. Until now.

In a landmark technical report, the Shanghai AI Laboratory has unveiled Intern-S1, a new breed of artificial intelligence that is not just closing the gap but is actively surpassing the world’s leading closed-source models in critical scientific domains. The core problem Intern-S1 solves is the innovation bottleneck in science. While general-purpose AIs have become adept at common tasks like coding and mathematics, they have struggled with the specialized, diverse, and often scarce data that defines fields like chemistry, materials science, and physics.

Intern-S1’s revolutionary approach is to be a “specialized generalist.” It was trained on a colossal 5 trillion tokens of data, with over half of that (a staggering 2.5 trillion tokens) meticulously curated from scientific fields. To achieve this, its creators built ingenious data pipelines that intelligently parse complex scientific papers and extract knowledge with unprecedented purity. Architecturally, it’s a Mixture-of-Experts (MoE) model, which functions like an internal team of specialists that it can call upon for specific tasks, making it incredibly efficient. Most impressively, it uses a novel “Dynamic Tokenizer” that learns to speak the native languages of science, like the SMILES notation for molecules, compressing that information 70% more efficiently than its predecessors.
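
To make that tokenizer idea concrete, here is a minimal Python sketch of domain-aware routing for SMILES strings. The regex, the function names, and the character-level baseline are my own illustrative assumptions, not details from the Intern-S1 report; a real dynamic tokenizer would learn these boundaries rather than hard-code them:

```python
import re

# Illustrative sketch only: a chemistry-aware pattern that keeps each
# atom, bond, or ring symbol as a single token, instead of letting a
# generic tokenizer shred the string character by character.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [Na+], [nH], [C@@H]
    r"|Cl|Br"              # two-letter organic elements
    r"|[BCNOPSFIbcnops]"   # single-letter atoms (lowercase = aromatic)
    r"|[=#\-+().\\/@\d%]"  # bonds, ring closures, branches, charges, dots
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Domain-aware pass: one token per chemically meaningful symbol."""
    return SMILES_TOKEN.findall(smiles)

def tokenize_chars(text: str) -> list[str]:
    """Worst-case baseline: one token per character."""
    return list(text)

for smi in ["[Na+].[Cl-]",        # sodium chloride
            "C[C@@H](N)C(=O)O",   # L-alanine
            "Clc1ccccc1"]:        # chlorobenzene
    domain = tokenize_smiles(smi)
    naive = tokenize_chars(smi)
    print(f"{smi}: {len(domain)} domain tokens vs {len(naive)} characters")
```

Bracket atoms like [Na+] collapse from five characters into one token, which is where this kind of compression comes from; the same principle would extend to protein sequences and the other notations of science.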

Readers of this Substack will know I worked with SMILES data a lot last summer:

Using Deep Learning and Python to Cure Diseases
Matt McDonagh · June 30, 2024
This is a follow-up piece; if you missed Part I, dig in here for the baseline.

Predicting Medicines Into Existence
Matt McDonagh · June 25, 2024
This is the first in a four-part series.

The results are stunning. While maintaining top-tier performance on general reasoning tasks, Intern-S1 consistently outperforms models like OpenAI’s o3 and Google’s Gemini 2.5 Pro on benchmarks ranging from predicting chemical reactions to analyzing earth science data. The single most significant contribution of this work is its resounding proof that the open-source community can lead the charge in creating AI for science. From my perspective, Intern-S1 is more than a model. It’s a blueprint for democratizing the tools of discovery and empowering a new generation of researchers to tackle humanity’s greatest challenges.

The Breakthrough in Context

For the better part of a decade, the landscape of cutting-edge AI has been defined by a stark divide. On one side are the closed-source titans, models like GPT-4, Claude 4, and Gemini, incredibly powerful generalists accessible primarily through paid APIs, their inner workings a closely guarded secret. On the other side is the vibrant, collaborative world of open-source AI, which has strived to replicate and democratize these powerful technologies.

In popular domains with abundant data, like natural language, programming, and mathematics, the open-source community has made phenomenal progress, narrowing the performance gap to a razor-thin margin.

The world of scientific research, however, remained a stubborn frontier. This wasn’t for lack of trying. The challenge is that science doesn’t just speak in English; it communicates through a dizzying array of modalities. A scientist needs to understand text, but also molecular structures, protein sequences, time-series signals from a particle accelerator, satellite imagery, and complex diagrams.

This data is not only diverse but also often “low-resource” compared to the near-infinite ocean of text on the internet. Consequently, open-source models trained on general web data lagged significantly, unable to master the nuanced languages of scientific inquiry.
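
To see why this diversity is so hard for a text-first model, consider what a single normalized training record might have to hold. The schema below is purely my own illustration of the problem, not the data format used by Intern-S1:

```python
from dataclasses import dataclass, field

# Hypothetical record spanning several scientific modalities at once.
# Field names and structure are illustrative assumptions, not the
# report's actual pipeline output.
@dataclass
class ScienceSample:
    text: str                                                  # prose from a paper
    smiles: list[str] = field(default_factory=list)            # molecular structures
    sequences: list[str] = field(default_factory=list)         # protein/DNA strings
    signals: list[list[float]] = field(default_factory=list)   # time-series readings
    figures: list[str] = field(default_factory=list)           # paths to images/diagrams

# Even one paragraph of chemistry can touch three modalities at once:
sample = ScienceSample(
    text="Aspirin irreversibly acetylates COX-1, as the assay trace shows.",
    smiles=["CC(=O)Oc1ccccc1C(=O)O"],
    signals=[[0.02, 0.11, 0.48, 0.91]],
)
print(sample.text, "|", len(sample.smiles), "structure(s)")
```

A web-scale text corpus offers almost nothing like this, which is why curating 2.5 trillion scientific tokens was the heavy lift here.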

The prevailing assumption was that simply scaling up general-purpose models would eventually grant them scientific expertise—an approach that proved inefficient and slow.

Intern-S1’s breakthrough comes from rejecting this one-size-fits-all philosophy. Its creators engineered a paradigm shift through a trifecta of innovations:

This post is for paid subscribers
