Life in the Singularity

The Open-Source AI That’s Beating Google and OpenAI at Science

Matt McDonagh
Aug 22, 2025

For years, the dream of artificial intelligence has been to do more than just write emails or generate images; it has been to accelerate human discovery and help solve our most profound scientific mysteries. Yet, the most powerful AI minds have remained proprietary, locked away in the corporate vaults of tech giants. Open-source models, the bedrock of democratic innovation, have been playing a frustrating game of catch-up, especially in the complex, high-stakes world of science. Until now.

In a landmark technical report, the Shanghai AI Laboratory has unveiled Intern-S1, a new breed of artificial intelligence that is not just closing the gap but is actively surpassing the world’s leading closed-source models in critical scientific domains. The core problem Intern-S1 solves is the innovation bottleneck in science. While general-purpose AIs have become adept at common tasks like coding and mathematics, they have struggled with the specialized, diverse, and often scarce data that defines fields like chemistry, materials science, and physics.

Intern-S1’s revolutionary approach is to be a “specialized generalist.” It was trained on a colossal 5 trillion tokens of data, with over half of that (a staggering 2.5 trillion tokens) meticulously curated from scientific fields. To achieve this, its creators built ingenious data pipelines that intelligently parse complex scientific papers and extract knowledge with unprecedented purity. Architecturally, it’s a Mixture-of-Experts (MoE) model, which functions like an internal team of specialists that it can call upon for specific tasks, making it incredibly efficient. Most impressively, it uses a novel “Dynamic Tokenizer” that learns to speak the native languages of science, like the SMILES notation for molecules, compressing that information 70% more efficiently than its predecessors.
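
To make that tokenizer idea concrete, here is a minimal Python sketch of domain-aware routing for SMILES strings. The regex, the function names, and the character-level baseline are my own illustrative assumptions, not details from the Intern-S1 report; a real dynamic tokenizer would learn these boundaries rather than hard-code them:

```python
import re

# Illustrative sketch only: a chemistry-aware pattern that keeps each
# atom, bond, or ring symbol as a single token, instead of letting a
# generic tokenizer shred the string character by character.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [Na+], [nH], [C@@H]
    r"|Cl|Br"              # two-letter organic elements
    r"|[BCNOPSFIbcnops]"   # single-letter atoms (lowercase = aromatic)
    r"|[=#\-+().\\/@\d%]"  # bonds, ring closures, branches, charges, dots
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Domain-aware pass: one token per chemically meaningful symbol."""
    return SMILES_TOKEN.findall(smiles)

def tokenize_chars(text: str) -> list[str]:
    """Worst-case baseline: one token per character."""
    return list(text)

for smi in ["[Na+].[Cl-]",        # sodium chloride
            "C[C@@H](N)C(=O)O",   # L-alanine
            "Clc1ccccc1"]:        # chlorobenzene
    domain = tokenize_smiles(smi)
    naive = tokenize_chars(smi)
    print(f"{smi}: {len(domain)} domain tokens vs {len(naive)} characters")
```

Bracket atoms like [Na+] collapse from five characters into one token, which is where this kind of compression comes from; the same principle would extend to protein sequences and the other notations of science.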

Readers of this Substack will know I worked with SMILES data a lot last summer:

Using Deep Learning and Python to Cure Diseases
Matt McDonagh · June 30, 2024
This is a follow-up piece; if you missed Part I, dig in here for the baseline.

Predicting Medicines Into Existence
Matt McDonagh · June 25, 2024
This is the first in a four-part series.

The results are stunning. While maintaining top-tier performance on general reasoning tasks, Intern-S1 consistently outperforms models like OpenAI’s o3 and Google’s Gemini 2.5 Pro on benchmarks ranging from predicting chemical reactions to analyzing earth science data. The single most significant contribution of this work is its resounding proof that the open-source community can lead the charge in creating AI for science. From my perspective, Intern-S1 is more than a model. It’s a blueprint for democratizing the tools of discovery and empowering a new generation of researchers to tackle humanity’s greatest challenges.

The Breakthrough in Context

For the better part of a decade, the landscape of cutting-edge AI has been defined by a stark divide. On one side are the closed-source titans, models like GPT-4, Claude 4, and Gemini, incredibly powerful generalists accessible primarily through paid APIs, their inner workings a closely guarded secret. On the other side is the vibrant, collaborative world of open-source AI, which has strived to replicate and democratize these powerful technologies.

In popular domains with abundant data, like natural language, programming, and mathematics, the open-source community has made phenomenal progress, narrowing the performance gap to a razor-thin margin.

The world of scientific research, however, remained a stubborn frontier. This wasn’t for lack of trying. The challenge is that science doesn’t just speak in English; it communicates through a dizzying array of modalities. A scientist needs to understand text, but also molecular structures, protein sequences, time-series signals from a particle accelerator, satellite imagery, and complex diagrams.

This data is not only diverse but also often “low-resource” compared to the near-infinite ocean of text on the internet. Consequently, open-source models trained on general web data lagged significantly, unable to master the nuanced languages of scientific inquiry.
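
To see why this diversity is so hard for a text-first model, consider what a single normalized training record might have to hold. The schema below is purely my own illustration of the problem, not the data format used by Intern-S1:

```python
from dataclasses import dataclass, field

# Hypothetical record spanning several scientific modalities at once.
# Field names and structure are illustrative assumptions, not the
# report's actual pipeline output.
@dataclass
class ScienceSample:
    text: str                                                  # prose from a paper
    smiles: list[str] = field(default_factory=list)            # molecular structures
    sequences: list[str] = field(default_factory=list)         # protein/DNA strings
    signals: list[list[float]] = field(default_factory=list)   # time-series readings
    figures: list[str] = field(default_factory=list)           # paths to images/diagrams

# Even one paragraph of chemistry can touch three modalities at once:
sample = ScienceSample(
    text="Aspirin irreversibly acetylates COX-1, as the assay trace shows.",
    smiles=["CC(=O)Oc1ccccc1C(=O)O"],
    signals=[[0.02, 0.11, 0.48, 0.91]],
)
print(sample.text, "|", len(sample.smiles), "structure(s)")
```

A web-scale text corpus offers almost nothing like this, which is why curating 2.5 trillion scientific tokens was the heavy lift here.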

The prevailing assumption was that simply scaling up general-purpose models would eventually grant them scientific expertise—an approach that proved inefficient and slow.

Intern-S1’s breakthrough comes from rejecting this one-size-fits-all philosophy. Its creators engineered a paradigm shift through a trifecta of innovations:

This post is for paid subscribers
