You have to start in the dark.
Forget the press releases and the slick demos. Picture a thing, an algorithm, born blind and stupid in a world it doesn't understand. It thrashes around. It tries something, it fails. It tries something else, it gets a tiny reward. A single cookie. Most of what it does is useless, idiotic. This isn’t elegant. It’s a drunken crawl through a landscape of broken glass.
This is Reinforcement Learning, or RL, as the even less cool kids call it. And you have to understand this ugly, brutal process to understand why it might be the only path to the thing everyone in Silicon Valley is chasing: Artificial General Intelligence.
For decades, the high priests of AI (the academics, the lab-coat-wearing fragilistas) thought the path to intelligence was top-down.
They believed if you could just write enough clever rules, if you could build a perfect, Platonic model of the world, the machine would understand. They were trying to teach a bird to fly by giving it a physics textbook. It’s the classic error of the Intellectual Yet Idiot: mistaking the map for the territory. The real world doesn’t follow a clean set of rules. The real world is a chaotic, unpredictable mess, a generator of Black Swans.
It punishes elegant theories.
RL is different. It’s the only paradigm with skin in the game. It doesn’t get a prize for a beautiful equation. It lives or dies, succeeds or fails, based on its actions in the world. It is rewarded for empirical discovery, not for theoretical purity. It’s a Roman engineer, not a Greek philosopher. The engineer builds a bridge he knows he might have to stand under. He has a direct, painful, and immediate connection to the consequences of his actions. That’s RL. An agent is dropped into an environment—a game, a simulation, a real-world robot—and its only mandate is to maximize its reward. It learns, through relentless, mind-numbing trial and error, a policy—a strategy for acting in a way that gets it more cookies.
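To make that loop concrete, here is a minimal sketch of the trial-and-error process: a toy corridor environment and a tabular Q-learning agent. The environment, reward scheme, and hyperparameters are invented purely for illustration, not taken from any particular system.

```python
import random

# Toy environment: the agent walks a 5-cell corridor and only gets a
# reward (a single cookie) when it reaches the rightmost cell.
class Corridor:
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

# Tabular Q-learning: the "policy" is just a table of action values,
# improved by nothing but repeated trial and error.
env = Corridor()
q = [[0.0, 0.0] for _ in range(env.length)]
alpha, gamma, epsilon = 0.1, 0.99, 0.2  # illustrative hyperparameters

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        # Mostly exploit what worked before; sometimes thrash around at random.
        if random.random() < epsilon:
            action = random.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: q[state][a])
        next_state, reward, done = env.step(action)
        # Nudge the value estimate toward reward + discounted future value.
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

print(q)  # after training, "go right" dominates in every cell
```

Nothing in that loop knows what a corridor is. The agent only ever sees a state, picks an action, and collects (or fails to collect) a reward.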
But for the longest time, RL agents were suckers. They were fragile. They could master a game of Pong, sure, but change the color of the paddle? Move the screen two pixels to the left? The whole thing would collapse. The agent hadn’t learned “how to play Pong.” It had just memorized a brittle sequence of pixel patterns that led to a reward. It was the turkey, fed for a thousand days, building an unshakeable statistical model that every day would be followed by another feeding, right up until the day before Thanksgiving. Its understanding of the world was catastrophically narrow. It had no memory, no sense of context. It was all action, no plot.
This was the wall. RL had the right philosophy: learning by doing, skin in the game. But it lacked the cognitive machinery to make sense of a complex world. Its agents were trapped in the eternal present, unable to connect cause and effect over long stretches of time.
Then came the outsiders. The guys tinkering with language. They weren't even trying to solve RL's problem, much like the guys who invented the microchip weren't trying to build an iPhone. They were just trying to get a machine to translate French to English. And they stumbled upon a trick, a new architecture they called a Transformer.
The name is deceptively simple. What it really does is pay attention.
Before the Transformer, an algorithm would read a sentence one word at a time, trying to keep a running tally of meaning in its head, which would invariably get muddled. It was like trying to understand a story by hearing it through a keyhole. The Transformer, with its "self-attention mechanism," could look at the entire sentence at once. It could see that the word "it" at the end of the sentence referred to the word "animal" at the beginning. It created context. It found the relationships that mattered, ignoring the noise. It could see the whole story, not just the words.
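If you want the trick stripped to its bones, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The dimensions and random weight matrices are placeholders for illustration; real Transformers stack many such heads and layers.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of
    shape (seq_len, d_model). Every position looks at every other position
    at once, instead of reading the sentence one word at a time."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sentence
    return weights @ v                               # context-weighted mixture of values

# Toy usage: 4 "tokens", model width 8, random projection matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```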
Now, take that machine (the context engine) and bolt it onto the drunken, brawling RL agent. Suddenly, the game changes. The agent isn't just seeing a meaningless screen of pixels anymore. The Transformer can take the agent’s entire history, the last thousand things it saw, the last thousand things it did, the rewards it got…and treat it like a sentence.
It can pay attention to the parts that matter.
It can now learn things like, "That key I picked up 500 steps ago is the reason this door is opening now." The brittle link between cause and effect is forged. The agent is no longer a goldfish. It has a memory, a history. It can build a robust, internal model of its world not from some textbook, but from its own lived experience.
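One way to picture this, a rough sketch in the spirit of sequence-modeling approaches such as the Decision Transformer: flatten the agent's history into a single token stream that a Transformer can attend over. The Step fields and helper below are hypothetical, purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: int
    action: int
    reward: float

def trajectory_to_sequence(trajectory):
    """Interleave states, actions, and rewards into one long token stream.
    A self-attention model trained on such streams can link "door opened"
    back to "picked up key" hundreds of steps earlier."""
    tokens = []
    for step in trajectory:
        tokens += [("state", step.state), ("action", step.action), ("reward", step.reward)]
    return tokens

# Toy history: two steps of an episode, treated like a sentence.
history = [Step(state=3, action=1, reward=0.0), Step(state=4, action=0, reward=1.0)]
print(trajectory_to_sequence(history))
```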
This combination is potent because it allows the RL agent to become antifragile. By processing vast contexts, the Transformer allows the agent to generalize from its experience, to understand the grammar of its reality, not just the vocabulary. When it encounters something new, say a Black Swan event within its little world, it doesn't collapse like the turkey. It has a better chance of adapting because it understands the underlying structure.
The path to AGI won't be paved with elegant, top-down logic. That’s the province of the fragilistas who think the world is a neat and tidy place. No, AGI will likely be bootstrapped from the bottom up, by an agent forced to learn under penalty. It will require the relentless, trial-and-error ethos of RL.
The skin in the game.
And it will require the powerful contextual understanding of the Transformer to turn that raw experience into wisdom. It's the marriage of a street fighter and a historian, and it’s the most asymmetric bet in the history of technology. The experts are still busy drawing their maps, but the territory is about to change forever.
Science keeps coming to the rescue, constantly elevating RL with the latest techniques and technology.
The Reasoning Revolution: How Reinforcement Pre-Training is Teaching AI to Think
The rapid advancement of artificial intelligence, particularly in the domain of Large Language Models, has been one of the most transformative technological narratives of the 21st century.
At the heart of this revolution lies a simple training objective: Next-Token Prediction (NTP).
For years, the dominant paradigm has been to train models on vast swathes of internet text, teaching them to do one thing with extraordinary proficiency: predict the next word or "token" in a sequence. This self-supervised approach has proven remarkably scalable, allowing models to absorb statistical patterns, facts, and linguistic structures from trillions of words.
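Concretely, the objective is plain cross-entropy on the next token. A minimal NumPy sketch, with toy shapes and a made-up vocabulary, looks like this:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Next-Token Prediction objective: average cross-entropy between the
    model's predicted distribution at each position and the token that
    actually came next.
    logits: (seq_len, vocab_size) scores; targets: (seq_len,) true next-token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: a 3-position context over a 5-word vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])   # the tokens that actually followed
print(next_token_loss(logits, targets))
```

Minimizing this loss over trillions of tokens is the entire game; everything the model "knows" is a side effect of getting better at it.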
However, this foundational method carries an implicit challenge. By optimizing solely for prediction, models can become masters of correlation without necessarily developing a deep, robust capacity for causation or reasoning. They learn what words are likely to follow others, but the objective does not explicitly compel them to understand why.
This has led to a critical research question in the field: How can we move beyond mere pattern matching and imbue models with genuine reasoning capabilities from the very beginning of their training, rather than attempting to bolt them on as an afterthought?
A groundbreaking research paper from Microsoft Research, Peking University, and Tsinghua University introduces a compelling answer: Reinforcement Pre-Training (RPT). Let’s break it down.