For the last three years, we’ve been sold a two-part dream: First, that Large Language Models would become the “brains” of a new generation of autonomous AI agents. Second, that these agents would revolutionize every industry by automating complex, multi-step tasks, from booking your travel (EVERY demo of agents shows this…) to running your marketing campaigns and, ultimately, to discovering new scientific breakthroughs.
This is a multi-trillion-dollar vision.
And as an investor and a builder, I’ve been all-in on it.
But if you’re like me, and you’ve actually tried to build one of these agents, you’ve hit the same brick wall. They’re stupid. They’re brittle. They work 80% of the time in a pristine demo, and the moment you change one button on your website or introduce a new rule, they collapse.
The dream has been stuck. The entire market has been stuck. And it’s been stuck because of a deep, fundamental flaw in how we’ve been trying to teach them. We’ve been trapped in a false choice between two bad options.
Option 1: The “Era of Human Data,” or what I call the SFT Trap. This is Supervised Fine-Tuning (SFT), or Imitation Learning. This is how most agents are built today.
Think of it like training a junior employee by giving them a 1,000-page Standard Operating Procedure. You spend months recording an “expert” (you) doing a task perfectly, step-by-step. “First, you click this button. Second, you fill this box. Third, you click this other button.” The agent just memorizes your SOP.
The problem? It’s not learning; it’s mimicking. The moment it sees a state you didn’t include in the manual, it breaks. This approach is hideously expensive, impossible to scale, and fundamentally brittle.
You can’t write an SOP for the entire internet.
You certainly can’t write an SOP for life.
Option 2: The “Era of Experience,” or the RL Mirage. This is Reinforcement Learning, the “sink or swim” method. This is how AlphaGo mastered Go.
Forget the SOP. You just throw the agent into the environment and say, “I’ll give you a ‘reward’ when you complete the task.” The agent tries millions of random things, fails almost every time, but eventually stumbles on a sequence of actions that gets it a point. Over time, it learns to maximize its score.
The problem? Most of the real world doesn’t have a scoreboard.
If an agent is navigating a website, what’s the reward?
If it’s using three different tools to compile a research report, what’s its “score” at step 2?
Many real-world tasks are “reward-free” or the reward is so delayed (“you’ll get the bonus at the end of the year”) that it’s “inefficient and unstable” to learn from.
So we’ve been stuck. We’re either hand-crafting brittle mimics (SFT) or trying to apply game-winning strategies (RL) to a world that doesn’t play by game rules. This bottleneck has locked up the entire agent market.
Then in October 2025 a team from Meta Superintelligence Labs and FAIR dropped a paper that just changed the entire game: “Agent Learning via Early Experience.”
This isn’t just another academic paper.
This is the blueprint for actually building agents that learn.
It’s the key that unlocks the SFT Trap.
The Unlock: Learning from Curiosity, Not a Scoreboard
The Meta team did what great engineers (and politicians) do: they challenged the premise.
They looked at the “SFT vs. RL” trap and asked, “Why not a third way? A middle ground?”
What if an agent could learn from its own experience, without needing a reward?
They call this new paradigm Early Experience. The core idea is simple. Instead of just showing the agent the expert’s perfect path, you let the agent be curious. You let it ask, “What if I do this step instead?”
Let’s go back to our new employee.
SFT was the 1,000-page SOP.
RL was the end-of-year bonus.
Early Experience is letting the new hire sit in a sandbox environment for a day. They see the expert action (“Click ‘Submit’”), but they also get to try their own ideas.
“What happens if I click ‘Cancel’ instead?” The system shows them the “Changes Discarded” pop-up. “What happens if I enter an invalid date in this form?” The system shows them the red “Invalid Date” error message.
No “reward” is given. No “score” is lost. The agent simply observes the consequence of its own action. It learns a fundamental, reward-free lesson: cause and effect.
This is the holy grail. It’s Scalable (like RL, because the agent generates its own data) and Reward-Free (like SFT, because it doesn’t need a score). It gets the best of both worlds. We engineers and builders LOVE that.
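For builders who want to see the shape of this, here is a minimal sketch of the data-collection loop as I understand it. The environment’s `simulate` call, the policy’s `propose_actions` method, and every other name below are my own illustrative stand-ins, not the paper’s code; the point is simply that each record is a (state, action, next state) triple with no reward field anywhere.

```python
# Hedged sketch of "early experience" collection: branch off each expert state,
# let the agent try its own ideas, and record only what the environment does next.
# `env.simulate` and `policy.propose_actions` are assumed interfaces, not real APIs.
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # what the agent saw (e.g. a rendered web page or tool output)
    action: str      # what it tried (the expert's action, or one of its own)
    next_state: str  # what happened next -- the only supervision we keep

def collect_early_experience(env, policy, expert_trajectory, k_alternatives=3):
    """expert_trajectory is a list of (state, expert_action) pairs."""
    dataset = []
    for state, expert_action in expert_trajectory:
        # Keep the expert transition (this is all that plain SFT ever sees).
        dataset.append(Transition(state, expert_action, env.simulate(state, expert_action)))
        # Curiosity: try K alternative actions from the very same state.
        for alt_action in policy.propose_actions(state, k=k_alternatives):
            dataset.append(Transition(state, alt_action, env.simulate(state, alt_action)))
    return dataset
```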
But this is where the real genius of the paper comes in. It’s not just that they collect this data. It’s how they use it. They propose two new training strategies that, as a builder, I see as the foundation for the next generation of AI tools.
1. Implicit World Modeling (IWM): Building a “Gut Feel” for the World
The first method is about building a “gut feel” for the physics of the digital world.
The agent is trained to be a fortune-teller. For every action it could take, it has to predict the future state. “Given this webpage, and I click ‘Buy Now,’ what will the next page look like?” “Given this state, and I run this ‘Search’ command, what will the tool output be?”
This forces the model to internalize the environment’s dynamics. It’s no longer just mimicking you; it’s learning how the machine works. It learns that clicking “Submit” leads to a “Thank You” page, and clicking “Cancel” leads to a “Warning” pop-up.
This is how you cure brittleness. The agent can now handle a brand-new website because even if the buttons look different, the physics are the same.
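A rough sketch of how that could be wired up, building on the `Transition` records from the collection sketch above: each one becomes an ordinary language-model fine-tuning example whose target is the next observation. The prompt wording here is my assumption, not the paper’s template.

```python
# Hedged sketch of Implicit World Modeling as plain next-token fine-tuning:
# predict the next observation from (state, action). Prompt format is illustrative.
def build_iwm_examples(transitions):
    examples = []
    for t in transitions:
        prompt = (
            "Current observation:\n" + t.state + "\n\n"
            "Action taken:\n" + t.action + "\n\n"
            "Predict the next observation:"
        )
        examples.append({"prompt": prompt, "target": t.next_state})
    return examples
```

The objective is the same next-token loss used for SFT; the difference is that the supervision now comes from the environment’s response rather than from a human demonstration.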
2. Self-Reflection (SR): Teaching an Agent to Reason
This second method is pure gold. As an investor, this is the one that gets me out of my chair.
In this setup, the agent tries its own (often suboptimal) action and sees the (bad) result. Then, the system shows it the expert action and its (good) result. But it doesn’t stop there.
It asks the agent to generate a natural language explanation (a “reflection”) for why the expert action was better than its own.
The paper gives a killer example from a shopping task.
The Goal: Find a shirt under $20.
Agent’s Bad Idea: “Click the $30 red shirt.”
Expert’s Good Idea: “Click the $15 blue shirt.”
The agent is then forced to generate this “Self-Reflection”:
“While the red shirt matches the color preference, it exceeds the $20 budget constraint... The blue shirt satisfies both the style requirement and budget limit.”
Read that again. The agent isn’t just learning a move. It’s learning the principle. It’s learning about constraints. It’s learning how to reason. It’s building the “why” behind the “what.”
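Here is a hedged sketch of how those reflections could be manufactured at scale, again reusing the `Transition` records from earlier. `llm_generate` is a placeholder for whatever model writes the reflection, and the prompts are my own wording, not the paper’s.

```python
# Hedged sketch of Self-Reflection data construction: contrast the agent's own
# transition with the expert's from the same state, have a model explain the gap,
# and train on (reflection + expert action). All names and prompts are illustrative.
def build_sr_example(state, agent_t, expert_t, llm_generate):
    reflection_prompt = (
        "Task state:\n" + state + "\n\n"
        "Your action: " + agent_t.action + "\n"
        "What happened: " + agent_t.next_state + "\n\n"
        "Expert action: " + expert_t.action + "\n"
        "What happened: " + expert_t.next_state + "\n\n"
        "In a few sentences, explain why the expert action was the better choice."
    )
    reflection = llm_generate(reflection_prompt)
    # The training target pairs the reasoning with the correct action, so the
    # model learns the "why" alongside the "what".
    return {
        "prompt": "Task state:\n" + state + "\nThink step by step, then act.",
        "target": reflection + "\nAction: " + expert_t.action,
    }
```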
And the results? They speak for themselves.
Across eight diverse environments, from embodied navigation (ALFWorld) to scientific simulation (ScienceWorld) and complex web tasks (WebShop, WebArena), this Early Experience paradigm consistently improves performance over the old SFT baseline.
We’re not talking about rounding errors. On the WebShop benchmark, Implicit World Modeling improved the success rate by +18.4% over the baseline. On TravelPlanner, Self-Reflection jumped the score by +15.0%. It’s a step-change in capability.
This isn’t just an incremental improvement. It’s a new, unlocked class of performance.
The ‘So What?’: An Asymmetric Opportunity
So, why is this more than just a clever paper? Why is this an asymmetric, 100x-style opportunity?
Because the entire market has been building on the wrong foundation. We’ve been trying to scale SFT, a model built on expensive, brittle, human-generated data. This paper just handed us the keys to a new foundation—one built on cheaper, richer, agent-generated data.
This is a fundamental platform shift. We are moving from being data creators to being environment designers.
Here are the three “alpha” insights from this paper that every investor and builder needs to understand right now.
1. The Cost of Building a World-Class Agent Just Plummeted
This is the most immediate, earth-shattering takeaway. The paper ran an experiment: “What if we train our ‘Early Experience’ model on less expert data?”
The results are staggering. On the WebShop benchmark, the new methods trained on only 1/8th (12.5%) of the expert data still outperformed the old SFT model trained on the full 100% of the data.
Let that sink in. This isn’t a small saving. This slashes the cost of entry for building a competitive agent by an order of magnitude. The massive, expensive “data moat” that only hyperscalers could afford just got drained. This discovery opens the market to everyone.
2. The “Extinction Event” for Data Labeling
What does this kill? It kills the low-margin, “data-labeling-as-a-service” industry.
Any company whose entire value proposition is “we have 50,000 humans in a loop labeling web trajectories for SFT” is now on the clock. Their data isn’t just a commodity; it’s a cost center that this new paradigm systematically routes around. Why pay a human to create one “expert” path when the agent can create ten “learning experiences” for a fraction of the cost?
3. The New Ceiling for Performance (The “Bridge to RL”)
This is the most powerful insight for us builders. The authors position this not as an alternative to RL, but as a practical bridge to it.
They ran one final, brilliant experiment. They took three models:
The dumb SFT model.
The IWM-trained model.
The SR-trained model.
And they then trained all three with Reinforcement Learning. The results, shown in Figure 3 of the paper, are a neon sign pointing to the future.
The SFT model (the old way) saw a small bump and then flatlined. It hit a low ceiling.
But the “Early Experience” models? They started at a baseline that was already higher than the SFT model’s peak... and then they kept learning. The gap between them and the SFT model grew.
This proves that Early Experience isn’t just a better starting point. It’s a fundamentally better foundation. It creates agents that are better at learning from future experience.
The Roadmap: What We Build Next
The authors are humble. In their “Limitations” section (which I always read as an “Opportunities” section), they note that their work “focus[es] on short-horizon traces” and that “extending them to address long-horizon credit assignment... remains an open challenge.”
That’s not a failure. That’s our entire product roadmap. They’ve given us the first build. It’s our job to build the next three.
Based on this paper alone, here are the first three companies I’m looking to fund or build.
If you are working on this, or something like it, reach out to me.
Company 1: The “Agent-Gym” Platform (The Picks & Shovels)
This is the obvious “picks and shovels” play. The paper’s methods (IWM and SR) are begging to be productized. We build the MLOps platform that makes “Early Experience” a one-click process.
A developer comes to us with their small “expert” SFT dataset. Our platform ingests it, spins up the environment, and automatically generates the rollout dataset. It runs the K alternative actions, collects the resulting states, generates the IWM predictions and the SR reflections, and bundles it all into a new, 10x-better training set. We sell this as a data-augmentation and model pre-training service. We are the “Synthetic Data” company for the agent-native world.
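To make that concrete, here is what the core of such a service might look like, composing the helpers sketched earlier in this post. This is hypothetical product plumbing under my own assumptions, not anything the paper ships.

```python
# Hedged sketch of the "Agent-Gym" pipeline: expert trajectories in,
# IWM + SR training sets out. Reuses collect_early_experience,
# build_iwm_examples, and build_sr_example from the sketches above.
def augment_expert_dataset(env, policy, expert_trajectories, llm_generate, k=3):
    iwm_examples, sr_examples = [], []
    for trajectory in expert_trajectories:   # each: list of (state, expert_action)
        transitions = collect_early_experience(env, policy, trajectory, k_alternatives=k)
        iwm_examples.extend(build_iwm_examples(transitions))
        for state, expert_action in trajectory:
            expert_t = next(t for t in transitions
                            if t.state == state and t.action == expert_action)
            for alt_t in (t for t in transitions
                          if t.state == state and t.action != expert_action):
                sr_examples.append(build_sr_example(state, alt_t, expert_t, llm_generate))
    return iwm_examples, sr_examples
```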
Company 2: The “Un-Runnable” Process Automator (The Killer App)
We take this new agent and aim it straight at the messiest, highest-value, most complex enterprise workflows. I’m not talking about “book a flight.” I’m talking about “process this 50-page mortgage application, check it against 10 internal databases and 5 external APIs, and flag it for regulatory review.”
These are tasks no SFT model can handle because the edge cases are infinite, and no RL model can handle because there is no reward—only “correct” or “catastrophic legal failure.”
This “Early Experience” agent is the only thing that can solve this. It can learn the system dynamics (IWM) and the regulatory reasoning (SR) to finally automate this work. This is a $100B+ market in finance, legal, and healthcare alone.
Company 3: “Curiosity” as a Service (The Next Frontier)
This is the 10-year, 100x vision. The paper’s “Self-Reflection” is a baby step. It reflects on one bad action. What if we build the real thing?
We build an agent that can reflect on long-horizon sequences. An agent that doesn’t just learn from mistakes we give it, but one that generates its own hypotheses about the world. An agent that asks, “I wonder what would happen if...” and then designs the cheapest, most informative experiment to find out.
This isn’t just an agent. This is a synthetic scientist. This is the foundation for a continual learning system. The paper gave us the first building block: turning future states into supervision.
We take that primitive and build a true, autonomous, curious learner.
The Starting Gun
This paper is the starting gun.
For years, the AI agent revolution has been held hostage by the SFT Trap. We’ve been stuck in an era of brittle mimics, hand-fed by an army of human labelers.
That era is now over.
We are moving to an era of curious learners. An era where agents generate their own experience, learn the physics of our digital worlds, and reason about their own mistakes.
The bottleneck isn’t just broken; the whole dam is about to burst. The platform shift is here.
Don’t watch from the sidelines.
It’s time to build.
Friends: in addition to the 17% discount for becoming annual paid members, we are excited to announce an additional 10% discount when paying with Bitcoin. Reach out to me; these discounts stack on top of each other!
Thank you for helping us accelerate Life in the Singularity by sharing.
I started Life in the Singularity in May 2023 to track all the accelerating changes in AI/ML, robotics, quantum computing and the rest of the technologies accelerating humanity forward into the future. I’m an investor in over a dozen technology companies and I needed a canvas to unfold and examine all the acceleration and breakthroughs across science and technology.
Our brilliant audience includes engineers and executives, incredible technologists, tons of investors, Fortune-500 board members and thousands of people who want to use technology to maximize the utility in their lives.
To help us continue our growth, would you please engage with this post and share us far and wide?! 🙏




The implicit world modeling approach is genuinely brilliant because it captures cause and effect without needing immediate rewards. Meta basically figured out how to teach agents through curiosity rather than carrots and sticks. This could finally make agents robust enough for real production environments where edge cases are infinite.