o3: A Leap Forward in AI Reasoning
OpenAI has once again pushed the boundaries of artificial intelligence with the preview of its latest model, o3. This successor to the groundbreaking o1 model marks a major leap forward in AI reasoning capabilities, promising to reshape not only AI research but also various industries in the near future. Imagine o3 as a powerful new engine, capable of tackling complex problems with an efficiency and sophistication previously unseen in AI.
A Year of Consolidation and a Sudden Turbocharge
2024 was largely perceived as a year of consolidation in the AI field. Numerous players achieved models with capabilities on par with GPT-4, focusing on practical applications and refining existing technologies. This period was like fine-tuning the existing engine, making it more reliable and efficient. However, the unveiling of o3 has disrupted this trend, signaling a renewed focus on rapid advancements in AI reasoning. It's akin to suddenly introducing a turbocharger to the engine, significantly boosting its power and performance.
While some remain skeptical about the practical applications of reasoning models like o1 outside of specialized domains such as mathematics, coding, and physics, o3's impressive performance suggests a broader potential. Think of these specialized domains as test tracks where the engine's raw power can be measured and optimized. The limited time and public access to reinforcement learning training methods may have hindered the exploration of these models' full capabilities. It's like having a powerful engine but lacking the skilled drivers and open roads to truly push it to its limits.
OpenAI's o3 indicates a shift in the AI landscape, moving beyond the limitations of pretraining solely on internet text.
This new model demonstrates remarkable progress in reasoning evaluations, setting new benchmarks in several key areas. Here's a quick overview; then we'll dive into each.
ARC AGI Prize: o3 is the first model to surpass the 85% accuracy threshold of the ARC AGI prize on the public set, though it exceeded the competition's cost constraints. This is akin to the engine finally passing a rigorous emissions test, but only by burning far more fuel than the rules allow.
Frontier Math Benchmark: o3 achieved a step change in state-of-the-art performance on the Frontier Math benchmark, improving from a mere 2% to an impressive 25%.
Coding Benchmarks: o3 demonstrated substantial improvements on leading coding benchmarks, including SWE-Bench-Verified, achieving a state-of-the-art score of 71.7%.
Codeforces Competition: o3 achieved an impressive rating of 2727 on Codeforces, placing it at the International Grandmaster level and among the top 200 human competitive programmers globally. This is comparable to the engine outperforming seasoned race car drivers in open competition: real-world results.
These advancements, achieved just three months after the initial announcement of o1, are poised to accelerate AI research significantly. As inference costs decrease, o3 and its successors will likely reshape various software engineering roles.
Unpacking o3's Capabilities - Understanding the Engine's Design
OpenAI's o3 was announced on the final day of their "12 Days of OpenAI" event, accompanied by remarkable performance results surpassing previous state-of-the-art models.
The Frontier Math benchmark, lauded by Fields Medalists Terence Tao and Timothy Gowers for its challenging nature, provides a demanding measure of o3's reasoning prowess. o3's achievement of 25% accuracy on this benchmark underscores its significant advancement in mathematical reasoning.
The Abstraction and Reasoning Corpus (ARC) is an AI evaluation designed to measure a human-like form of general fluid intelligence.
Introduced in 2019 by François Chollet, ARC aims to assess AI systems' ability to generalize and solve abstract reasoning tasks. The ARC AGI Prize, a $1 million challenge for the first solution to a privately held-out set of ARC tasks, has seen remarkable progress with the advent of o1-class models.
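To make the format concrete, here is a toy ARC-style task, invented purely for illustration and not drawn from the actual dataset: a few input/output grid demonstrations, plus a candidate transformation rule that is accepted only if it reproduces every demonstration.

```python
# A toy ARC-style task: each training example pairs an input grid with
# an output grid, and the solver must infer the transformation rule
# from the demonstrations alone. (Invented data, not from real ARC.)
train = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
]
test_input = [[0, 0], [1, 1]]

def candidate_rule(grid):
    """Hypothesized rule: flip the grid vertically."""
    return grid[::-1]

# A candidate rule is accepted only if it matches every demonstration.
assert all(candidate_rule(ex["input"]) == ex["output"] for ex in train)
print(candidate_rule(test_input))  # -> [[1, 1], [0, 0]]
```

The point of the format is that two demonstrations are enough for a human to spot the rule; measuring whether a machine can do the same is what makes ARC a test of fluid intelligence rather than memorization.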
OpenAI's GPT models have shown a steady increase in accuracy on ARC:
GPT-2 (2019): 0%
GPT-3 (2020): 0%
GPT-4 (2023): 2%
GPT-4o (2024): 5%
o1-preview (2024): 21%
o1 high (2024): 32%
o1 Pro (2024): ~50%
o3 tuned low (2024): 76%
o3 tuned high (2024): 87%
While o3 did not claim the prize due to exceeding the cost threshold, its achievement of 87% accuracy on the ARC AGI public eval set marks a significant milestone in AI reasoning.
o3's Architecture, Cost, and Training - Inside the Engine
The ARC AGI team collaborated with OpenAI to obtain price estimates for o3, revealing insights into the model's architecture and training. While the final pricing for API access remains to be determined, the estimated costs provide valuable clues.
Speculation arose regarding o3's architecture, with some suggesting the incorporation of tree search mechanisms. However, evidence suggests that o3 leverages the same architecture and training methodology as o1, albeit at a larger scale. The high inference costs observed in the ARC AGI evaluation can likely be attributed to the use of self-consistency methods and aggressive parallelization. This is akin to the engine employing a more sophisticated fuel injection system and a higher number of cylinders to achieve greater power and efficiency.
The base model for o3 remains uncertain, with possibilities including Orion, OpenAI's rumored GPT-5, or a new model benefiting from Orion's training data. The reported API prices from the ARC prize align with a potential 2x to 5x increase in the base model's scale.
o3's “spirit” appears to be tree-based natural language program search. o3 traverses the space of possible Chains of Thought (CoTs) describing the steps required to solve the task. It’s possible o3 is combining neural networks with symbolic AI approaches to bridge the gap between pattern recognition and logical reasoning. By integrating LLMs with symbolic reasoning and program synthesis techniques o3 appears to be enhancing logical reasoning.
I believe o3 works like this:
problem decomposition
generate CoT trees
search and evaluate them
select and refine
generate solution
Here’s the step-by-step as I see it.
Problem Decomposition: The model first analyzes the given problem and breaks it down into smaller subproblems. This decomposition could be guided by:
Keywords and phrases: Identifying action verbs, entities, and relationships.
Prior knowledge: Leveraging existing knowledge about problem-solving strategies or similar problems.
CoT Tree Generation: The model generates a tree structure where:
Nodes: Represent intermediate steps or subproblems in the reasoning process, expressed in natural language.
Edges: Indicate the logical flow between steps.
Search and Evaluation: The model explores the CoT tree, searching for paths (chains of thought) that lead to a solution. This could involve:
Heuristic search: Prioritizing paths based on factors like estimated likelihood of success, simplicity, or consistency with prior knowledge.
Simulation: Using the model itself to "execute" the steps in a CoT and predict the outcome.
Backtracking: If a path seems unpromising, the model backtracks and explores alternative branches.
Refinement and Selection: The model refines the most promising CoTs, potentially by:
Adding more detailed steps.
Clarifying ambiguous language.
Verifying steps against external knowledge.
Solution Generation: Once a satisfactory CoT is found, the model uses it to generate the final answer.
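The pipeline above can be sketched as a best-first search over partial chains of thought. Everything below is illustrative: `propose_steps` and `score` are toy deterministic stand-ins for model calls (step generation and a learned value model respectively), not OpenAI's actual method.

```python
import heapq

def propose_steps(chain):
    """Hypothetical stand-in for asking the LLM to propose candidate
    next reasoning steps; returns [] when the chain is complete
    (here, a fixed depth of 3 steps)."""
    if len(chain) >= 3:
        return []
    return [f"step{len(chain)}-a", f"step{len(chain)}-b"]

def score(chain):
    """Hypothetical stand-in for a learned value model rating a
    partial chain of thought (here: prefer the 'a' branches)."""
    return sum(1.0 if s.endswith("a") else 0.5 for s in chain)

def best_first_cot_search():
    """Best-first search over partial chains of thought: always expand
    the highest-scoring partial chain; the first complete chain popped
    is the winning CoT, which would then drive answer generation."""
    frontier = [(-score(()), ())]  # negate scores for a max-heap
    while frontier:
        _neg_score, chain = heapq.heappop(frontier)
        children = propose_steps(chain)
        if not children:
            return chain  # complete CoT: use it to generate the answer
        for step in children:
            new_chain = chain + (step,)
            heapq.heappush(frontier, (-score(new_chain), new_chain))
    return ()

print(best_first_cot_search())  # -> ('step0-a', 'step1-a', 'step2-a')
```

Backtracking falls out of the search for free: low-scoring branches simply sit in the frontier while better ones are expanded, which matches the "explore alternative branches" behavior described above.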
By explicitly representing the reasoning process as a tree, the LLM can better explore different strategies and avoid getting stuck in a single line of thought. The CoT tree also provides a clear and interpretable explanation of how the LLM arrived at its answer. This is vital both for continuously improving the system via reinforcement learning and for building user confidence in the result.
This approach could also enable o3 to learn more general problem-solving strategies that transfer to new tasks: the famed generalizability that ARC is designed to test.
OpenAI's o3 represents a significant advancement in AI reasoning, pushing the boundaries of what AI systems can achieve. Its impressive performance on various benchmarks, including the ARC AGI prize and Frontier Math, highlights its potential to revolutionize AI research and applications.
Of course, these are the most expensive and least efficient these models will ever be. As these models continue to evolve and inference costs decline, they will likely reshape various industries and accelerate progress towards artificial general intelligence. The o3 engine is a testament to the power of continuous innovation and a glimpse into the future of AI, where complex reasoning and problem-solving are no longer exclusive to human intelligence.
I see a future where we interact at a base level with lower-cost LLMs, and together with these "daily use" models we prepare our most valuable queries to pass to o3-class models on a weekly or monthly basis.
Thank you for helping us grow Life in the Singularity by sharing.
I started this letter in May 2023 to track all the accelerating changes in AI/ML, robotics, quantum computing and the rest of the technologies accelerating humanity forward into the future.
Our brilliant audience includes engineers and executives, incredible technologists, Fortune 500 board members and thousands of people who want to use technology to maximize the utility in their lives.