AI in 2026
Given the trajectory and velocity of current AI breakthroughs, the concept of a monolithic “general intelligence” model has fractured for good.
We are no longer waiting for a single, omnipotent machine to emerge.
The global AI landscape is now defined by hyper-specialized, fierce competition. It is fragmented. It is localized. It’s far better for the consumer and “we the people” in general.
Capability dominance now operates on a strict, mechanistic triad. It depends on the specific operational domain. It depends on the complexity of the workflow. It depends on the degree of autonomous agency required for the deployment.
The machines are no longer passive conversational toys. OpenAI’s GPT-5.5 has secured undeniable supremacy in agentic workflows and command-line execution. Anthropic’s Claude Opus 4.7 is the undisputed leader in ground-up software engineering and codebase manipulation. Google’s Gemini 3.1 Pro leads the frontier in native multimodal synthesis and raw scientific abstraction.
I use Gemini for most of my R&D and lateral thinking. I have a Chief of Staff agent using Gemini’s latest model as the brain.
I also have a Chief of Staff automated inside of Codex, by OpenAI. That model has a different personality. Codex is also wired into more of my applications and tools natively. In fact, in constrained circumstances (when I am in the loop and watching) Codex can even drive my computer: the browser, file system… every surface or action I permission.
Lately, I have the Codex CoS opening a browser to talk to Gemini CoS and coordinate projects (it totally took charge of Gemini and it’s hilarious).
That’s happening at the personal level.
The commercial landscape is overshadowed by unprecedented advancements in autonomous cyber-capabilities. Anthropic’s unreleased Claude Mythos Preview has shattered theoretical capability ceilings. It executes autonomous cyber-exploitation that has forced a reevaluation of global governance structures.
Mythos has also surpassed all the competition in the fabled “Humanity’s Last Exam” benchmark. More on that later in the piece.
Which models and systems you choose are now entirely dependent upon domain, complexity, and autonomous agency.
The Battlefield Assessment
For years, we tracked progress against obsolete baselines.
We measured large language models against MMLU and various iterations of the Graduate-Level Google-Proof Q&A.
These tools were the failing radar systems of a stagnant era.
By the end of 2025, frontier models were routinely achieving accuracy scores exceeding 90% on these legacy evaluations. They saturated the benchmarks completely.
The tests themselves were gobbled up into the training data of the models. These models digested them billions of times and practiced infinite variations until they were perfect at taking that test.
This created a critical visibility void in the industry.
We were lied to. There was no longer a reliable mechanism to measure the operational delta between human experts and machine intelligence.
In response to this crisis, a global collaborative effort forged Humanity’s Last Exam. Developed with the Center for AI Safety and published in Nature, it serves as the final closed-ended benchmark for artificial intelligence. The dataset houses 2,500 highly challenging questions spanning over 100 specialized academic subjects.
It forces multi-step reasoning. It demands domain-specific synthesis. It requires absolute factual verification.
The benchmark is natively multimodal. This is why Google have dominated this test from the start. However, their dominance has come to an end.
Approximately 14% of the questions require the evaluated model to comprehend and extract data from complex diagrams and chemical structures. While 24% of questions use multiple-choice formats, the vast majority require precise, free-form answers. It tests the absolute limits of deductive reasoning.
To combat benchmark hacking, the researchers maintain a private, held-out set of questions. This ensures that leaderboard scores genuinely reflect fluid intelligence rather than rote memorization.
The true innovation of HLE is its rigorous tracking of Calibration Error alongside raw accuracy. A well-calibrated machine exhibits an average statistical confidence that perfectly matches its empirical accuracy. If a model predicts an answer with 50% confidence, it must hit a 50% accuracy rate.
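The confidence/accuracy gap can be made concrete. Below is a minimal sketch of one common formulation, expected calibration error: bin predictions by stated confidence, then average the gap between each bin’s confidence and its actual accuracy. The inputs are hypothetical per-question confidences and correctness flags, not HLE’s actual scoring code.

```python
# Minimal sketch of expected calibration error (ECE), one common way
# to quantify the confidence/accuracy gap described above.
# Inputs are hypothetical: per-question confidences in [0, 1] and
# booleans for whether each answer was correct.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the
    |confidence - accuracy| gap, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A well-calibrated model: 50% confidence, 50% accuracy -> zero error.
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [True, False, True, False]))  # 0.0
# An overconfident model: 90% confidence, 25% accuracy -> large error.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A model predicting at 50% confidence and landing at 50% accuracy scores zero; a model predicting at 90% confidence and landing at 25% is the mathematical signature of hallucination.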
Legacy models like GPT-4o demonstrated extreme calibration errors. They hit 89.0% error paired with a mere 2.7% accuracy. This is what hallucination looks like, mathematically. They “hallucinated wildly with extreme statistical confidence”.
The Blueprint to Build Better Brains
Think of this through the lens of a highly advanced military arsenal.
The data center is the armory. The API is the launch sequence. The models are specialized munitions.
OpenAI’s GPT-5.5 is the autonomous loitering munition. It sacrifices brute-force internal memorization in favor of dynamic, autonomous tool orchestration. It operates inside the command-line interface, hunting for terminal errors with agentic sovereignty.
Swooping in constantly, this agent makes suggestions to harden your code, increase performance, or ideally, of course, both.
When a script throws a runtime error, it reads the output and immediately executes a corrected command. It recalculates its trajectory from first principles. It carries a light payload, restricting its context window to 256K to ensure tight, iterative feedback loops, a best practice as of this writing.
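The observe-and-correct loop above is just control flow. Here is a minimal sketch of it, with a hypothetical `propose_fix()` standing in for the model call. This is not OpenAI’s actual implementation, only the shape of the loop.

```python
# Minimal sketch of the run -> read error -> retry-corrected loop
# described above. propose_fix() is a hypothetical stand-in for the
# model; in a real agent it would call the model API.
import subprocess

def propose_fix(command, stderr):
    """Placeholder for the model call: given a failing command and its
    error output, return a corrected command (or None to give up)."""
    # e.g. send (command, stderr) to the model and parse its reply
    return None

def run_with_retries(command, max_attempts=3):
    for attempt in range(max_attempts):
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout           # success: hand back the output
        fixed = propose_fix(command, result.stderr)
        if fixed is None:
            break                          # model has no further ideas
        command = fixed                    # retry with the corrected command
    raise RuntimeError(f"command failed after {attempt + 1} attempt(s)")

print(run_with_retries("echo hello"))      # prints: hello
```

The key design choice is that the loop is bounded: a small `max_attempts` keeps the feedback cycle tight instead of letting the agent thrash indefinitely.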
I bet in three years we won’t even be talking about context windows any longer…
Anthropic’s Claude Opus 4.7 is the heavy strategic bomber of this era. It flies high above the tactical skirmishes, carrying a massive 1.0M token payload. It does not patch code by merely reacting to terminal errors in loops; it is capable of deep architectural planning.
It built a complete Rust text-to-speech engine entirely from scratch.
It deploys low-level SIMD kernels for hardware acceleration autonomously.
It drops massive architectural frameworks with devastating precision.
To borrow from Sam Altman in an ironic way, Opus is a very good model.
Google’s Gemini 3.1 Pro is the global radar network. It sweeps vast multimodal (often extremely unstructured) data streams natively from the ground up. It does not bolt vision or audio onto a text model first… it processes varied data streams directly. They used to call this Pathways inside the Google world.
It processes massive operational theater data without abstraction degradation. It sweeps the entire library of multi-format data and extracts nuanced insights.
Google have long been my “AGI sweepstakes” race winner.
They have the compute. They have the TPUs. They have Jeff x Geoff. They have DeepMind with Demis. They have data. They have direct surface area with users.
Google will win more, but in fairness, we will all win.
The New Frontier
The era of relying on a single API endpoint for all cognitive tasks is definitively over.
You cannot deploy a loitering munition to do the job of a strategic bomber.
Optimization requires a deliberate, multi-model orchestration strategy.
We must deploy OpenAI for autonomous execution. We must deploy Anthropic for high-stakes codebase architecture or other forms of asset generation. We must deploy Google for massive multimodal synthesis and research projects.
On Terminal-Bench 2.0, GPT-5.5 achieved a state-of-the-art score of 82.7%. It established a massive 13-point lead over Claude Opus 4.7 in command-line workflows. If your workflow requires a model to operate a computer environment, GPT-5.5 is unmatched.
Anthropic blew away the competition in the SWE-bench Pro evaluation with a 64.3% success rate. It achieved a 90.9% score on BigLaw Bench, parsing highly ambiguous corporate legal jargon. Anthropic remains the undisputed champion of pure coding architecture.
Google’s Gemini 3.1 Pro dominates the GPQA Diamond benchmark with a 94.3% score. It is offered at exactly half the cost of Claude Opus 4.7. For enterprise workflows requiring vast cross-modal data processing, Gemini provides unparalleled scalability.
No one matches the cost-to-power ratio of Gemini today. But that gap is narrowing, with xAI, Anthropic and OpenAI all closing in.
Mythos may be the baddest model on the block now.
It autonomously discovered and exploited a 27-year-old vulnerability in OpenBSD. It found an exploit in FFmpeg that had evaded five million automated fuzzing attempts. It chained six separate RPC requests to escalate to total root control on FreeBSD and Linux.
It’s Neo running around the Matrix, opening up any door it wants.
This is not an academic exercise. This is an autonomous cyber-warfare capability. Anthropic withheld the model entirely, launching Project Glasswing strictly for elite cybersecurity professionals and allied governments. The true upper limit of intelligence is no longer gated by compute, but by safety and geopolitical risk.
How are we achieving all this?
Networked compute.
The massive 1.0M token windows require staggering amounts of raw compute. OpenAI survives the data latency war through the Multipath Reliable Connection (MRC) protocol.
MRC spreads individual data transfers across hundreds of distinct hardware paths in milliseconds. Anthropic secured the capacity of 220,000 Nvidia GPUs at the SpaceX Colossus 1 facility. The computational floor required to compete is rising exponentially.
We must build sovereign, cost-efficient digital supply chains. Mistral Nemo and Google’s Gemma 4 provide massive value for standard routing tasks. Enterprise architectures are increasingly utilizing cheap, open-weight models for 80% of routine computational tasks.
Reserve the expensive, heavy frontier APIs strictly for high-value, complex reasoning operations. Organizations that fail to adopt dynamic model routing will be rapidly outpaced by competitors. They will be destroyed by those who leverage the fractured supremacy of this landscape.
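In practice, dynamic routing can start as something very simple. Below is a minimal sketch, using the model names from this piece as illustrative labels (the routing table, the open-weight default, and the 80/20 split are all assumptions, not anyone’s production system): cheap open-weight models catch routine work, and frontier APIs are reserved for high-value reasoning.

```python
# Minimal sketch of the dynamic model-routing strategy described above.
# Model names are illustrative labels from this piece, not real API ids.
# The rule encoded: open-weight workhorse by default, frontier model
# only when the domain and reasoning demands justify the cost.

ROUTES = {
    # (domain, needs_deep_reasoning) -> frontier model
    ("cli_automation", True):      "gpt-5.5",          # agentic execution
    ("code_architecture", True):   "claude-opus-4.7",  # heavy codebase work
    ("multimodal_research", True): "gemini-3.1-pro",   # cross-modal synthesis
}

CHEAP_DEFAULT = "gemma-4"  # open-weight default for routine tasks

def route_task(domain, needs_deep_reasoning):
    """Return the model a task should be dispatched to."""
    return ROUTES.get((domain, needs_deep_reasoning), CHEAP_DEFAULT)

print(route_task("cli_automation", True))   # gpt-5.5
print(route_task("summarization", False))   # gemma-4
```

A real router would score tasks instead of using a static table, but even this static version captures the point: the routing decision, not the model choice, is where the cost leverage lives.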
Stop assuming the machine will do the thinking for you.
Command the infrastructure. Orchestrate the arsenal. Own the execution.
If you don’t know what to build first, let’s talk.
Friends: in addition to the 17% discount for becoming annual paid members, we are excited to announce an additional 10% discount when paying with Bitcoin. Reach out to me; these discounts stack on top of each other!
Thank you for helping us accelerate Life in the Singularity by sharing.
I started Life in the Singularity in May 2023 to track all the accelerating changes in AI/ML, robotics, quantum computing and the rest of the technologies accelerating humanity forward into the future. I’m an investor in over a dozen technology companies and I needed a canvas to unfold and examine all the acceleration and breakthroughs across science and technology.
Our brilliant audience includes engineers and executives, incredible technologists, tons of investors, Fortune-500 board members and thousands of people who want to use technology to maximize the utility in their lives.
To help us continue our growth, would you please engage with this post and share us far and wide?! 🙏


