AI Hardware vs AI Software
Fifteen years ago, I sat in a trading room at a major investment bank. The data models we built were clever, sure, but felt limited in retrospect. Nowadays the word around every boardroom table is “AI”.
Get it straight: it's a revolution.
The way I see it, the AI engine driving this change is no monolith.
It's a pair, a dynamic duo: software and hardware. The relationship reminds me of the old fable, the tortoise and the hare.
Software is the hare, zipping around with constant improvement. Clever algorithms, new frameworks – it makes for a thrilling sprint.
But the tortoise, that's hardware.
Its steps feel slower but make the earth shake.
Think about the leap from CPUs to GPUs, or Google's TPUs shaking up the whole game. Those hardware strides aren't fast, but they fundamentally alter the landscape, letting the hare sprint ahead in a whole new direction.
Let's be clear, this is about way more than who wins the race.
AI's future depends on this dance.
We need the hare's agility to make daily breakthroughs, but without the tortoise's ground-moving power, those breakthroughs eventually feel constrained. For a data engineer who's been in the game this long, that interplay, that's where the true magic of this revolution lies.
Take the models we used to build in the investment banking days.
We packed our spreadsheets with clever formulas, always optimizing for calculation speed within the existing constraints. AI shifts this paradigm. It's not just about making our old ways of thinking faster; it's about unlocking entirely new approaches.
The tortoise bursts through some barrier – more memory, a new processor that handles neural networks differently – and suddenly software that felt futuristic yesterday has the runway it needs. This isn't about speed for speed's sake. The point of the hare's sprint isn't to arrive first, it's to constantly explore new territory – territory made accessible only by the tortoise's might.
To me, this is what makes the AI transformation different from any tech shift I've seen.
The limits don't feel fixed anymore, at least not in the same way.
We're in a constant loop. Hardware breaks barriers, the software races to fill the new space, which fuels a need for even greater hardware capability... it's that cycle of advancement that has the potential to reshape not just industries, but the way we think about computation itself.
Software: The Hare
Back in my banking days, if you wanted to do anything resembling machine learning, you had better have been a hardcore programmer with a hefty dose of statistics knowledge. Nowadays, frameworks like TensorFlow and PyTorch have changed that completely. They're still technical tools, don't get me wrong, but the barrier to entry is worlds lower.
This isn't just about convenience.
More than anything, the democratization of AI development means more minds tackling more problems. Those frameworks offer a level of abstraction. You're less bogged down in the noise of optimizing matrix multiplication, and have more brainpower to focus on the high-level design of your models.
This makes AI accessible not just to coders, but to domain experts.
A biologist who knows Python can experiment with image analysis on their data. An economist can build their own forecasting model. That's powerful, because AI shines when it taps into that deep, specialized knowledge.
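To make that abstraction concrete, here's a minimal sketch of a tiny image classifier defined in a handful of PyTorch lines, with no hand-written matrix math anywhere. The layer sizes, image dimensions, and two-class setup are illustrative assumptions, not a recipe.

```python
# A minimal sketch of how little framework code a small image classifier needs.
# Shapes and the two hypothetical classes are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn simple visual features
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # collapse spatial dimensions
    nn.Flatten(),
    nn.Linear(16, 2),                            # two hypothetical classes
)

images = torch.randn(8, 3, 64, 64)               # a dummy batch of RGB images
logits = model(images)                           # forward pass: scores per class
print(logits.shape)                              # torch.Size([8, 2])
```

A domain expert can start from something like this and spend their energy on the data and the problem, not on the plumbing.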
Algorithms Get Smarter
Of course, frameworks alone wouldn't be much without the algorithms to run inside them. One area that fascinates me is reinforcement learning.
Remember those headlines about AI mastering games like Go? That's largely RL.
The idea of machines learning through trial and error, guided by rewards, isn't exactly new. But advances in techniques – Deep Q Networks, policy gradients, you name it – combined with greater computational power have led to massive breakthroughs. This stuff's not just for games anymore. RL is being explored to optimize things like robot control and even industrial processes.
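For a feel of that trial-and-error loop, here's a toy tabular Q-learning sketch (far simpler than Deep Q Networks or policy gradients, but driven by the same reward signal). The five-state "corridor" environment and the hyperparameters are invented purely for illustration.

```python
# A toy sketch of reward-driven trial-and-error learning with tabular Q-learning.
# The 5-state "corridor" environment and all hyperparameters are illustrative assumptions.
import random

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right; the reward waits at the far right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def pick_action(s):
    if random.random() < epsilon:          # explore occasionally
        return random.randrange(n_actions)
    best = max(Q[s])
    return random.choice([a for a in range(n_actions) if Q[s][a] == best])  # greedy, ties broken randomly

for episode in range(300):
    s = 0
    while s != n_states - 1:
        a = pick_action(s)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])   # Bellman-style update toward reward
        s = s_next

print([round(max(row), 2) for row in Q])   # value estimates rise toward the rewarding end of the corridor
```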
Alongside that, computer vision feels like it's reaching an inflection point.
We've gone so far beyond basic image classification. AI systems can segment objects in complex scenes, generate images from text prompts... it's wild stuff. This isn't just tech demos; it opens doors to real-world uses: medical imaging analysis, self-driving cars finally becoming viable, who knows what else.
Mixture of Experts: Routing for Smarter AI
Traditional neural networks, despite their power, have a drawback: every part of the network is activated for every single input. This can be wasteful, especially as models grow enormous.
Enter Mixture of Experts.
The core idea of MoE is almost disarmingly simple. Instead of one monolithic network, you have a collection of smaller neural networks, each considered an "expert". Then, you have a separate 'routing function' that decides which experts should be activated for a given input.
Here's why this is a big deal.
It means that only a subset of the overall model has to crunch the numbers for any particular piece of data. Need to recognize a cat? Activate the image expert. Process a chunk of text? Route that to the language expert. This specialization can lead to massive efficiency gains.
Let's talk about the routing function itself. It's not a simple switchboard. This function is learnable, meaning the model gets better and better at deciding which experts are most relevant for each task. The result? Not just efficient use of computation, but also the potential for the model to discover latent structure in the data.
Think of it like this: imagine a dataset with a mix of animal pictures and financial reports. A clever routing function within an MoE system might learn to send those images to one set of experts, and the financial text to a completely different set. It's a form of automatic specialization within the model.
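Here's a minimal sketch of that idea in PyTorch: a handful of small "expert" layers plus a learnable router that decides how much each expert contributes to a given input. Real sparse MoE systems activate only the top-scoring experts per input; this toy version blends all of them so the code stays short, and all sizes are illustrative assumptions.

```python
# A minimal Mixture-of-Experts sketch with a learned routing (gating) function.
# Sizes are arbitrary; real MoE layers use sparse top-k routing rather than blending every expert.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)   # learnable routing function

    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)                  # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, dim)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)          # blend experts per input

moe = TinyMoE()
out = moe(torch.randn(8, 32))
print(out.shape)   # torch.Size([8, 32])
```

The router is trained jointly with the experts, which is exactly how the specialization described above can emerge on its own.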
Why is this relevant to the "hare and tortoise" concept? MoE is primarily a software-level innovation. Yet, it has the potential to massively impact what we demand from hardware. By training massive models more efficiently, it creates pressure for hardware to keep up. More memory to hold those giant experts, faster connections to let them communicate.
The core idea dates back decades, but MoE at modern scale is still a relatively young research area, and it holds incredible promise for scaling AI to uncharted territories.
Here's the thing: AI models are notorious for being black boxes. MoE, by its nature, introduces a degree of explainability. We can analyze which experts are firing for different tasks, helping us crack the model open a bit.
From investment banking algorithms to cat picture detectors, efficiency has always been crucial.
MoE offers a whole new way to think about efficiency within the AI engine itself – and thus influences the very trajectory of the race between our software hare and hardware tortoise.
Data is Still King
You can have the fanciest algorithm, the most intuitive framework, but feed it garbage data and you'll get...well, garbage results. AI's hunger for data is insatiable. The success we've seen lately is built, in part, on the vast datasets collected through the internet age.
But it's not just about having raw data. Preprocessing, augmentation, this is where AI engineers (and often domain experts) earn their keep. Imagine an image classification model. If all your training images are in bright daylight, the AI might choke at night. Data augmentation (rotating, changing brightness, etc.) builds robustness to those variations.
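As a concrete sketch of that daylight-versus-night robustness point, here's what augmentation often looks like with torchvision (my choice of library here, not the article's); the specific transforms and ranges are illustrative, not a prescription.

```python
# A hedged sketch of data augmentation with torchvision.
# The transform choices and ranges are illustrative assumptions.
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=15),                 # small rotations
    T.ColorJitter(brightness=0.5),                # simulate lighting changes, e.g. dusk vs daylight
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

img = Image.new("RGB", (64, 64), color=(120, 180, 90))   # stand-in for a real training image
augmented = augment(img)                                  # a new, perturbed view each call
print(augmented.shape)                                    # torch.Size([3, 64, 64])
```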
Clean, curated, labeled datasets – often tailored for a specific problem – are precious. It's one reason Big Tech, sitting on its data goldmines, has often led the charge in AI research. But smaller players can still get creative, especially now with...
Collaboration is the Superpower
The open-source movement supercharges the software hare. I'd never have gotten to where I am today without the ability to learn from code others have shared. Whole communities building on each other's work, that's what makes progress happen at a pace that would've seemed impossible even a decade ago.
Pre-trained models are a great example. Let's say you need a text classifier. Instead of starting from zero, you can grab a massive model (think GPT-3) already trained on a vast corpus of text. Fine-tuning it for your specific task requires less data and time compared to building a model from scratch.
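Here's a hedged sketch of that pre-trained-then-fine-tune pattern for a text classifier, using the Hugging Face transformers library. GPT-3 itself isn't openly downloadable, so a small open model stands in, and the two-label setup and example sentences are assumptions for illustration.

```python
# A sketch of the pre-trained-then-fine-tune pattern for text classification.
# distilbert-base-uncased is a small open stand-in model; the two labels are hypothetical.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

batch = tokenizer(["great quarterly results", "the outlook is grim"],
                  padding=True, return_tensors="pt")
outputs = model(**batch)          # fine-tune these logits on your own labeled data
print(outputs.logits.shape)       # torch.Size([2, 2])
```

Only the small classification head starts from scratch; everything the base model already learned about language comes along for free.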
This sharing of knowledge, it accelerates the whole field.
Now, don't confuse 'open' with 'easy.' AI development still requires expertise, but the collaborative spirit lowers the entry point and fuels the collective learning curve that lets the hare zip ahead so quickly.
Hardware: The Tortoise
Remember when GPUs first became the darling of the ML world?
It felt like a revelation. But like all innovations, there's always someone trying to build the next better thing. Google's Tensor Processing Units (TPUs) entered the scene with precisely that mindset: what if we didn't adapt existing hardware to AI, but built it for AI from the ground up?
The key difference lies in specialization. GPUs, for all their power, handle various workloads. This versatility is a strength, but for neural network operations, it introduces overhead. TPUs strip away everything not explicitly geared towards matrix operations and the specific demands of AI training.
Google has been busy building successive generations of TPUs to accelerate AI.
Let's explore how the design has evolved.
Each generation of TPUs has brought significant increases in raw computational power, measured in FLOPS, enabling faster training of complex AI models on larger datasets. The units have also been refined for power efficiency: newer generations are fabricated on smaller process nodes (e.g., from 28nm in TPUv1 to 5nm in recent generations), which yields significant gains in performance per watt.
Architecturally, a TPU chip contains one or more TensorCores; the number of TensorCores depends on the TPU version. Each TensorCore consists of one or more matrix-multiply units (MXUs), a vector unit, and a scalar unit.
The core of the TPU design is the MXU.
An MXU is composed of 128 x 128 multiply-accumulators in a systolic array. MXUs provide the bulk of the compute power in a TensorCore. Each MXU is capable of performing 16K multiply-accumulate operations per cycle.
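A quick back-of-the-envelope check of that figure: 128 x 128 multiply-accumulators means 16,384 (roughly 16K) MACs every cycle. The clock speed in the last line is a purely illustrative assumption, not a published TPU spec.

```python
# Back-of-the-envelope arithmetic for a single 128 x 128 systolic array.
macs_per_cycle = 128 * 128                  # 16384, i.e. ~16K multiply-accumulates per cycle
print(macs_per_cycle)

flops_per_cycle = 2 * macs_per_cycle        # each MAC counts as one multiply plus one add
assumed_clock_hz = 1e9                      # hypothetical 1 GHz clock, purely for illustration
print(flops_per_cycle * assumed_clock_hz)   # ~3.3e13 FLOPS for one MXU at that assumed clock
```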
Rounding out the TensorCore, the vector unit is used for general computation such as activations and softmax, while the scalar unit is used for control flow, calculating memory addresses, and other maintenance operations.
The TPU's impact isn't just about raw speed, though.
TPUs, particularly the pod designs with numerous interconnected units, make scaling experiments up or down more seamless. A TPU Pod is a contiguous set of TPUs connected over a specialized high-speed network, and the number of TPU chips in a pod depends on the TPU version.
Subsequent TPU generations (v4 and beyond) have featured larger pools of high-bandwidth memory (HBM) packaged close to the compute cores. This reduces data transfer bottlenecks and boosts training speed.
For many AI workloads, lower-precision formats offer similar accuracy to full 32-bit floating point. TPUs were early adopters of the bfloat16 format, which balances accuracy and computational efficiency. While optimized for lower precision formats, TPUs also support 32-bit floating point for tasks where greater precision is essential.
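A small illustration of the trade-off bfloat16 makes: it keeps float32's dynamic range while giving up mantissa precision, which is why very large and very small magnitudes survive where float16 would overflow or underflow. The values below are arbitrary examples, shown here with PyTorch rather than on a TPU.

```python
# bfloat16 keeps float32's exponent range but rounds away mantissa bits.
import torch

x = torch.tensor([3.1415926, 1e30, 1e-30], dtype=torch.float32)
print(x.to(torch.bfloat16))   # large and tiny magnitudes survive, but digits are rounded
print(x.to(torch.float16))    # float16 overflows 1e30 to inf and underflows 1e-30 to 0
```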
Networking between TPUs has continually improved.
Custom high-speed interconnects and the development of TPU Pods significantly increased the scalability of large AI workloads. The ability to efficiently distribute model training across vast numbers of TPU cores has been crucial in the development of increasingly complex AI models.
What does this mean practically? In comparison to CPUs and even GPUs of the time, TPUs delivered huge leaps in speed and energy efficiency. Suddenly, researchers could train models in days that would have taken weeks before. It's not an exaggeration to say TPUs reshaped the economics of large-scale AI development.
GPUs Still Reign
Let's not crown the TPU king just yet.
GPUs are anything but obsolete. For starters, their versatility makes them irreplaceable in many areas where AI intersects with established workflows. Think 3D rendering, scientific simulations...GPUs excel at that mix of computation.
Additionally, GPUs have a massive head start in accessibility.
Developers are well-versed in tools like CUDA, the ecosystem of libraries supporting them is robust, and cloud providers make it easy to spin up GPU instances. As a data engineer, sometimes practicality and momentum win the day.
This isn't to say GPUs don't evolve.
Each generation brings architectural tweaks aimed specifically at AI performance improvements alongside the usual graphical powerhouse upgrades. GPUs and TPUs push each other – the hardware race is far from settled.
Memory is the Bottleneck
CPUs, GPUs, TPUs... nothing works without memory.
And AI, with its massive models and datasets, has an absolutely voracious appetite for the stuff. Even with a blazing fast processor, if you're constantly bottlenecked on data transfer to and from memory, you've got a problem.
Speed and sheer capacity are both key. We need memory fast enough that the processor isn't left idling while waiting for the next batch of data. But we also need enough space to hold modern models, especially during training, when intermediate calculation results have to be stored as well. HBM (High Bandwidth Memory), packaged alongside recent high-end accelerators, is a great example of the constant push in this area.
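Rough arithmetic shows why capacity is such a constraint during training. The 7-billion-parameter model size and the Adam-style optimizer state are illustrative assumptions; activations and batch size would add substantially more on top.

```python
# Illustrative arithmetic only: memory to hold a hypothetical 7B-parameter model
# plus gradients and Adam-style optimizer state, all in 32-bit floats.
params = 7e9
bytes_fp32 = 4

weights    = params * bytes_fp32        # ~28 GB
gradients  = params * bytes_fp32        # ~28 GB
adam_state = 2 * params * bytes_fp32    # first + second moments, ~56 GB

total_gb = (weights + gradients + adam_state) / 1e9
print(round(total_gb), "GB before counting activations")   # ~112 GB
```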
Memory breakthroughs are a game-changer for AI hardware, especially when it comes to tackling the ever-growing data demands of AI models.
Challenges of Traditional Memory
In traditional computers, processing units and memory are separate.
Data needs to be constantly shuttled back and forth, creating a bottleneck that limits processing speed — this is sometimes called the Von Neumann Bottleneck.
Standard memory (DRAM) struggles to keep pace with the data-hungry nature of AI algorithms. There are many competing approaches for "memory 2.0".
Let’s take a quick survey.
High-Bandwidth Memory (HBM):
Stacks multiple DRAM layers vertically, creating a wider path for data flow.
This significantly increases the bandwidth between memory and processor, allowing AI hardware to access data much faster.
Non-Volatile Memory (NVM):
Technologies like 3D XPoint offer a different set of trade-offs versus traditional DRAM:
Near-DRAM Read/Write Speeds: Far faster than flash storage, narrowing the gap with DRAM for data access.
Persistent Storage: Retains data even after power loss, unlike DRAM, allowing for faster model reloading and checkpoint recovery during training.
In-Memory Computing (IMC):
A radical approach that integrates processing units right within the memory itself.
This largely eliminates the data transfer bottleneck, potentially leading to massive speedups and lower power consumption for specific AI tasks.
While still under development, IMC research shows promise for the future of AI hardware.
These memory breakthroughs directly translate to advancements in AI hardware in several ways:
Faster Training Times: By speeding up data access, AI models can be trained on larger datasets in less time.
More Complex Models: The increased memory bandwidth allows AI hardware to handle more complex models with billions of parameters.
Lower Power Consumption: Technologies like NVM can potentially reduce the overall power needs of AI systems.
Improved Efficiency: Faster memory combined with specialized processors like TPUs leads to a more efficient overall AI hardware architecture.
Overall, breakthroughs in memory are crucial for pushing the boundaries of AI hardware. They pave the way for faster, more powerful AI systems capable of tackling even more challenging problems.
Think of it like this: imagine your hare-like software brain has lightning-fast ideas, but the tortoise crawls along handing it those ideas one tiny notecard at a time. That's what happens when memory becomes a limitation. This is one area where pure processing power isn't enough, hardware innovation needs to happen on the storage front too.
Training Across Networks
Here's where the tortoise starts to feel downright revolutionary.
There comes a point where datasets and models are so enormous that no single machine, no matter how beefy, can handle them alone. This is where distributed training comes in.
Cutting-edge networking has been a quiet enabler of the AI boom. Super-fast interconnects allow multiple machines to work in tandem, each tackling a chunk of the problem with coordination between them. This breaks the bounds of what any single piece of hardware would otherwise be capable of.
It's not just about the network cables themselves. Model parallelism, where different parts of the neural network reside on separate machines, is where this gets truly fascinating for a data engineer. It creates immense new challenges in synchronization and optimization, but it also lets the tortoise grow a very long neck indeed.
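Here's a minimal sketch of the model-parallel idea: different layers live on different devices, and activations hop across the interconnect between them. It assumes two GPUs but falls back to CPU so it still runs; real systems layer pipelining and synchronization on top of this.

```python
# A minimal model-parallel sketch: two stages of a network on two devices.
# Assumes two GPUs are available; otherwise everything falls back to CPU.
import torch
import torch.nn as nn

dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 1024).to(dev0)   # first half of the network
        self.stage2 = nn.Linear(1024, 10).to(dev1)     # second half on another device

    def forward(self, x):
        h = self.stage1(x.to(dev0))
        return self.stage2(h.to(dev1))                 # activations cross the interconnect here

model = SplitModel()
print(model(torch.randn(4, 1024)).shape)               # torch.Size([4, 10])
```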
As AI models grow larger and datasets become more massive, traditional copper-based interconnects are starting to groan under the load.
We need faster, more efficient ways to move data within computers and across vast data centers.
This is where optical interconnects shine.
The core concept is simple: instead of transmitting data as electrical signals through wires, we use light. Optical fibers, thin strands of specialized glass, act as our superhighways for data.
This offers four compelling advantages:
Blazing Speed: Light travels incredibly fast through optical fibers. This translates to lower latency (delay), especially noticeable over longer distances. When training AI models across multiple systems, every millisecond saved matters.
Bandwidth Bonanza: Optical interconnects can handle far greater data rates than their electrical counterparts; we're talking multiple terabits per second (see the rough arithmetic sketch after this list). This increased bandwidth unlocks larger models and training sets without the system choking on data flow.
Energy Efficiency: While it might seem counterintuitive, optical interconnects can be significantly more power-efficient than copper, especially over longer distances. As AI's energy footprint becomes a concern, this is a crucial consideration.
Goodbye, Interference: Optical fibers are immune to electromagnetic interference, a common issue with electrical cabling, especially in the noisy environment of a data center. This leads to improved signal reliability, crucial for maintaining accuracy during complex AI calculations.
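To put those bandwidth numbers in perspective, here's some illustrative arithmetic for a single gradient exchange during distributed training. The 10 GB payload and the two link rates are assumptions chosen for the example, not measurements of any particular system.

```python
# Illustrative arithmetic only: time for one gradient exchange at different link speeds.
payload_bits = 10e9 * 8                 # an assumed 10 GB of gradients

for name, bits_per_second in [("100 Gb/s copper-class link", 100e9),
                              ("1.6 Tb/s optical-class link", 1.6e12)]:
    print(name, round(payload_bits / bits_per_second * 1000), "ms per exchange")
```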
The Next Big Leap: What's Beyond?
Quantum's Promise
If hardware is the tortoise and software the hare, quantum computing is a whole different beast entering the race.
It operates based on fundamentally different principles: superposition, entanglement...heady stuff for anyone not steeped in quantum physics (and I'm certainly no expert).
But let's talk potential AI implications that excite the data engineer in me. One huge area is optimization.
Many classic AI problems boil down to finding the best solution within a vast search space. Quantum algorithms, like Grover's search, offer theoretical speed-ups (quadratic, in Grover's case) that could make brute-force search on even our most powerful classical computers look quaint in comparison.
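To make "theoretical speed-up" concrete: Grover's algorithm finds a marked item in an unstructured search space with on the order of the square root of N queries, where brute force needs on the order of N. The search-space sizes below are arbitrary examples.

```python
# Illustrative query counts for unstructured search: classical brute force vs Grover's algorithm.
import math

for n in [1e6, 1e12]:
    print(f"N = {n:.0e}: classical ~{n:.0e} queries, Grover ~{math.sqrt(n):.0e} queries")
```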
Simulation is another field ripe for disruption. Modeling chemical reactions, material properties... these are computationally demanding tasks with huge real-world applications. Quantum computers, with their ability to represent complex systems more directly, could transform industries completely.
Yet, it's the kind of potential leap that makes me think about the shift from CPUs to GPUs all over again.
If (or rather, when) those hardware hurdles are overcome, the door swings wide open for our nimble software hare. We may need to completely rethink our AI paradigms to utilize this power.
The Brain as Blueprint
Neuromorphic computing is a different kind of gamble, but one that could revolutionize AI's relationship with energy consumption.
Traditional computers, from pocket calculators to supercomputers, follow the von Neumann architecture: memory and processing are separate entities. Our brains operate...well, very differently.
Neuromorphic chips aim to mimic elements of biological brains. Spiking neurons, distributed memory, asynchronous computation... these concepts turn conventional computer design on its head. The payoff could be AI that runs at a tiny fraction of the energy cost of our current models.
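Here's a toy sketch of the kind of unit neuromorphic chips emulate: a leaky integrate-and-fire neuron that stays quiet until its membrane potential crosses a threshold, then emits a spike and resets. All constants are illustrative assumptions, simulated in plain Python rather than on any neuromorphic hardware.

```python
# A toy leaky integrate-and-fire neuron: integrate noisy input, leak toward rest, spike at threshold.
# All constants are illustrative assumptions.
import random

v, v_rest, v_threshold = 0.0, 0.0, 1.0
leak, dt = 0.1, 1.0

for t in range(50):
    input_current = random.uniform(0.0, 0.3)          # noisy input drive
    v += dt * (-(v - v_rest) * leak + input_current)  # integrate input, leak toward rest
    if v >= v_threshold:
        print(f"spike at t={t}")                       # the neuron only "speaks" when it must...
        v = v_rest                                     # ...then resets and stays cheap otherwise
```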
Imagine this: AI on your phone as sophisticated as GPT-4 / Claude Opus / Gemini Ultra, but without draining your battery in an hour. Tiny, embedded AI sensors that operate for years without maintenance. While matching the raw power of our data center behemoths is a distant dream, the potential for edge computing is enormous.
The tortoise metaphor still feels apt here. Neuromorphic hardware development progresses in smaller, but significant, steps. The biggest challenge is arguably on the software side. We lack a deep understanding of how our own brains achieve their computational feats, let alone how to translate those principles into algorithms for silicon-based counterparts.
The common thread between quantum and neuromorphic possibilities is uncertainty.
Will either, or both, truly reshape the AI landscape?
It's hard to say, but that's precisely what makes watching the tortoise so exciting.
Every so often, the ground truly shifts.
Maybe we don't get a single, dramatic upheaval, but steady progress in 'conventional' hardware still drives incredible gains.
Faster processors, denser memory, more sophisticated network topologies – these advancements, even if incremental, enable our software to keep sprinting ahead in turn.
The fable of the tortoise and the hare, in our AI tale, serves a dual purpose. It reminds us of the contrasting paces of progress and warns us against prematurely declaring a winner. Software, with its agility and constant reinvention, propels daily breakthroughs. A new framework here, a clever algorithmic twist there: it's the breathless sprint that captivates us in the present.
Hardware periodically changes the dynamics of the race. New memory technologies, specialized processors, massively parallel systems...these are the leaps that don't simply extend the racetrack, they alter the very terrain the hare navigates. It's hardware's might that ultimately sets a new baseline, enabling software innovation that would've been unthinkable before.
However, focusing solely on the race misses the most incredible part of this story.
Software and hardware don't exist in isolated vacuums.
They form an extraordinary feedback loop, each advance fueling a need and creating new pressure on the other.
Just look at the modern AI development pipeline.
Powerful software simulations are used to test and iterate chip designs before they ever reach the fabrication stage. GANs (Generative Adversarial Networks) and related AI techniques, born purely in the software realm, are being explored in this domain to improve processor layouts and anticipate potential bottlenecks.
Today's hare leaves tracks for tomorrow's tortoise to follow.
A trillion hares in a simulation are virtually testing the perfect path for the tortoise to run!
The result is a cycle of progress that feels exponential in nature. A seemingly small stride by the tortoise lets the hare race further and uncover entirely new directions. Those discoveries, in turn, highlight where the hardware needs to evolve next. It's an arms race of efficiency, a constant pushing back of what was once deemed an uncrossable frontier.
What does this symphony of progress hold for us?
The most honest answer is, we only know a fraction. AI, even in its remarkable present form, feels limited primarily by our underlying tools. Give it wings, the ones only our most powerful hardware can provide, and the world it creates might be unrecognizable.
This brings me back to my Wall St days. Spreadsheets and traditional computational methods felt like they were squeezing the absolute limits out of the available data. AI offers a way to change the equation, not merely to calculate faster, but to calculate differently. It breaks down conceptual walls once considered immovable.
This isn't to idolize the idea of a pure AI future. The most profound achievements will likely stem from human ingenuity amplified by ever-evolving machines. The true innovators will be those who understand this dance, who can envision software algorithms tailored to the hardware capabilities of the future, and who use that knowledge to solve problems we haven't even fully articulated yet.
In a world shaped by this cycle of advancement, the very concept of "unimaginable" may start to lose its meaning. What feels impossible today may be tomorrow's engineering challenge, solved by a new generation of minds raised in a world where AI-infused tools accelerate their ideas.
Maybe it's less about a magical finish line, and more about accepting the perpetually accelerating nature of the race itself. We're witnessing a transformation unlike anything before – not just in technology, but in how computation itself propels discovery.
The hare and the tortoise will keep dancing and running their race. And we are fortunate to have a front-row seat to their spectacular, ever-evolving performance.
👋 Thank you for reading Life in the Singularity. I started this newsletter in May 2023, and technology has kept accelerating ever since. Our audience includes Wall St analysts, VCs, Big Tech data engineers and Fortune 500 executives.
To help us continue our growth, would you please Like, Comment and Share this?
Thank you again!!!