Think about how traditional language models work.
They gobble up sentences word by word, crunching through the same heavy calculations for every single word. But let's be honest, not every word deserves the VIP treatment. Some are keys to the whole conversation; others are filler.
This seems wasteful.
…and expensive!
On April 2, 2024, Google DeepMind released their Mixture of Depths (MoD) design to address exactly this.
Mixture of Depths is like hiring a super-efficient manager for your language model. This manager has a fixed budget of computational resources to spend.
They get to decide which words in a sentence are worth the heavy investment and which can get by with a lighter touch. MoD allows the model to dynamically change this resource allocation throughout its layers.
The Secret Weapon: Top-k Routing
LLMs rely on a concept called attention to process information.
Attention lets the model focus on the parts of the input text that matter most for the task at hand, like generating a response, translating a sentence, or writing in different creative formats.
MoD adds a twist on top of this. Picture each layer of the model as having a set number of "VIP tickets" (k) to hand out to words.
A small, learned router scores every word at that layer and grants the highest scorers access to the full suite of computations, attention included, while everyone else skips the layer via a shortcut connection. Think of it as a constant competition for compute.
Imagine the LLM is reading a book.
At each layer, the router assigns a score to every word based on how much it matters in the current context. The top-k mechanism then limits full processing to the k words with the highest scores, effectively pruning out less relevant words so the model concentrates on the most important parts of the input.
Not only does this cut wasted work, the compute budget can be tuned by adjusting the value of k. A smaller k saves more FLOPs but risks skipping words that later turn out to matter, while a larger k preserves more of the input at a higher cost.
Overall, top-k routing is a simple, predictable way to control how much compute the model spends. It keeps the model's processing power focused and improves efficiency, but choosing the right k is a balancing act between aggressive filtering and maintaining a comprehensive understanding of the input.
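To make that concrete, here is a minimal PyTorch-style sketch of the selection step. It assumes a tiny linear router; the names (`router`, `hidden_states`, `k`) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch of top-k token selection (not the paper's code).
# A tiny linear "router" gives every token a scalar score; only the k
# highest-scoring tokens are picked for the block's full computation.

batch, seq_len, d_model, k = 2, 16, 64, 4

hidden_states = torch.randn(batch, seq_len, d_model)   # token representations
router = nn.Linear(d_model, 1)                          # one score per token

scores = router(hidden_states).squeeze(-1)              # shape: (batch, seq_len)
topk_scores, topk_idx = torch.topk(scores, k, dim=-1)   # the k "VIP ticket" holders

# Gather just the selected tokens; in MoD, everything else skips the block.
selected = torch.gather(
    hidden_states, 1,
    topk_idx.unsqueeze(-1).expand(-1, -1, d_model),
)
print(selected.shape)  # torch.Size([2, 4, 64])
```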
The cool part is MoD aims to achieve similar performance to traditional models while using significantly less computing power during inference (that's when we make predictions).
This saves money and can make our models run faster.
Unlike some other efficiency methods, MoD focuses on rearranging the way we use existing resources, not just adding more parameters and making our models even bigger. More on this later.
MoD allows a model to learn which words in a sentence are most important. It can then focus more computational resources on those important words. This is done across different layers of the model, meaning some layers can prioritize different words.
The goal isn't just a smarter model, it's a more efficient one.
MoD-trained models can achieve the same accuracy as traditional models while using fewer FLOPs during inference, which reduces the cost to build and serve them.
This makes them faster too.
There are multiple possible set-ups for the routing logic that drives MoD. The paper compares token-choice routing (each token picks its own path) with expert-choice routing (each path picks its top-k tokens) and favors the expert-choice style. A rough sketch of a routed block follows.
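Here is a toy sketch of how a single routed block might look, under my own assumptions rather than DeepMind's implementation: a linear router scores each token, the top-k pass through the wrapped block, and everything else rides the residual stream unchanged. The class and argument names (`MoDBlockSketch`, `capacity`) are mine.

```python
import torch
import torch.nn as nn

class MoDBlockSketch(nn.Module):
    """Toy Mixture-of-Depths-style block (a sketch, not the paper's code).

    A router picks the top-k tokens; only those go through the wrapped
    transformer block, while every other token passes through unchanged
    via the residual stream.
    """

    def __init__(self, block: nn.Module, d_model: int, capacity: int):
        super().__init__()
        self.block = block                   # any module mapping x -> same-shaped x
        self.router = nn.Linear(d_model, 1)  # scalar score per token
        self.capacity = capacity             # k = how many tokens get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        scores = self.router(x).squeeze(-1)                        # (b, t)
        weights, idx = torch.topk(scores, self.capacity, dim=-1)   # top-k per sequence

        # Run the expensive block only on the selected tokens.
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(x, 1, gather_idx)
        processed = self.block(selected)

        # Scale by the router weight so the router receives gradients
        # (the paper uses a similar trick), then scatter back into place.
        processed = processed * torch.sigmoid(weights).unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, gather_idx, selected + processed)
        return out

# Usage: wrap a toy "block" (here just an MLP) and run a dummy batch.
toy_block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
mod_block = MoDBlockSketch(toy_block, d_model=64, capacity=4)
print(mod_block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```

The expert-choice flavor sketched here keeps the compute budget static: the block always processes exactly `capacity` tokens per sequence, which is what makes the FLOP savings predictable.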
Metaphor: The Efficient Rocket
Imagine you're designing a rocket to launch a payload into orbit.
The Problem: Rockets need immense amounts of fuel (compute power) to propel themselves against gravity. Traditional rockets burn the same amount of fuel consistently, regardless of the stage of the flight. This is similar to how transformer models always use the same amount of compute on each word, even if some are easier to process.
Let's make our rocket smarter!
Instead of a single fuel tank and engine, the rocket is equipped with a series of smaller, modular engines. During flight, onboard sensors and computers (like the conditional computation algorithms in a language model) analyze real-time data like altitude, air resistance, and the payload's remaining weight. Based on this, they selectively fire only the necessary engines, sometimes at partial throttle, and at other times at full blast.
How This Maps onto Language Modeling
Tokens: Each word in a sentence is like a piece of the payload.
Transformers: Our traditional rocket with one big fuel tank and engine.
Modular Engines: Different computational pathways in the language model (similar to "experts" in a Mixture of Experts approach).
Sensors/Computers: The conditional computation algorithms deciding which computational paths (engines) to activate for each token (piece of payload).
Just as our rocket uses less fuel overall, conditional computation models like MoD use less processing power. This makes them faster and more energy-efficient.
Our smart rocket can adapt to varying conditions during the flight, optimizing fuel use for the specific mission. Similarly, conditional computation models can handle more complex sentences or languages by dynamically allocating more computing power as needed.
Just as modular engines might fit the rocket design better, conditional computation approaches can offer hardware optimization advantages, maximizing the usage of existing hardware.
Bottom Line: Think of how you communicate in the real world... do you expend the same amount of energy on every exchange?
Of course not.
MoD lets us tailor our processing. In language models, speed and resource usage are crucial. MoD promises similar performance with reduced computational costs, making these models cheaper to run. If we can get similar results with less power per word, it opens the possibility of handling larger sequences of text or even training more complex models.
There are other, longer-standing efficiency methods that Google evaluated in conjunction with MoD.
Mixture-of-Experts (MoE)
Training massive neural networks is expensive. Every parameter you add and every input token you process drives up those compute costs. When you hit the realm of language models with billions of parameters, things get scary, especially considering the amount of data needed to feed these beasts.
MoE: Divide and Conquer for Neural Networks
The Mixture of Experts approach tackles this by breaking down a large model into smaller, specialized "experts". Think of it like a company with different departments: experts in marketing, finance, product development, etc. Each expert is essentially a smaller neural network trained on a subset of the data to excel in a particular niche.
The Gating Network: Traffic Cop for Expertise
But how does a new input know where to go?
That's where the "gating network" comes in.
This is another (much smaller) neural network that learns to route data to the right expert or combination of experts. It's like the smart receptionist directing customers to the relevant departments.
Instead of trying to optimize a monolithic model on a mountain of data, I'm dealing with multiple smaller models that can be trained more effectively, potentially even in parallel.
During inference (when we use the model), only the relevant experts need to run. If a customer query for a financial product only activates the "finance" expert, I'm saving compute compared to running the entire giant model. If I can make my models more efficient without sacrificing performance, that means potential cost savings.
I may even be able to train bigger, better models within the same budget.
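Here is a minimal sketch of that setup, assuming simple MLP experts and top-1 routing; real MoE layers add load-balancing losses, capacity limits, and often route each token to its top two experts instead of one. The names below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayerSketch(nn.Module):
    """Minimal Mixture-of-Experts layer (an illustrative sketch, not production code).

    A small gating network scores the experts for each token, and every token
    is sent to its single best expert (top-1 routing) for simplicity.
    """

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # the "smart receptionist"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        flat = x.reshape(-1, d)                            # treat tokens independently
        gate_probs = F.softmax(self.gate(flat), dim=-1)    # (num_tokens, num_experts)
        best_prob, best_expert = gate_probs.max(dim=-1)    # top-1 choice per token

        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = best_expert == i                        # tokens routed to expert i
            if mask.any():
                # Only this expert runs for these tokens; weight by the gate prob.
                out[mask] = expert(flat[mask]) * best_prob[mask].unsqueeze(-1)
        return out.reshape(b, t, d)

# Usage: 4 experts, a dummy batch of 2 sequences with 16 tokens each.
moe = MoELayerSketch(d_model=64, num_experts=4)
print(moe(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```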
Metaphor: The Assembly Line with Specialized Workstations
Imagine a complex assembly line where a product (like a car) moves from station to station, undergoing a series of transformations.
The Traditional Assembly Line: This is like a standard transformer model. Every product (a sentence) goes through the same set of workstations (computational layers), regardless of its specific needs. Some products might need only simple operations, while others need complex, multi-step processes.
The MoE approach introduces specialized workstations along the assembly line.
Some focus on basic assembly, others on paint, others on installing advanced electronics, etc. Each workstation is essentially an expert in a particular type of transformation.
At the front of the line sits an intelligent routing system. This is like the "gating" mechanism in an MoE model.
It analyzes each incoming product (sentence) and decides which combination of specialized workstations it needs to pass through. A simple car model might only need the basic assembly, while a luxury car would go through many of the specialized experts.
How This Maps to Mixture of Experts
Assembly Line: The Transformer architecture
Product: A sentence or sequence of words
Workstations: The "expert" sub-networks within the MoE model
Intelligent Routing System: The gating function that decides which experts to use for a token
Just like products don't waste time at unnecessary workstations, MoE models avoid wasting computation on tokens that don't require complex processing.
Similar to how skilled workers on an assembly line are better at specific tasks, the expert networks in an MoE model can become highly optimized for specific types of language patterns.
Just as you can add more specialized workstations for new features, you can increase the capacity of your MoE model by adding more expert networks.
Free Alpha?
Life is filled with tradeoffs. The gating network adds a bit of complexity and computational cost. It's a trade-off between this overhead and the efficiency gains.
How do I effectively divide my data to train these experts? This becomes a new challenge to tackle.
Combining Approaches: MoE with MoD
MoE isn't magic, but it's a clever strategy in the fight against ever-growing model size and the compute costs that come with it. As a data engineer, it's a tool definitely worth having in my toolkit…
…but what happens if we layer MoE and MoD in the same system?
Now that we have a baseline let's start by quickly comparing these methods and then explore how they intersect.
Similarities
Fundamental Idea: Both MoE and MoD are based on the idea that instead of treating an entire input uniformly, it's beneficial to allow specialized processing of different parts of the input.
Efficiency Goal: Both techniques aim to improve the computational efficiency of large neural networks.
Key Differences
Focus of Specialization:
Mixture of Experts: Divides the model into multiple "experts." Each expert is a smaller neural network specializing in a particular type of input pattern. An input is routed to one or more of these experts for processing.
Mixture of Depths: Focuses on allocating computational resources across different depths of the model. It decides which words (tokens) deserve deeper processing in different layers.
Routing Mechanism:
MoE: Uses a "gating network" to decide which expert(s) should handle a particular piece of input.
MoD: Uses simpler top-k routing. The model identifies the k most important tokens at any given layer and lets everything else skip the block, a more straightforward approach.
Computational Tradeoffs
MoE: Often results in increased parameters because of the multiple experts. This can lead to higher training costs. However, during inference, not all experts need to be active, potentially saving FLOPs.
MoD: Aims to reduce FLOPs without significantly increasing parameters. It focuses on efficient allocation of existing computation within a model.
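To put rough numbers on the MoD side of that tradeoff, here is a back-of-the-envelope sketch (my own arithmetic with standard FLOP approximations, not figures from the paper). The MLP cost of a block falls linearly with the fraction of tokens it processes, while the self-attention cost falls roughly quadratically.

```python
# Back-of-the-envelope FLOPs for one transformer block: dense vs. MoD-routed.
# Standard approximations (not from the paper): attention ~ T^2 * d, MLP ~ T * d^2.

def block_flops(tokens: int, d_model: int, mlp_mult: int = 4) -> float:
    attention = 2 * tokens**2 * d_model                     # score + value mixing
    mlp = 2 * tokens * d_model * (mlp_mult * d_model) * 2   # two linear layers
    return attention + mlp

T, d = 4096, 1024
dense = block_flops(T, d)
routed = block_flops(T // 4, d)   # MoD block at 25% capacity: only T/4 tokens

print(f"dense block : {dense:.3e} FLOPs")
print(f"routed block: {routed:.3e} FLOPs ({routed / dense:.1%} of dense)")
```

In this toy estimate the routed block comes in at under a fifth of the dense block's cost, and the quadratic attention term is where long sequences benefit the most.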
Basically MoE divides the task among multiple specialized modules for efficient processing… whereas MoD dynamically re-allocates existing computational resources within a model, focusing deeper processing on the most important parts of the input.
MoD can come into play within each expert network in the MoE architecture.
This creates a powerful combination for 3 key reasons:
Double Efficiency Boost: MoD ensures each expert network allocates its resources efficiently, focusing on the most relevant parts of the input it receives from the gating network. This double layer of resource optimization could lead to significant cost savings.
Specialized Deep Dives: MoD allows each expert to delve deeper into the crucial parts of its assigned input. This can lead to a more nuanced understanding of different aspects of the data, potentially improving overall model performance.
Flexibility & Scalability: The modular nature of MoE makes it easier to add new experts or modify existing ones for new tasks. Combining this with MoD's dynamic allocation could lead to highly adaptable and scalable models.
Google DeepMind is already actively fusing MoD with MoE; the paper calls this combination Mixture-of-Depths-and-Experts (MoDE).
…and here is how the paper's discussion wraps up the implications of this approach:
Intuitively, a token might learn to route around blocks because the prediction being made at that step is easier, and hence, does not require as much compute. However, this strategy is undoubtedly not all that the network learns. If a token does not participate in self-attention at a certain block, then later tokens will also not be able to attend to it. Thus, whether tokens decide to route or not impacts both the current step’s prediction and future predictions via causal self-attention, and how the network balances these effects is guided by their influence on the overall language modeling objective.
This insight opens the door to MoD variants that decouple the routing for queries, keys and values. For example, perhaps a token would prefer to be among the queries, but not the keys, for a given self-attention computation. One can imagine extending this idea even further into the domain of "long-term memory": perhaps there are tokens that would be extremely valuable as keys, regardless of whether it is useful for them to also be among the queries at the step of their occurrence. Learned routing could be a powerful mechanism for deciding which tokens these might be, perhaps funnelling them into a long-term memory buffer that is available during future self-attention. One advantage of such an approach to long-term memory is that tokens decide once, at the moment of "memory encoding", whether they should be retrieved in the future. This is more computationally efficient than performing a full content-based lookup across an entire memory buffer for each step in the future, and could be one step towards drastically increasing the context-length available for making a prediction.
Unlike MoE transformers that route between effectively the same computation (usually MLPs), MoD transformers demonstrate the value of routing among different types of computations. In this work the types were either the conventional transformer block, or a null computation (functionally equivalent to multiplying by zero). However, one can imagine extending this idea further by routing between even more types of computation. For example, perhaps some tokens are routed to "memory lookup" functions, and others are routed to "tool use" functions.
In general, the routing machinery we deployed provides a knob for adjusting the types of computations available to the network and their relative cost (in total FLOPs); if one wants to introduce an expensive computation, then this can be offset by setting its capacity to some small amount, and hence, by routing only a small number of tokens to it.
Altogether, MoD transformers are another tool one can use to tune a model’s compute per forward pass (and hence inference time). The machinery used to implement MoD is also generic, and opens the doors to many extensions and integration with other techniques, such as MoE.
As you can see, we now have several novel approaches for increasing efficiency in both the training and inference phases of AI development.
New techniques will build on these concepts, and AI systems will increase their leverage dramatically over the next 12-18 months.
Mixture-of-Depths-and-Experts (MoDE) from Google DeepMind will be a big step in our journey.
Of course, we’re likely to see variations of these concepts applied to post-transformer architectures as different options emerge.
AI is accelerating faster and faster thanks to rising power and efficiency.
Here is a direct link to the Mixture-of-Depths paper page:
https://huggingface.co/papers/2404.02258