How Attention Residuals Are Rewiring the Modern LLM
TL;DR
The foundational wiring of large language models just got a long-overdue upgrade. For years, LLM architectures have relied on standard residual connections, which add every layer's output to the hidden state with a fixed weight of one. This uniform aggregation lets the hidden state grow without control as the network gets deeper. Now, researchers from the Kimi Team have introduced Attention Residuals. By applying softmax attention across the depth of the network, each layer can selectively pull the information it needs from previous layers using learned, input-dependent weights. To make this scale, they built Block Attention Residuals, which group layers into blocks and drastically reduce the memory footprint. The result is an architecture that matches the performance of standard models trained with 1.25x more compute.
This is a smarter, leaner, and fundamentally superior way to build a neural network.
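To make the contrast concrete, here is a minimal NumPy sketch of the idea for a single token. It is an illustration under stated assumptions, not the Kimi Team's implementation: the per-layer `query` vector, the scaled dot-product scoring, the `block_size` parameter, and the choice to summarize each block by summing its layers are all hypothetical, and the real parameterization may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def standard_residual(layer_outputs):
    """Plain residual stream: every earlier output is added with weight 1."""
    return layer_outputs.sum(axis=0)

def attention_residual(layer_outputs, query):
    """Softmax-weighted mix of earlier layer outputs (attention over depth).

    layer_outputs: (num_layers, dim) outputs of earlier layers for one token
    query: (dim,) hypothetical learned vector owned by the current layer
    """
    dim = layer_outputs.shape[-1]
    # Scores depend on the hidden states, so the weights are input-dependent.
    scores = layer_outputs @ query / np.sqrt(dim)
    weights = softmax(scores)            # non-negative, sums to 1 across depth
    return weights @ layer_outputs, weights

def block_attention_residual(layer_outputs, query, block_size):
    """Assumed block variant: attend over per-block summaries, not every layer."""
    num_layers, dim = layer_outputs.shape
    assert num_layers % block_size == 0, "sketch assumes equal-sized blocks"
    blocks = layer_outputs.reshape(num_layers // block_size, block_size, dim)
    summaries = blocks.sum(axis=1)       # one stored vector per block
    return attention_residual(summaries, query)

rng = np.random.default_rng(0)
outputs = rng.normal(size=(8, 16))       # 8 earlier layers, hidden size 16
query = rng.normal(size=16)

summed = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query)
block_mixed, block_weights = block_attention_residual(outputs, query, block_size=4)
```

Because the softmax weights form a convex combination, the norm of `mixed` can never exceed that of the largest single layer output, whereas the norm of `summed` has no such bound as layers are added. In the block variant, the current layer attends over 2 summaries instead of 8 layer outputs, which is where the memory saving would come from in this sketch.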
The Background
To understand why this is a monumental shift in AI …


