The Economic Crisis of Long Context
The first crisis is the inefficiency of the KV cache. For every token it generates, a standard transformer must read the keys and values of all previous tokens from off-chip GPU memory. At long context lengths, this memory-bandwidth requirement becomes the primary bottleneck, dwarfing the cost of computation itself. The second, less obvious crisis is the collapse of arithmetic intensity: during decoding, each byte of KV cache read from memory participates in only a handful of floating-point operations, so the GPU’s compute units sit idle waiting on memory. The result is catastrophically low hardware utilization, often under 5%.
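The scale of the problem is easy to estimate from first principles. The sketch below is a back-of-the-envelope cost model, not a measurement: the model dimensions (a 7B-class configuration without grouped-query attention) and the A100-class bandwidth and peak-FLOPS figures are illustrative assumptions, not values from this article.

```python
# Back-of-the-envelope decoding cost model. All model dimensions and
# hardware numbers are illustrative assumptions (roughly a 7B-class
# model without grouped-query attention, on A100-class hardware).

BYTES_PER_VALUE = 2                    # fp16 / bf16
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
D_MODEL = N_KV_HEADS * HEAD_DIM        # 4096

HBM_BANDWIDTH = 2.0e12                 # bytes/s (assumed)
PEAK_FLOPS = 312e12                    # fp16 tensor-core peak (assumed)
RIDGE = PEAK_FLOPS / HBM_BANDWIDTH     # ~156 FLOP/byte to be compute-bound

def kv_bytes_per_token(context_len: int) -> float:
    """Bytes of K and V read from HBM to generate ONE new token."""
    # 2 tensors (K and V) x layers x context x kv_heads x head_dim x bytes
    return 2 * N_LAYERS * context_len * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def attention_flops_per_token(context_len: int) -> float:
    """Approximate attention FLOPs for ONE new token (QK^T plus PV)."""
    # Per layer: two matmuls, each ~2 * context_len * d_model FLOPs.
    return N_LAYERS * 2 * (2 * context_len * D_MODEL)

for ctx in (4_096, 32_768, 131_072):
    traffic = kv_bytes_per_token(ctx)
    intensity = attention_flops_per_token(ctx) / traffic   # FLOP per byte
    floor_ms = traffic / HBM_BANDWIDTH * 1e3               # memory-bound floor
    print(f"ctx={ctx:>7,}  KV read/token={traffic / 1e9:6.2f} GB  "
          f"intensity={intensity:.1f} FLOP/B (ridge ~{RIDGE:.0f})  "
          f">= {floor_ms:5.1f} ms/token")
```

Under these assumptions, decode-time attention sits at a constant ~1 FLOP per byte, two orders of magnitude below the ~156 FLOP/byte ridge point, which is why utilization collapses no matter how fast the compute units are.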
The Reliability Crisis of Long Context
The reliability of an agent’s output depends critically on the context it receives. Research has revealed fundamental failure modes in modern language models, including:
- The “Lost in the Middle” Problem: Models retrieve information placed at the beginning or end of a long context window far more reliably than information buried in the middle (a probe sketch follows this list).
- Referencing Failures: Models often fail to correctly reference or attribute information that is present in the context, producing hallucinations and unsupported answers.
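To see how the “Lost in the Middle” failure is typically measured, here is a minimal needle-in-a-haystack probe. It is a sketch under stated assumptions: `generate` is a placeholder for whatever prompt-to-text model call you use, and the needle, filler sentence, and depths are illustrative.

```python
# Minimal needle-in-a-haystack probe (sketch). `generate` is a placeholder
# for any prompt -> text model call; the needle and filler are illustrative.

NEEDLE = "The vault access code is 7431."
FILLER = "The quick brown fox jumps over the lazy dog."

def build_prompt(n_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences) + "\n\nQuestion: What is the vault access code?"

def probe(generate, n_sentences: int = 2000) -> None:
    """Sweep needle depth; 'lost in the middle' shows up as misses mid-sweep."""
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        answer = generate(build_prompt(n_sentences, depth))
        print(f"depth={depth:.2f}  retrieved={'yes' if '7431' in answer else 'NO'}")
```

Called as `probe(my_model)`, a model exhibiting the failure typically answers correctly at depths 0.0 and 1.0 but misses around 0.5.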