September 29, 2025

KV Cartridges

Interesting paper out of Stanford. From the abstract:

When we put lots of text (e.g. a whole code repo) into a language model’s context, generation cost soars because of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by 26× while maintaining quality. These smaller KV caches, which we call cartridges, can be trained once and reused for different user requests.

They have a model generate question-and-answer pairs with the target text in context, then train the cartridge on that synthetic data via context distillation, so the model conditioned on the small cartridge learns to match the model conditioned on the full text.
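Here's a minimal sketch of what that training loop could look like, assuming a HuggingFace-style causal LM whose forward accepts a `past_key_values` cache. The names (`make_cartridge`, `distill_step`, `doc_ids`, `qa_ids`) and the exact cache plumbing are my own illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Assumed: a frozen HF-style causal LM `model` (model(input_ids, past_key_values=...).logits),
# `doc_ids` = tokenized target corpus, and batches of `qa_ids` = synthetic Q&A token ids
# generated with the document in context (the paper's "self-study" data).

def make_cartridge(model, num_tokens=2048):
    """Trainable per-layer (key, value) tensors standing in for a compressed KV cache."""
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    n_kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    return torch.nn.ParameterList(
        [torch.nn.Parameter(0.02 * torch.randn(2, 1, n_kv_heads, num_tokens, head_dim))
         for _ in range(cfg.num_hidden_layers)]
    )

def distill_step(model, cartridge, doc_ids, qa_ids, optimizer):
    # Teacher: full document + Q&A in context. This is the expensive path we want to
    # avoid at serving time, so it runs without gradients.
    with torch.no_grad():
        teacher_logits = model(torch.cat([doc_ids, qa_ids], dim=1)).logits[:, -qa_ids.shape[1]:]

    # Student: the same Q&A tokens, conditioned only on the trainable cartridge.
    # NOTE: how a trainable cache is fed in depends on your transformers version; this
    # uses the legacy tuple-of-(key, value) format purely as an illustration.
    past = tuple((layer[0], layer[1]) for layer in cartridge)
    student_logits = model(qa_ids, past_key_values=past).logits

    # Match the teacher's next-token distributions; only the cartridge gets updated.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer would only see the cartridge parameters (e.g. `torch.optim.Adam(cartridge.parameters(), lr=1e-3)`), with the base model frozen. At serving time you'd load the trained cartridge as the prefix cache instead of prefilling the whole corpus on every request, which is where the throughput win in the abstract comes from.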

Would love to try this on documentation for something like Bun or Cloudflare, which LLMs often struggle with because they aren't well represented in training data and are a little idiosyncratic. The self-study step could work the same way it does in the paper, but we'd ultimately need a test suite of real-world tasks for evaluation, and I don't think it'd make sense for that to be the same as the training set.

If it showed promise on smaller models, you could use it to serve some of the monster open-weights models from Moonshot, Z.Ai, and others.

Add it to the list of things I'd like to do if I had more time, energy, and GPUs.