9/29/2025

KV Cartridges

Interesting paper out of Stanford.

When we put lots of text (e.g. a whole code repo) into a language model’s context, generation cost soars because of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by 26× while maintaining quality. These smaller KV caches, which we call cartridges, can be trained once and reused for different user requests.
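
For intuition on why generation cost soars, here’s a rough back-of-envelope in Python (the config numbers are my assumption, roughly a Llama-3-8B-style model with grouped-query attention):

```python
# Hypothetical config, roughly Llama-3-8B: 32 layers, 8 KV heads, head_dim 128
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16/bf16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, stored per layer and per KV head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

print(f"{kv_cache_bytes(128_000) / 1e9:.1f} GB")  # ~16.8 GB for a 128k-token context, per sequence
```

Every decoded token attends over that whole cache, and the cache has to sit in GPU memory for each concurrent request, so a cartridge that is orders of magnitude smaller buys a lot of throughput.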

They have a model generate question and answer pairs with the target text in context, then use that synthetic data for context distillation training.
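
Here’s a minimal sketch of the training step in PyTorch, treating the cartridge as a trainable KV prefix distilled against a teacher that sees the real document. The toy shapes, the single attention head, and the MSE objective are all my simplifications; the paper distills a full model:

```python
import torch
import torch.nn.functional as F

d, n_doc, n_cart, n_q = 64, 1024, 32, 16   # head dim, doc length, cartridge length, query length

doc_k, doc_v = torch.randn(n_doc, d), torch.randn(n_doc, d)  # frozen KV cache over the full document

cart_k = torch.nn.Parameter(0.02 * torch.randn(n_cart, d))   # trainable cartridge keys
cart_v = torch.nn.Parameter(0.02 * torch.randn(n_cart, d))   # trainable cartridge values
opt = torch.optim.Adam([cart_k, cart_v], lr=1e-2)

def attend(q, k, v):
    return torch.softmax(q @ k.T / d**0.5, dim=-1) @ v

for step in range(200):
    q = torch.randn(n_q, d)                    # stand-in for queries from the synthetic Q&A pairs
    with torch.no_grad():
        teacher = attend(q, doc_k, doc_v)      # teacher attends over the full 1024-token cache
    student = attend(q, cart_k, cart_v)        # student attends over only 32 cartridge slots
    loss = F.mse_loss(student, teacher)        # match the teacher's outputs (distillation)
    opt.zero_grad(); loss.backward(); opt.step()
```

The cartridge here is 32x smaller than the document cache; train it once offline, then serve any request against the small cache.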

Would love to try this on documentation for something like Bun or Cloudflare, which LLMs often struggle with because they aren’t well represented in training data and are a little idiosyncratic. The self-study step could work the same way the authors did it in their paper, but we’d ultimately need a test suite of real-world tasks, and I don’t think it’d make sense for that to be the same as the training set.

If it showed promise on some smaller models you could use it to serve some of the monster open weights models from Moonshot, Z.Ai and others.

Add it to the list of things I’d like to do if I had more time, energy, and GPUs.

9/28/2025

On Alignment

Have you ever tried to lose weight? Part of you wants to lose the weight, but another part wants the cake. You’re not even aligned internally, and it gets worse. Underneath the desire to lose weight there’s probably a part that wants to look better, and a part that wants to be healthier. Those parts are superficially aligned for the moment but you can see how they might diverge.

Now add a second person to the mix.

There’s a reason it’s a trope for a father to tell a child: “Ask your mother.” Deferring authority can be an effective strategy for maintaining alignment, so long as there is trust in who is being deferred to. Scaling that past a pair or a small team is difficult. “Disagree then commit” is a version of this, deferring authority to a trusted process amongst a trusted team.

Group selection is another common strategy. The stronger the filters for joining a group, the easier it is to maintain alignment. A rowing crew filters for a type: willing to wake up very early and work very hard. They’re willing to do those things, but they probably also want to be known as someone who does those things.

This is how a scene or a subculture starts, but as they grow in popularity the filters become weaker and alignment fractures. Same goes for startups. If you filter well, have a mission worth believing in, and a leader worth trusting, small groups can stay aligned well past small. At some point though The Game inevitably takes over.

The Game is universal. It is the result of humans organizing into hierarchies.

-twitter user meatball times

The Game is the meta strategy, politics, and maneuvering that often needs to happen to get things done in coordination systems. Deciding which rules to follow and when, and which to enforce on whom and when. Whisper networks and shadow assignments and stated goals flanked by real ones.

To the extent these are in service of the mission you might not even see them as bad. The founder is likely playing The Game from day one, but doing it so well it’s unremarkable. The ghost in the machine of the mission.

But it’s more commonly derided because it’s more commonly used to advance an individual’s goals in ways that only partially, if at all, align with the organization’s. This is how department heads grow head count with nothing to show for it, or an IC “fails” upward into management.

Of course that’s not the only way large organizations get misaligned. We all operate on imperfect information, and as the surrounding situation changes it’s hard to get a lot of people to agree on a path forward, even if the destination is agreed upon. Should you trudge through the bog or attempt to find a way around it?

Now picture all this at the societal level. Longshoremen threatening strikes, teachers avoiding phonics, central bankers setting rates, homeowners fighting developers, enterprises litigating regulations. Which of these are misaligned to society? To their own interests? Are you sure?

True alignment at this level is impossible. Maybe that’s a cop-out of a phrase anyway; maybe all we need is loose alignment. But maybe that’s a cop-out as well - who says we need alignment at all? At what scale? The tension of competing interests may well be what drives society forward. And there is the rub. What is forward?

I shouldn’t have written 700-some-odd tokens on alignment without discussing orientation, but here we are. It’s easy to assume we know what we’re aligning toward in any given context, even when it’s very likely we don’t. That scares me more than anything.

5/6/2025

The Value of Windsurf

It’s been rumored for a few weeks now, but it’s beginning to look like the rumors were true: OpenAI is about to close a $3 billion deal to acquire Windsurf (fka Codeium).

I’ve found myself repeating the same points to a bunch of folks at Gauntlet who didn’t see how they could be valued so highly, so I’ll repeat them here again today.

First, a highly rated HN comment on today’s Bloomberg article as an example of the doubts:

I would also argue that the product could be built over two weekends with a small team. They offer some groundbreaking solutions, but since we know that they work and how, it’s easy to replicate them… That also means they have significant talent there. Hence, they are also buying the employees. The code base itself is basically worth nothing, in my opinion.

It’s just a VS Code fork, they’re just using Claude/Gemini/4o, there’s no moat, etc. etc. These would all be fair critiques if it were Cline or Roo Code being bought for $3B. But that’s not what’s being bought!

Enterprise

What’s being bought is an enterprise machine learning company in the fastest growing space for applied AI. You don’t need to read any tea leaves to understand this.

  • Codeium has been open about this strategy for quite some time.
  • When they came to talk to us at Gauntlet, they mentioned several times that they started as an ML infrastructure company before pivoting to code completion, and that’s long been an advantage.
  • When I asked about their split between product engineers and sales engineers, they wouldn’t give hard numbers but implied it was nearly a 50/50 split.
  • When you look at their careers page they have roles for things like Product Strategist, Federal.

So they: talk about being enterprise focused, have the talent and knowhow to sell and deploy into enterprise, staff their org to sell and deploy into the enterprise, and strategize about how to continue to sell and deploy into enterprise.

So yes, it seems fair to say the Windsurf VS-Code-fork-backed-by-SOTA-models client would be overvalued if that’s what was being purchased. But it seems to me that’s not really what’s being purchased.

5/1/2025

Side-Skilling

When I was a PM at CustomInk we were growing fast, which meant hitting new bottlenecks every few months or so.

At one point our product teams were sharing UX resources, and design wasn’t able to keep up with development. Rather than slow down our team, I started pushing pixels, leaning heavily on writing from Jared Spool and Steve Krug to crystallize and make explicit what had mostly been intuition.

Later, as data requests started queuing up for weeks, I complained to my manager. He said I could probably pick up SQL in a week and grabbed a small reference book off his shelf. He was right, and our team was off to the races again.

Each sidestep kept the team moving and added a new tool to my belt. That pattern still matters, but the reasons have changed.

Side-skilling vs Up-skilling

Up-skilling digs deeper into your current lane. A frontend engineer becomes proficient at animation and motion design. An ops manager goes deep on Six Sigma. A general ledger accountant masters driver-based forecasting. Up-skilling sharpens a single blade.

Side-skilling extends outward, picking up neighbouring competencies that let you diagnose and unblock the whole system. It turns the blade into a multi-tool.

Why it matters now more than ever

LLMs can cover the first draft of almost any specialised task. What’s left is judgment: knowing which outputs to trust, which follow-ups to run, when to switch perspective. Good judgment requires a view across functions, not just down one lane.

Why it’s easier than ever

The learning curve for a new domain has collapsed. You can pair with an LLM on SQL, ask it to critique a Figma mock, or have it explain logistics KPIs, then test what it shows you. Curiosity plus a few focused sessions often gets you to “good enough to unblock the team.”

Working practice

  1. Identify the nearest constraint outside your role.
  2. Learn the minimum skill to relieve that pressure.
  3. Repeat as constraints move.

Specialisation anchors your craft; side-skilling broadens your map of the system. Both are becoming table-stakes.

Where to go from here