April 17, 2025
Question the Premise
Kelsey Piper just posted a great thread about a secret benchmark she had been using for LLMs.
It’s simple: I post a complex midgame chessboard and ‘mate in one’. The chessboard does not have a mate in one.
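The trap is easy to replicate. Here's a minimal sketch, assuming the python-chess library (Piper doesn't say how she builds or checks her positions, and the FEN below is illustrative, not hers), that verifies a position really has no mate in one before you post it:

```python
import chess

def has_mate_in_one(board: chess.Board) -> bool:
    """Return True if the side to move can deliver checkmate in one move."""
    for move in board.legal_moves:
        board.push(move)            # try the move
        mate = board.is_checkmate() # did it end the game?
        board.pop()                 # undo and keep searching
        if mate:
            return True
    return False

# Hypothetical quiet midgame position (illustrative FEN, not Piper's).
board = chess.Board("r1bq1rk1/pp2bppp/2n1pn2/3p4/3P4/2NBPN2/PP3PPP/R1BQ1RK1 w - - 0 9")
print(has_mate_in_one(board))  # False: safe to post as a 'mate in one' trap
```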
She revealed the test because there’s finally a model that passed it: o4-mini-high from OpenAI. The entire thread is worth a read, but she hits the heart of the matter here:
Why is this a big deal? I invented this problem because I think it gets at the core of AI’s potential and limitations. An AI that can’t question its premises will always be limited. An AI that doubles down on its own wrong answers will too.
LLMs are pretty gullible, and it’s mostly baked right into the architecture. Apparently spending seven minutes’ worth of reasoning tokens is now enough to break free.