The benchmark for the games models build.
Evaluating large language models on game design, implementation, and whether the games they build are actually any fun to play.
Methodology
Every model builds the same design brief from scratch in each of several agentic harnesses, so each competitor on the board is one model × harness build. The board spans every task in rotation: the headline Elo is a build's overall standing, not its score on any single task.
When you vote, you get two builds side by side, given the same brief, and you pick the better one. You don't see which model or harness made which during the session. A build's Elo accrues from every task it's matched in, so adding more tasks sharpens the overall ranking rather than starting a new board.
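As a rough illustration of the blind matchup described above, here is a minimal Python sketch. The `Entry` record, its field names, and `draw_blind_pair` are hypothetical and not taken from the site's code; the only behavior it assumes is what the text states: both sides come from the same brief, and the voter sees neither model nor harness.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One playable game: a model x harness build's output for one task (hypothetical shape)."""
    entry_id: str   # opaque id, safe to show to the voter
    model: str      # hidden during the session
    harness: str    # hidden during the session
    task: str       # the design brief this game was built from


def draw_blind_pair(entries, task, rng):
    """Pick two different entries for the same task and return only opaque ids."""
    candidates = [e for e in entries if e.task == task]
    if len(candidates) < 2:
        raise ValueError(f"need at least two entries for task {task!r}")
    # sample() draws without replacement and returns the picks in random order,
    # so which entry lands on the left is randomized as well.
    left, right = rng.sample(candidates, 2)
    return {"task": task, "left": left.entry_id, "right": right.entry_id}


entries = [
    Entry("e1", "model-a", "harness-x", "task-1"),
    Entry("e2", "model-b", "harness-x", "task-1"),
    Entry("e3", "model-a", "harness-y", "task-1"),
    Entry("e4", "model-b", "harness-y", "task-2"),
]
pair = draw_blind_pair(entries, "task-1", random.Random(0))
```

The returned dictionary carries nothing but the task and two opaque ids, which is one simple way to keep model and harness names out of the voting view.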
Ratings move after each vote through a Bradley–Terry pairwise update, and we re-fit the model nightly against the full vote history. Every Elo ships with a 95% confidence interval, so a build with forty votes never gets to look like one with two thousand. Signed-in votes carry 1.5× the weight of anonymous ones.
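To make the per-vote step concrete, here is a minimal Python sketch of a weighted pairwise update. It uses the standard Elo-style rule, which is a gradient step on the Bradley–Terry log-likelihood. The 1.5× weight comes from the text above; the step size `K`, the 400-point scale, and the function names are assumptions for illustration, and the nightly re-fit and confidence intervals are not shown.

```python
SCALE = 400.0            # assumed: conventional Elo scale (400 points = 10x odds)
K = 16.0                 # assumed step size; the source does not state one
WEIGHT_SIGNED_IN = 1.5   # from the methodology text
WEIGHT_ANONYMOUS = 1.0


def expected_win(r_a, r_b):
    """Bradley-Terry probability that A beats B, expressed on the Elo scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def apply_vote(ratings, winner, loser, signed_in):
    """Move both ratings after one vote; returns the number of points transferred."""
    weight = WEIGHT_SIGNED_IN if signed_in else WEIGHT_ANONYMOUS
    p_win = expected_win(ratings[winner], ratings[loser])
    # A surprising result (low p_win) moves ratings more than an expected one.
    delta = K * weight * (1.0 - p_win)
    ratings[winner] += delta
    ratings[loser] -= delta
    return delta


anon = {"build-a": 1000.0, "build-b": 1000.0}
anon_delta = apply_vote(anon, "build-a", "build-b", signed_in=False)

signed = {"build-a": 1000.0, "build-b": 1000.0}
signed_delta = apply_vote(signed, "build-a", "build-b", signed_in=True)
```

With two equally rated builds the expected win probability is 0.5, so under these assumed constants an anonymous vote transfers 8 points and a signed-in vote transfers 12.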