Overview

This benchmark evaluates LLM chess-playing ability by having models play games against calibrated engine anchors and other LLMs. Ratings are calculated using the Glicko-2 rating system, calibrated to approximate Lichess Classical ratings. Full implementation details can be found on GitHub.

Game Format

  • Input: LLMs receive the current board position in FEN notation and as an ASCII board, the move history in algebraic notation, the opponent's last move, and their own previous chain-of-thought (if any)
  • Output: Models must respond with a single move in UCI format (e.g., "e2e4")
  • Illegal moves: If a model plays an illegal move, it receives one retry with feedback that the move was illegal. A second illegal move results in forfeit, as per FIDE rules (see the sketch after this list)
  • Time control: No explicit time limit per move, but API timeouts apply
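To make the protocol concrete, here is a minimal sketch of the per-move loop in Python, using the python-chess library for legality checks. The query_model callable is hypothetical, a stand-in for the actual prompting code on GitHub:

    import chess

    def play_move(board: chess.Board, query_model):
        """Request one move from the model, allowing a single retry.

        Returns a legal chess.Move, or None to signal forfeit.
        query_model(board, feedback) is a hypothetical callable that prompts
        the LLM (FEN, ASCII board, history, last move, prior chain-of-thought)
        and returns a UCI string such as "e2e4".
        """
        feedback = None
        for _attempt in range(2):  # first try plus one retry
            raw = query_model(board, feedback)
            try:
                move = chess.Move.from_uci(raw.strip())
            except ValueError:  # unparseable output counts as illegal
                move = None
            if move is not None and move in board.legal_moves:
                return move
            feedback = f"Your move '{raw}' was illegal. Try again."
        return None  # second illegal move: forfeit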

Rating System

We use the Glicko-2 rating system, which improves upon traditional Elo by tracking rating deviation (uncertainty) and rating volatility.

  • Initial rating: New players start at 1500 with high uncertainty
  • Rating deviation (RD): Represents confidence in the rating. Lower RD means more certainty. Minimum RD is 30.
  • 95% Confidence Interval: The range within which there is 95% confidence the true rating lies, roughly the rating ± 2 × RD (see the sketch after this list)
  • No rating floor: Traditional chess federations and servers apply rating floors for practical reasons such as abuse prevention and ego preservation. Given a standard minimum rating, many early LLMs would collapse to the same floor despite having quite different capabilities, so our rating system has no minimum; this lets us compare models at the bottom of the rating ladder. The negative ratings this produces may be surprising, but the Glicko scale is relative: all significance derives from differences between models rather than absolute numbers, so negative ratings are just as mathematically valid as positive ones.
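For concreteness, the displayed interval follows directly from the rating and RD under the normal approximation. A minimal sketch, using the standard two-sided 95% quantile of 1.96 (Glicko write-ups often round this to 2):

    def confidence_interval_95(rating: float, rd: float) -> tuple[float, float]:
        """95% CI under the normal approximation: rating +/- 1.96 * RD."""
        return (rating - 1.96 * rd, rating + 1.96 * rd)

    # e.g. a model rated 1500 with RD 80 -> roughly (1343, 1657)
    print(confidence_interval_95(1500, 80))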

Anchor Calibration

To provide a meaningful estimate of strength, we calibrate LLMs by having them face engine anchors with known Lichess ratings. This lets ratings across the entire pool correspond to the Lichess Classical scale. These anchors include:

  • Maia, neural-network engines trained to imitate human play at specific skill levels.
  • A random mover as a baseline, assigned a rating of 400, the lowest possible Lichess rating.
  • Eubos, a stronger engine used to better estimate the strength of the top LLMs.

Anchor ratings are fixed and never updated. The anchor engines have played thousands of games at Lichess Classical time controls, so their frozen ratings are well validated, though an anchor's live Lichess rating may drift slightly over time. LLM ratings are calibrated through their performance against these anchors, which span the bottom, middle, and upper ranges of human Lichess Classical strength to ensure calibration across the whole pool.
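The asymmetry can be sketched as follows. The ANCHORS set, the player records, and the simplified update step are all illustrative stand-ins rather than the actual implementation; the point is only that anchors enter the update as opponents but never have their own ratings changed:

    # Hypothetical structures for illustration; the real implementation is on GitHub.
    ANCHORS = {"maia-1100", "maia-1500", "maia-1900", "random-mover", "eubos"}

    def update_glicko2(player, opponent, score):
        """Stand-in for the real Glicko-2 update (a simplified Elo step here)."""
        expected = 1.0 / (1.0 + 10 ** ((opponent["rating"] - player["rating"]) / 400))
        k = 32  # illustrative step size; Glicko-2 derives this from RD instead
        return {**player, "rating": player["rating"] + k * (score - expected)}

    def apply_game_result(players, white, black, score):
        """Update ratings after one game; anchor ratings stay frozen.

        players maps name -> {"rating": ..., "rd": ..., "vol": ...};
        score is 1.0 (white win), 0.5 (draw), or 0.0 (black win).
        """
        for name, opp, s in ((white, black, score), (black, white, 1.0 - score)):
            if name in ANCHORS:
                continue  # anchors are fixed and never updated
            players[name] = update_glicko2(players[name], players[opp], s)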

FIDE Estimation

FIDE estimates are derived from Lichess Classical ratings using conversion data from the ChessGoals.com rating survey. These are rough approximations.

The ChessGoals survey only maps Lichess ratings to FIDE within the range of 1715 to 2500 Lichess Classical. Consequently, models rated outside that range are given N/A for their FIDE estimate.
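The conversion itself can be sketched as piecewise-linear interpolation over survey points. The CONVERSION pairs below are illustrative placeholders, not the actual ChessGoals values:

    # Illustrative (Lichess Classical, FIDE) pairs; not the real survey data.
    CONVERSION = [(1715, 1400), (2000, 1700), (2300, 2050), (2500, 2300)]

    def fide_estimate(lichess: float):
        """Piecewise-linear map from Lichess Classical to FIDE.

        Returns None (shown as N/A) outside the surveyed 1715-2500 range.
        """
        if lichess < CONVERSION[0][0] or lichess > CONVERSION[-1][0]:
            return None
        for (x0, y0), (x1, y1) in zip(CONVERSION, CONVERSION[1:]):
            if x0 <= lichess <= x1:
                return y0 + (lichess - x0) * (y1 - y0) / (x1 - x0)
        return None  # unreachable given the bounds check above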

Legal Move Rate

The Legal% metric shows what percentage of an LLM's moves were legal on the first attempt (before any retry).
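As a sketch, assuming each move is recorded with a flag for whether its first attempt was legal:

    def legal_percent(first_try_legal: list[bool]) -> float:
        """Percentage of moves that were legal on the first attempt."""
        if not first_try_legal:
            return 0.0
        return 100.0 * sum(first_try_legal) / len(first_try_legal)

    # e.g. 47 of 50 first attempts legal -> 94.0
    print(legal_percent([True] * 47 + [False] * 3))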

Limitations

  • Due to cost, our sample size of games is relatively low for most models.
  • Performance may vary based on prompt format, temperature settings, and inference provider.
  • Results may differ from play against humans. I predict that LLMs would perform below these ratings against human opponents, who are better able to find and exploit systematic weaknesses in play.

Notes

  • Unlike the other anchor engines (which are all neural networks), Eubos uses search, so its strength varies greatly with the time it is given, and setting a time limit is important. We give it 15 minutes for the whole game with a 10-second increment per move, a Lichess Classical time control (sketched at the end of this list).
  • Some models, such as Grok 4.1 Fast, are currently difficult to collect a large sample for due to API errors. Others take a very long time to run because of slow token throughput and/or an unusually high number of thinking tokens; this explains the low game counts of most Grok, DeepSeek Thinking, and Kimi Thinking models.
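For reference, a 15+10 clock can be expressed through python-chess's engine interface roughly as follows (a sketch; the actual harness configuration is on GitHub):

    import chess.engine

    # 15 minutes base plus a 10-second increment per side ("15+10").
    LIMIT = chess.engine.Limit(
        white_clock=900.0, black_clock=900.0,
        white_inc=10.0, black_inc=10.0,
    )
    # With an open engine: result = engine.play(board, LIMIT)
    # The harness itself tracks and decrements the remaining clocks between moves.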