March 05, 2026
One Simple Fix That Makes LLM Benchmark Rankings Actually Agree
Model A wins on MMLU. Model B wins on ARC-Challenge. Model C wins on HellaSwag. At some point you stop trusting any of them, not because benchmarks are meaningless, but because no two of them seem to tell the same story about which model is actually better.

We ran the numbers on this. Across 61 language models and 24 benchmarks, we measured how consistently different benchmarks agree on model rankings using Kendall’s τ. Under standard direct evaluation (take a model, run it on a benchmark, done), average cross-benchmark agreement sits at τ = 0.52. That’s a modest correlation at best: two benchmarks can easily give very different pictures of which model is stronger.

The model you’d ship based on MMLU might not be the one you’d pick from ARC-Challenge. That’s not a vague concern. It’s a measurable, systematic problem. We found a fix. It’s...
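For concreteness, here is a minimal sketch of that agreement measurement: treat each benchmark as a column of scores over the same set of models, and average Kendall's τ over all pairs of benchmark columns. The array shape and the random placeholder scores are stand-ins, not our data; only the metric itself comes from the post.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

# Placeholder scores: 61 models x 24 benchmarks (random stand-in data,
# not the actual evaluation results discussed in the post).
rng = np.random.default_rng(0)
scores = rng.random((61, 24))

def mean_cross_benchmark_tau(scores: np.ndarray) -> float:
    """Average Kendall's tau over all pairs of benchmark columns.

    Each column induces a ranking of the same models; tau measures how
    well two benchmarks agree on that ranking.
    """
    n_benchmarks = scores.shape[1]
    taus = []
    for i, j in combinations(range(n_benchmarks), 2):
        tau, _ = kendalltau(scores[:, i], scores[:, j])
        taus.append(tau)
    return float(np.mean(taus))

print(f"mean cross-benchmark tau = {mean_cross_benchmark_tau(scores):.2f}")
```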