Guanhua Zhang

Ph.D. Student at MPI-IS

About Me

Hi! I’m a Ph.D. student at the Max Planck Institute for Intelligent Systems, advised by Moritz Hardt. Previously, I worked with Shiyu Chang and earned my B.S. from Harbin Institute of Technology.

Research

Have you noticed how every new LLM claims to be “the best” at something?

Benchmarks have been a key driver of progress in machine learning, but the old engine is losing its magic. Every benchmark gives different rankings—sometimes contradictory ones—making it nearly impossible to know which model is actually better. My research investigates why benchmarking is broken and how we can fix it.

The Benchmark Science

The confusion stems from LLMs handling diverse tasks that yield conflicting rankings. Drawing inspiration from social choice theory, we’ve shown that it’s fundamentally impossible to create a robust unified ranking from diverse benchmarks (think: Arrow’s impossibility theorem for benchmarking).
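For intuition, here is a toy illustration of how majority voting over benchmarks can produce a cycle (the classic Condorcet paradox, which is closely related to Arrow's theorem). It is only a sketch: the three models, the three benchmarks, and their rankings are made up, and it is not the formal result mentioned above.

```python
from itertools import combinations

# Hypothetical rankings (best to worst) of three models on three benchmarks.
rankings = {
    "bench_1": ["A", "B", "C"],
    "bench_2": ["B", "C", "A"],
    "bench_3": ["C", "A", "B"],
}

def pairwise_winner(x, y, rankings):
    """Return whichever of x, y is ranked higher on a majority of benchmarks."""
    x_wins = sum(r.index(x) < r.index(y) for r in rankings.values())
    return x if x_wins > len(rankings) / 2 else y

# Majority winner for every pair of models.
majority = {
    (x, y): pairwise_winner(x, y, rankings)
    for x, y in combinations("ABC", 2)
}

for (x, y), winner in majority.items():
    print(f"{x} vs {y}: majority prefers {winner}")
```

In this toy setup the majority prefers A over B, B over C, and C over A, so no single ranking is consistent with every pairwise majority.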

To make matters worse, LLM evaluation has become prohibitively expensive, to the point of being nearly unaffordable for academic labs. Unfortunately, we’ve also shown that existing methods for efficient evaluation often fail when assessing genuinely new models.

But here’s where it gets interesting: these challenges can be effectively mitigated through train-before-test, which means fine-tuning each model on the same task-specific data before evaluating it. This straightforward approach achieves remarkable ranking agreement across 24 tasks, finally giving us a robust and efficient solution.
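To make "ranking agreement" concrete, one common way to quantify it is Kendall's tau between the model rankings that two tasks induce. The sketch below assumes that metric purely for illustration (whether the work above uses exactly this measure is an assumption), and the model names and scores are invented, not real results.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models (assumes no ties)."""
    models = list(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both tasks order this pair of models the same way.
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores of three models on two tasks (illustrative numbers only).
task_1 = {"model_x": 0.71, "model_y": 0.64, "model_z": 0.58}
task_2 = {"model_x": 0.40, "model_y": 0.52, "model_z": 0.47}

tau = kendall_tau(task_1, task_2)
print(f"Kendall's tau between the two task rankings: {tau:.2f}")
```

A tau of 1 means the two tasks rank the models identically, while values near 0 or below mean the rankings mostly disagree.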


Trustworthy AI Systems

Beyond evaluation, I work on making AI systems responsible and trustworthy.

Get in Touch

📧 guanhua.zhang [AT] tuebingen.mpg.de
💻 GitHub · LinkedIn

Interested in collaboration? Feel free to reach out!