About Me
Hi! I’m a Ph.D. student at the Max Planck Institute for Intelligent Systems, advised by Moritz Hardt. Previously, I worked with Shiyu Chang and earned my B.S. from Harbin Institute of Technology.
Research
Have you noticed how every new LLM claims to be “the best” at something?
Benchmarks have been a key driver of progress in machine learning, but the old engine is losing its magic. Every benchmark gives different rankings—sometimes contradictory ones—making it nearly impossible to know which model is actually better. My research investigates why benchmarking is broken and how we can fix it.
The Science of Benchmarking
The confusion stems from LLMs being evaluated on diverse tasks that yield conflicting rankings. Drawing inspiration from social choice theory, we've shown that it is fundamentally impossible to create a robust unified ranking from diverse benchmarks (think: Arrow's impossibility theorem for benchmarking).
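To see why aggregation breaks down, here is a toy illustration in the spirit of social choice theory, with made-up models and benchmarks rather than the construction from our work: majority voting over per-benchmark rankings can produce a cycle, so no single unified ranking agrees with the majority on every comparison.

```python
# Toy illustration (made-up data): three benchmarks rank three hypothetical
# models A, B, C, and pairwise majority voting over the rankings is cyclic.
from itertools import combinations

benchmark_rankings = {            # best -> worst on each benchmark
    "reasoning": ["A", "B", "C"],
    "coding":    ["B", "C", "A"],
    "chat":      ["C", "A", "B"],
}

def majority_prefers(x, y):
    """True if most benchmarks rank model x above model y."""
    wins = sum(r.index(x) < r.index(y) for r in benchmark_rankings.values())
    return wins > len(benchmark_rankings) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"{winner} beats {loser} on a majority of benchmarks")
# Prints that A beats B, C beats A, and B beats C: a cycle, so no single
# ranking can respect all majority comparisons.
```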
To make matters worse, LLM evaluation has become prohibitively expensive, to the point of being nearly unaffordable for academic labs. Unfortunately, we've shown that existing methods for efficient evaluation often fail when assessing genuinely new models.
But here’s where it gets interesting: these challenges can be effectively mitigated by train-before-test, where each model is fine-tuned on the same task-specific data before evaluation. This straightforward approach achieves remarkable ranking agreement across 24 tasks, finally giving us a robust and efficient solution.
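For concreteness, here is a minimal sketch of the protocol. The `finetune()` and `evaluate()` helpers and the `task` objects are hypothetical placeholders for whatever training and scoring pipeline you already use; the point is the ordering, since every model receives the same task-specific fine-tuning step before it is scored.

```python
# Sketch of train-before-test: fine-tune each model on the task's training
# data, then evaluate, and only then compare models on that task.

def rank_with_train_before_test(models, tasks, finetune, evaluate):
    """Return, for each task, the models ranked after task-specific fine-tuning."""
    rankings = {}
    for task in tasks:
        scores = {}
        for name, model in models.items():
            adapted = finetune(model, task.train_data)   # same data for every model
            scores[name] = evaluate(adapted, task.test_data)
        rankings[task.name] = sorted(scores, key=scores.get, reverse=True)
    return rankings
```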
Trustworthy AI Systems
Beyond evaluation, I work on making AI systems responsible and trustworthy:
- Dataset bias — Identifying and mitigating selection bias and demographic bias in NLP datasets
- Fairness without retraining — Reprogramming models to be fair by learning input perturbations (sketched after this list)
- Robust training — Fast adversarial training through bi-level optimization
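As a rough sketch of the fairness-reprogramming idea mentioned above, the snippet below keeps a trained PyTorch model frozen and learns only an additive input perturbation that trades off task loss against a fairness penalty. The model, data loader, and `fairness_penalty()` are hypothetical placeholders, not the actual method or code from the paper.

```python
# Sketch: learn an input perturbation `delta` with the model weights frozen,
# so fairness is improved without retraining the model itself.
import torch

def learn_fair_perturbation(model, loader, fairness_penalty, input_dim,
                            steps=1000, lr=1e-2, lam=1.0):
    for p in model.parameters():              # freeze the original model
        p.requires_grad_(False)
    delta = torch.zeros(input_dim, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    task_loss = torch.nn.CrossEntropyLoss()

    it = iter(loader)
    for _ in range(steps):
        try:
            x, y, group = next(it)             # `group` holds sensitive attributes
        except StopIteration:
            it = iter(loader)
            x, y, group = next(it)
        logits = model(x + delta)              # perturbed inputs, frozen weights
        loss = task_loss(logits, y) + lam * fairness_penalty(logits, y, group)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```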
Get in Touch
📧 guanhua.zhang [AT] tuebingen.mpg.de
💻 GitHub · LinkedIn
Interested in collaboration? Feel free to reach out!