The full list can be found on Google Scholar.
Train-before-Test Harmonizes Language Model Rankings
Authors: Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt
arXiv Preprint: 2507.05195
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture
similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds
confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon
of training on the test task: As released, different models exhibit different levels of preparation for any given
test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific
finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across
24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently
across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree
of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to
another. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect
agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new
insights into the latent factors of benchmark performance. Our work supports the recommendation to make
train-before-test a default component of LLM benchmarking.
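
As a concrete illustration of the protocol, the following sketch applies the same benchmark-specific finetuning to every model before evaluation and measures ranking agreement between two benchmarks with Kendall's tau. This is a minimal sketch under stated assumptions, not the authors' code: finetune_fn and evaluate_fn are caller-supplied placeholders for benchmark-specific finetuning and standard evaluation.

    # Minimal sketch of train-before-test (not the authors' implementation).
    # finetune_fn(model, data) and evaluate_fn(model, data) are assumed placeholders.
    from scipy.stats import kendalltau

    def ranking_agreement(models, bench_a, bench_b, finetune_fn, evaluate_fn,
                          train_before_test=True):
        scores_a, scores_b = [], []
        for model in models:
            if train_before_test:
                # Give every model the same benchmark-specific finetuning before evaluation.
                model_a = finetune_fn(model, bench_a["train"])
                model_b = finetune_fn(model, bench_b["train"])
            else:
                model_a = model_b = model
            scores_a.append(evaluate_fn(model_a, bench_a["test"]))
            scores_b.append(evaluate_fn(model_b, bench_b["test"]))
        tau, _ = kendalltau(scores_a, scores_b)  # +1: identical rankings, -1: reversed
        return tau
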
How Benchmark Prediction from Fewer Data Misses the Mark
Authors: Guanhua Zhang, Florian E Dorner, Moritz Hardt
NeurIPS 2025
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation
by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small
subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we
systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks.
First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to
predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful
subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially
depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of
benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this
setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To
improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This
method consistently outperforms the random sample average even for extrapolation. However, its performance still
relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when
it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
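
The random-sample baseline lends itself to a short sketch. The version below is one plausible instantiation, not the paper's implementation: previously evaluated models provide a (models x examples) score matrix, a regression maps subset scores to full-benchmark accuracy, and a new model is then scored only on the sampled subset.

    # One plausible instantiation of the random-sample + regression baseline (illustrative only).
    import numpy as np
    from sklearn.linear_model import Ridge

    def sample_subset(n_examples, subset_size, seed=0):
        return np.random.default_rng(seed).choice(n_examples, size=subset_size, replace=False)

    def predict_full_score(score_matrix, subset_idx, new_model_subset_scores):
        # score_matrix: per-example correctness of previously evaluated models (models x examples).
        X = score_matrix[:, subset_idx]      # known models' scores on the sampled subset
        y = score_matrix.mean(axis=1)        # known models' full-benchmark accuracies
        reg = Ridge(alpha=1.0).fit(X, y)
        # The new model is evaluated only on the subset; its full accuracy is predicted.
        return float(reg.predict(np.asarray(new_model_subset_scores).reshape(1, -1))[0])

The extrapolation failure described above shows up here directly: when the new model's subset scores lie outside the range spanned by the previously seen models, the regression has little to anchor on.
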
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Authors: Guanhua Zhang, Moritz Hardt
ICML 2024
We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy
between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a
distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model
ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks
to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of
irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and
sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative
measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes
to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We
develop efficient approximation algorithms for both measures, as exact computation is prohibitively expensive.
Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear
trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial
changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under
irrelevant changes.
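
The following sketch gives illustrative proxies in the spirit of these measures, not the paper's exact definitions: diversity as the mean pairwise ranking disagreement across tasks, and, echoing the irrelevant-model issue raised by the Arrow analogy, a sensitivity proxy that checks how much an ordinal (Borda-style) aggregate ranking of the remaining models changes when the lowest-ranked model is dropped.

    # Illustrative proxies (not the paper's exact measures); scores is a (models x tasks) array.
    import numpy as np
    from scipy.stats import kendalltau, rankdata

    def diversity(scores):
        ranks = [rankdata(-scores[:, t]) for t in range(scores.shape[1])]
        disagreements = [(1 - kendalltau(ranks[i], ranks[j])[0]) / 2
                         for i in range(len(ranks)) for j in range(i + 1, len(ranks))]
        return float(np.mean(disagreements))  # 0: tasks agree perfectly, 1: fully reversed

    def borda_ranking(scores):
        # Ordinal aggregation: sum of per-task ranks (lower is better), then re-rank.
        per_task = np.vstack([rankdata(-scores[:, t]) for t in range(scores.shape[1])]).T
        return rankdata(per_task.sum(axis=1))

    def sensitivity_to_dropping_worst(scores):
        full = borda_ranking(scores)
        worst = int(np.argmax(full))                       # lowest-ranked model
        reduced = borda_ranking(np.delete(scores, worst, axis=0))
        kept = rankdata(np.delete(full, worst))
        return (1 - kendalltau(kept, reduced)[0]) / 2      # 0: remaining ranking unchanged
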
Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models
Authors: Guanhua Zhang, Jiabao Ji, Yang Zhang, Mo Yu, Tommi Jaakkola, Shiyu Chang
ICML 2023
Image inpainting refers to the task of generating a complete, natural image based on a partially revealed reference
image. Recently, much research interest has focused on addressing this problem using fixed diffusion models.
These approaches typically replace the revealed region of the intermediate or final generated images directly with
that of the reference image or its variants. However, since the unrevealed regions are not directly modified to
match the context, this results in incoherence between the revealed and unrevealed regions. To address the
incoherence problem, a small number of methods introduce a rigorous Bayesian framework, but they tend to produce
mismatches between the generated and the reference images due to approximation errors in computing the posterior
distributions. In this paper, we propose COPAINT, which can coherently inpaint the whole image without introducing
mismatches. COPAINT also uses the Bayesian framework to jointly modify both revealed and unrevealed regions, but
approximates the posterior distribution in a way that allows the errors to gradually drop to zero throughout the
denoising steps, thus strongly penalizing any mismatches with the reference image. Our experiments verify that
COPAINT can outperform the existing diffusion-based methods under both objective and subjective metrics.
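
The schematic PyTorch sketch below conveys the underlying idea in simplified form: instead of hard-replacing the revealed pixels at each DDIM step, the intermediate latent is nudged by the gradient of a mismatch penalty on the revealed region, so the unrevealed region also adapts to the context. It is not COPAINT's exact update rule; eps_model is an assumed noise-prediction network with interface eps_model(latent, timestep), and mask is 1 on revealed pixels.

    # Schematic illustration only, not COPAINT's exact algorithm.
    import torch

    @torch.no_grad()
    def guided_ddim_inpaint(eps_model, x_ref, mask, alphas_cumprod, timesteps, guidance_lr=0.1):
        x_t = torch.randn_like(x_ref)
        for i, t in enumerate(timesteps):                  # timesteps in decreasing order
            a_t = alphas_cumprod[t]
            a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
            with torch.enable_grad():
                x_t = x_t.detach().requires_grad_(True)
                eps = eps_model(x_t, t)
                x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
                # Penalize mismatch with the reference on the revealed region and push the
                # whole latent (revealed and unrevealed parts) toward coherence.
                loss = ((mask * (x0_pred - x_ref)) ** 2).sum()
                grad = torch.autograd.grad(loss, x_t)[0]
            x_t = x_t - guidance_lr * grad
            eps = eps_model(x_t, t)                        # extra model call keeps the sketch simple
            x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps   # deterministic DDIM step
        return x_t
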
Fairness Reprogramming
Authors: Guanhua Zhang, Yihua Zhang, Yang Zhang, Wenqi Fan, Qing Li, Sijia Liu, Shiyu Chang
NeurIPS 2022
Despite a surge of recent advances in promoting machine learning (ML) fairness, most existing mainstream approaches
require retraining or finetuning all of the weights of the neural network to meet the fairness criteria.
However, this is often infeasible in practice for large-scale trained models due to their large computational and
storage costs, low data efficiency, and model privacy issues. In this paper, we propose a new generic fairness
learning paradigm, called FairReprogram, which incorporates the model reprogramming technique. Specifically,
FairReprogram considers the case where models cannot be changed and appends to the input a set of perturbations,
called the fairness trigger, which is tuned towards the fairness criteria under a min-max formulation. We further
introduce an information-theoretic framework that explains why and under what conditions fairness goals can be
achieved using the fairness trigger. We show both theoretically and empirically that the fairness trigger can
effectively obscure demographic biases in the output prediction of fixed ML models by providing false demographic
information that hinders the model from utilizing the correct demographic information to make the prediction.
Extensive experiments on both NLP and CV datasets demonstrate that our method can achieve better fairness
improvements than retraining-based methods with far less data dependency under two widely-used fairness criteria.
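
A minimal PyTorch sketch of the idea follows: the backbone stays frozen, and only a small trigger (here, a few embedding vectors prepended to the input) plus a demographic adversary are trained in a min-max fashion. The module interfaces (a frozen model taking input embeddings, an adversary reading the task outputs) are assumptions made for illustration, not the paper's exact architecture.

    # Minimal sketch of tuning a fairness trigger under a min-max objective (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FairnessTrigger(nn.Module):
        def __init__(self, trigger_len, embed_dim):
            super().__init__()
            self.trigger = nn.Parameter(torch.randn(trigger_len, embed_dim) * 0.02)

        def forward(self, input_embeds):                     # (batch, seq, dim)
            trig = self.trigger.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
            return torch.cat([trig, input_embeds], dim=1)    # prepend the trigger

    def fairness_step(frozen_model, trigger, adversary, opt_trig, opt_adv,
                      input_embeds, labels, groups, lam=1.0):
        logits = frozen_model(trigger(input_embeds))         # frozen backbone, trigger prepended
        task_loss = F.cross_entropy(logits, labels)

        # Max step: the adversary learns to recover the demographic group from the outputs.
        adv_loss = F.cross_entropy(adversary(logits.detach()), groups)
        opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

        # Min step: the trigger keeps task accuracy while hiding demographic information.
        trigger_loss = task_loss - lam * F.cross_entropy(adversary(logits), groups)
        opt_trig.zero_grad(); trigger_loss.backward(); opt_trig.step()
        return task_loss.item(), adv_loss.item()
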
Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization
Authors: Yihua Zhang, Guanhua Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, Sijia Liu
ICML 2022
Adversarial training (AT) is a widely recognized defense mechanism for improving the robustness of deep neural networks
against adversarial attacks. It is built on min-max optimization (MMO), where the minimizer (i.e., defender) seeks a
robust model to minimize the worst-case training loss in the presence of adversarial examples crafted by the
maximizer (i.e., attacker). However, the conventional MMO method makes AT hard to scale. Thus, Fast-AT (Wong et al.,
2020) and other recent algorithms attempt to simplify MMO by replacing its maximization step with a single
gradient sign-based attack generation step. Although easy to implement, Fast-AT lacks theoretical guarantees, and
its empirical performance is unsatisfactory due to the issue of robust catastrophic overfitting when training with
strong adversaries. In this paper, we advance Fast-AT from the fresh perspective of bi-level optimization (BLO). We
first show that the commonly-used Fast-AT is equivalent to using a stochastic gradient algorithm to solve a
linearized BLO problem involving a sign operation. However, the discrete nature of the sign operation makes it
difficult to understand the algorithm performance. Inspired by BLO, we design and analyze a new set of robust
training algorithms termed Fast Bi-level AT (Fast-BAT), which effectively defends against sign-based projected gradient
descent (PGD) attacks without using any gradient sign method or explicit robust regularization. In practice, we show
our method yields substantial robustness improvements over baselines across multiple models and datasets.
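
For reference, the sketch below shows the Fast-AT baseline described above: a single gradient-sign attack step from a random start inside the epsilon ball, followed by a standard training step on the resulting adversarial example. Fast-BAT itself replaces this sign-based inner step with an update derived from the linearized bi-level problem and is not reproduced here.

    # Sketch of the Fast-AT baseline (single sign-based attack step); illustrative only.
    import torch
    import torch.nn.functional as F

    def fast_at_step(model, optimizer, x, y, eps=8 / 255, alpha=10 / 255):
        # Inner "maximization": one gradient-sign step from a random start in the eps-ball.
        delta = ((torch.rand_like(x) * 2 - 1) * eps).requires_grad_(True)
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = torch.clamp(delta + alpha * grad.sign(), -eps, eps).detach()

        # Outer minimization: standard training step on the adversarial example.
        optimizer.zero_grad()
        adv_loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        adv_loss.backward()
        optimizer.step()
        return adv_loss.item()
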
Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting
Authors: Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Conghui Zhu, Tiejun Zhao
ACL 2020
With the recent proliferation of text classification applications, researchers have found that there are certain
unintended biases in text classification datasets. For example, texts containing some demographic identity-terms
(e.g., "gay", "black") are more likely to be abusive in existing abusive language detection datasets. As a result,
models trained with these datasets may consider sentences like "She makes me happy to be gay" as abusive simply
because of the word "gay." In this paper, we formalize the unintended biases in text classification datasets as a
kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this
formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination
distribution using instance weighting, which does not require any extra resources or annotations apart from a
pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the
impacts of the unintended biases without significantly hurting models' generalization ability.
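
The following sketch illustrates the instance-weighting idea in a simplified form, not the paper's exact estimator: each training example is weighted so that, after reweighting, the label distribution becomes independent of whether the text mentions a pre-defined demographic identity-term. The term list here is a made-up example.

    # Simplified illustration of debiasing by instance weighting (not the paper's estimator).
    from collections import Counter

    IDENTITY_TERMS = {"gay", "black", "muslim", "female"}    # example pre-defined term list

    def instance_weights(texts, labels):
        z = [int(any(term in text.lower().split() for term in IDENTITY_TERMS)) for text in texts]
        n = len(labels)
        p_y = Counter(labels)                                # label counts overall
        p_z = Counter(z)                                     # identity-term indicator counts
        p_yz = Counter(zip(labels, z))                       # joint counts
        # w = P(y) * P(z) / P(y, z): up-weights combinations that are under-represented relative
        # to independence (e.g., non-abusive texts that mention identity terms), so the
        # reweighted data approximates a distribution where label and identity-term are independent.
        return [(p_y[y_i] / n) * (p_z[z_i] / n) / (p_yz[(y_i, z_i)] / n)
                for y_i, z_i in zip(labels, z)]

The resulting weights can then be passed as sample weights to any standard classifier, which is what makes the approach model-agnostic.
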
Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets
Authors: Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Shiyu Chang, Mo Yu, Conghui Zhu, Tiejun Zhao
ACL 2019 (Oral)
Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and
rich public datasets have contributed a lot to this progress. However, biased datasets can also hurt the
generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the
providers select certain pairs of sentences into the datasets, and this sampling procedure can easily introduce
unintended patterns, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naive
features are unreasonably predictive. Such features reflect the selection bias and are termed "leakage features."
In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four of them are
significantly biased. We further propose a training and evaluation framework to alleviate the
bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of
trained models and give more trustworthy evaluation results for real-world adoption.
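
A small probe in the spirit of the leakage features discussed above can make such bias visible; the sketch below is illustrative and does not reproduce the paper's feature set. It builds content-independent features from how often each sentence reappears across pairs and checks whether a classifier that never sees the sentence content still beats chance.

    # Illustrative selection-bias probe using content-independent "leakage" features.
    from collections import Counter
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def leakage_probe(pairs, labels):
        # pairs: list of (sentence_a, sentence_b); labels: e.g., duplicate / not duplicate.
        freq = Counter(s for pair in pairs for s in pair)    # how often each sentence reappears
        X = [[freq[a], freq[b], abs(freq[a] - freq[b])] for a, b in pairs]
        # Mean cross-validated accuracy of a classifier that never sees the sentence content;
        # accuracy well above the majority-class rate signals selection bias.
        return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
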