Simpler Is Better For Autograders: Achieving Cost-Effective LLM Evaluations for Open-Ended Tasks
The increasing capabilities of large language models (LLMs) have driven a need for rigorous, scalable evaluation frameworks. One of the primary bottlenecks in meeting this demand is the cost of human grading of model outputs: Expert human graders are the gold standard for quality assessment, but their effort is expensive and time-consuming. Automated methods—ranging from traditional natural language processing metrics to simpler string-matching or regular-expression techniques—offer lower-cost alternatives but often fail to capture semantic nuance and can be brittle in the face of variations in formatting or phrasing. The common pairwise setting, in which an LLM chooses the better of two responses, has been well studied in the literature on using LLMs as judges. Pairwise grading, however, has limited utility in open-ended domains in which a pair of responses is not available or in which a more nuanced scoring scale is required to understand differences in response quality. In this report, the authors focus on pointwise scoring for more-flexible, reference-free evaluation tasks, referring to these pointwise LLM graders as autograders. The report presents an empirical comparison of five approaches to such tasks: the single-rubric method, metaprompting, the list-of-items method, criteria decomposition, and declarative self-improving Python (DSPy) prompt optimization. These methods are tested across four expert-graded benchmarks and five LLMs.
@techreport{dev_simpler_2026,
  title = {Simpler Is Better for Autograders: Achieving Cost-Effective LLM Evaluations for Open-Ended Tasks},
  author = {Dev, Sunishchal and Paskov, Patricia and Sloan, Andrew and Wei, Kevin and {Nascimento de Lima}, Pedro and Chowdhury, Swaptik and Johnson, Jason and Marcellino, William},
  year = {2026},
  institution = {RAND Corporation},
  url = {https://www.rand.org/pubs/research_reports/RRA4618-1.html},
  doi = {10.7249/RRA4618-1}
}