Evaluation (commonly referred to as an “eval”) measures an AI model's performance on a specific set of benchmark tasks. Researchers use evals to assess a model's strengths and weaknesses by comparing its answers to the reference answers for each task, computing accuracy, and identifying areas for improvement. However, researchers should heed Goodhart’s Law when optimizing performance on evals: when a measure becomes a target, it ceases to be a good measure. A model may excel at the tasks in a benchmark dataset yet fail to generalize across the broader domain, and capabilities may go unnoticed if the dataset omits the examples that would reveal them. For this reason, individuals and organizations should curate diverse evaluation datasets rather than rely on any single benchmark. As the old adage goes, “what gets measured gets managed.”
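To make the mechanics concrete, the sketch below shows the basic loop an eval performs: pose each task to the model, compare its answer to the reference answer, and report accuracy. The toy dataset, the model_answer() stand-in, and the exact-match scoring rule are illustrative assumptions, not the method of any particular benchmark.

```python
# A minimal sketch of scoring a model on an eval dataset.
# The dataset, model_answer(), and exact-match scoring are
# hypothetical placeholders for illustration only.

eval_dataset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 7 * 8?", "answer": "56"},
]

def model_answer(question: str) -> str:
    # Stand-in for a call to the model under evaluation;
    # a real eval would query the model here.
    canned = {
        "What is the capital of France?": "Paris",
        "What is 7 * 8?": "54",  # deliberately wrong, to show a miss
    }
    return canned.get(question, "")

def evaluate(dataset: list) -> float:
    # Compare each model answer to the reference answer (exact match,
    # ignoring case and surrounding whitespace) and return accuracy.
    correct = sum(
        model_answer(item["question"]).strip().lower()
        == item["answer"].strip().lower()
        for item in dataset
    )
    return correct / len(dataset)

print(f"Accuracy: {evaluate(eval_dataset):.0%}")  # -> Accuracy: 50%
```

Note that exact-match scoring is only one of many possible grading rules; the choice of rule is itself a measurement decision, which is one reason Goodhart’s Law applies to evals.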
Last Updated: February 28, 2025
Research Assistant: Amisha Rastogi
Contributor: Tej Shah
Reviewer: To Be Determined
Editor: Georgina Curto Rex
Subject: Technology
Recommended Citation: "Evaluation, Edition 3.0 Review." In AI & Human Rights Index, edited by Nathan C. Walker, Dirk Brand, Caitlin Corrigan, Georgina Curto Rex, Alexander Kriebitz, John Maldonado, Kanshukan Rajaratnam, and Tanya de Villiers-Botha. New York: All Tech is Human; Camden, NJ: AI Ethics Lab at Rutgers University, 2025. Accessed April 21, 2025. https://aiethicslab.rutgers.edu/glossary/evaluation/.