An evaluation (commonly referred to as an “eval”) measures an AI model's performance on a specific set of benchmark tasks. Researchers use evals to assess a model's strengths and weaknesses: they compare the model's answers to the correct answers for each task, compute the accuracy, and identify areas for improvement. However, researchers should heed Goodhart’s Law when optimizing performance on evals: when a measure becomes a target, it ceases to be a good measure. A model may excel at the tasks in the dataset yet fail to generalize across the broader domain, and capabilities not represented in the dataset may go unnoticed entirely. To assess strengths and weaknesses reliably, individuals and organizations should curate diverse evaluation datasets. As the old adage goes, “what gets measured gets managed.”
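The compare-and-score loop described above can be sketched in a few lines. This is a minimal illustration, not a production harness: `run_eval`, `toy_model`, and the tiny benchmark are all hypothetical names invented for this example.

```python
def run_eval(model_fn, dataset):
    """Score a model on (prompt, expected_answer) pairs and return accuracy."""
    correct = 0
    for prompt, expected in dataset:
        answer = model_fn(prompt)
        # Normalize whitespace and case before comparing to the gold answer.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(dataset)

# Toy "model" and benchmark, for illustration only.
toy_model = lambda q: {"2+2?": "4", "capital of France?": "Paris"}.get(q, "")
benchmark = [("2+2?", "4"), ("capital of France?", "paris"), ("3*3?", "9")]

print(run_eval(toy_model, benchmark))  # the toy model misses "3*3?", so 2/3
```

Per Goodhart's Law, a perfect score on a benchmark like this only shows the model handles these three prompts, not arithmetic or geography in general.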