Law Professors Prefer AI Over Peer Answers
Abstract
Large language models (LLMs) are increasingly promoted as educational tutors,
yet most evaluations focus on domains with a single ground truth. Many disciplines,
however, hinge on judgment: reasoning, weighing ambiguity, and reaching defensible
conclusions. Law provides a sharp test. We conducted a blinded evaluation of
short-answer tutoring in contracts courses with sixteen U.S. law professors. Participants
created 40 representative questions, wrote answers, and judged 2,918 anonymized
comparisons between human and LLM responses. Professors rated LLM answers far higher
than their peers' answers (average win rate = 75.33%), with models performing similarly to the
best instructor. LLM responses were also rarely flagged as harmful (3.53%, vs 12.06%
for professors). Preferences for LLM answers were consistent across evaluators and
reflected shared professional standards. Our evaluation can be reliably extended to
additional models by employing a separate LLM as a judge, making expert agreement
an effective, scalable basis for evaluating AI tutors in judgment-rich domains.
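
As a rough illustration of the win-rate arithmetic behind the headline figure, the sketch below computes how often an LLM answer is preferred in blinded pairwise comparisons. It is a minimal sketch under stated assumptions, not the study's analysis code: the record layout, the evaluator names, the toy data, and the choice to average per evaluator are all hypothetical, and the handling of ties or skipped comparisons is not specified here.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the study:
# (evaluator_id, winner), where winner is "llm" or "human".
comparisons = [
    ("prof_a", "llm"), ("prof_a", "llm"), ("prof_a", "human"), ("prof_a", "llm"),
    ("prof_b", "llm"), ("prof_b", "human"),
]


def win_rate(records):
    """Fraction of comparisons in which the LLM answer was preferred."""
    records = list(records)
    if not records:
        raise ValueError("no comparisons to score")
    wins = sum(1 for _, winner in records if winner == "llm")
    return wins / len(records)


def per_evaluator_win_rates(records):
    """Group comparisons by evaluator and compute one win rate per evaluator."""
    grouped = defaultdict(list)
    for evaluator, winner in records:
        grouped[evaluator].append((evaluator, winner))
    return {evaluator: win_rate(rows) for evaluator, rows in grouped.items()}


# Pooled win rate over all comparisons.
overall = win_rate(comparisons)

# Assumption: "average" here means the unweighted mean of per-evaluator rates.
per_eval = per_evaluator_win_rates(comparisons)
average = sum(per_eval.values()) / len(per_eval)

print(f"pooled win rate: {overall:.2%}")
print(f"mean per-evaluator win rate: {average:.2%}")
```

The pooled and per-evaluator averages can differ when evaluators judge different numbers of comparisons, so which aggregation is reported matters when comparing figures across studies.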