A Reasoning-Focused Legal Retrieval Benchmark

Abstract

As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs (“RAG” systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

Details

Author(s):
Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson & Daniel E. Ho
Publish Date:
March 25, 2025
Publication Title:
CSLaw'25: Proceedings of the 2025 Symposium on Computer Science and Law
Publisher:
Association for Computing Machinery
Format:
Journal Article
Page(s):
169-193
Citation(s):
  • Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson & Daniel E. Ho, A Reasoning-Focused Legal Retrieval Benchmark, CSLAW ’25: Proc. 2025 Symp. on Comput. Sci. & L. 169 (2025).