Learning Responsible Data Filtering From Law

Stanford Law students, working with colleagues at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and its associate director Daniel Ho, William Benjamin Scott and Luna M. Scott Professor of Law and director of Stanford's Regulation, Evaluation, and Governance Lab (RegLab), have been studying the training data behind foundation models and how law and legal data can inform data filtering practices. The RegLab team recently assembled Pile of Law, a vast dataset of court and administrative opinions, legal code, casebooks, and other legal documents. Based on the team's initial experiments, Pile of Law can help researchers ensure their training data meets minimum legal standards and develop context-appropriate privacy filters, while also revealing problems with commonplace filtering standards.
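For readers who want to explore the corpus themselves, the sketch below shows one way to load a single subset with the Hugging Face `datasets` library. The repository id `pile-of-law/pile-of-law`, the subset name `r_legaladvice`, and the `text` field reflect the public release at the time of writing and may change.

```python
# A minimal sketch, assuming Pile of Law remains available on the
# Hugging Face Hub under "pile-of-law/pile-of-law" with a subset
# named "r_legaladvice" and a "text" field per document.
from datasets import load_dataset

# Stream a single subset so the full (very large) corpus is not
# downloaded up front. Recent versions of `datasets` may also require
# trust_remote_code=True for script-based datasets like this one.
dataset = load_dataset(
    "pile-of-law/pile-of-law",
    "r_legaladvice",
    split="train",
    streaming=True,
)

# Inspect the beginning of the first document.
first = next(iter(dataset))
print(first["text"][:500])
```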

“We don’t want models to associate people with their private content or with harmful characteristics,” says Peter Henderson, JD/PhD ’23. The team looked to the law and its long history of courts setting standards for information disclosure. “We wondered, why not import those standards into the machine learning environment?” says Mark Krass, JD ’21 (PhD ’23). The work is forthcoming at the NeurIPS conference (https://arxiv.org/abs/2207.00220).
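The paper itself derives filtering rules from legal practice; as a purely illustrative sketch (not the authors' method), the snippet below conveys the general shape of a context-appropriate privacy filter: pseudonymizing person names in one document category while leaving them intact in another where disclosure is standard practice, as with judge names in published opinions. The context labels and the naive name pattern are hypothetical stand-ins.

```python
import re

# Illustrative only: a context-aware pseudonymizer, not the filtering
# rules from the Pile of Law paper. The idea is that whether a name is
# "private" depends on document context, mirroring how courts apply
# different disclosure standards in different settings.
PSEUDONYMIZE_CONTEXTS = {
    "reddit_post": True,        # names in advice posts are treated as private
    "published_opinion": False, # judge and party names are routinely disclosed
}

# A naive capitalized-bigram pattern standing in for a real NER model.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")

def filter_text(text: str, context: str) -> str:
    """Replace likely person names with initials, but only in contexts
    where the names are treated as private; default to filtering."""
    if not PSEUDONYMIZE_CONTEXTS.get(context, True):
        return text
    return NAME_PATTERN.sub(
        lambda m: f"{m.group(1)[0]}. {m.group(2)[0]}.", text
    )

print(filter_text("Jane Doe asked about her landlord.", "reddit_post"))
# -> "J. D. asked about her landlord."
print(filter_text("Judge Jane Doe wrote the opinion.", "published_opinion"))
# -> unchanged
```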