Professor-Student Collaboration at Stanford Law School Results in the Largest-Ever Public Dataset of Corporate Contracts
Stanford Law School Professor Julian Nyarko calls contracts “the invisible infrastructure of the economy.” But for all their importance, they’ve remained remarkably difficult to access and study in the aggregate.
A new project out of Stanford Law School is changing that.

Nyarko, along with Stanford Law student Peter Adelson, JD/MBA ’25 (BS/MS ’17), recently unveiled the Material Contracts Corpus (MCC), a first-of-its-kind, publicly accessible dataset containing more than 1 million contracts filed by public companies with the U.S. Securities and Exchange Commission between 2000 and 2023. The MCC transforms decades of company filings into a richly annotated, machine-readable dataset, making systematic empirical analysis of contract language both possible and easily accessible for the first time. And unlike proprietary tools that offer limited access to contract data, the MCC is fully open and free to use.
A Treasure Trove for Scholars, Practitioners, and Policymakers
Nyarko has been developing the MCC on and off since 2016, initially relying on his self-taught coding skills to extract and organize contract data from the SEC’s EDGAR system. Recent advances in AI, coupled with Adelson’s computer science experience, allowed the project to gain rapid momentum over the last year.
While agreements filed by public companies are technically available to anyone, they’re often buried within exhibits, inconsistently labeled, and formatted in ways that make systematic analysis difficult. The MCC addresses these challenges, offering a clean, searchable interface through which users can explore decades of contracting practices across sectors, transaction types, and jurisdictions. Agreement types are standardized, party names are normalized, and metadata is tagged for precise retrieval.
“By opening access to over one million high-stakes agreements, we hope the MCC gives scholars, practitioners, and policymakers a way to study how the legal foundations of commerce evolve in real time,” said Nyarko, who is also an associate director and senior fellow at the Stanford Institute for Human-Centered AI (HAI). “This dataset empowers empirical research in law, economics, and corporate governance, and offers a testbed for next-generation AI legal tools. It also enables better transparency in corporate behavior and has the potential to inform courtroom litigation, securities regulation, and policymaking.”
The project reflects a growing movement in legal academia to integrate computational tools into doctrinal and empirical research. Nyarko, who focuses much of his scholarship on computational methods for contract law as well as algorithmic fairness, has authored a number of first-ever studies involving AI, including a recent paper proposing new methods to mitigate bias in Large Language Models. “Breaking Down Bias: On The Limits of Generalizable Pruning Strategies” found that racial and other biases exhibited by LLMs can be “pruned” away, but because the biases are highly context-specific, there are limits to holding AI model developers (like OpenAI or Google Vision) liable for harmful behavior, given that those companies won’t be able to come up with a one-size-fits-all solution.

Adelson, who was a student in Nyarko’s Contracts class three years ago, said the MCC project was so rewarding and interesting that it inspired him to pursue a career in academia. He will begin a Ph.D. at the Stanford Graduate School of Business in the fall. “The project was a great way to fuse my interests in law and technology,” said Adelson, who obtained the contracts and spearheaded the creation and labeling of the dataset.
Designed with Legal Professionals and Researchers in Mind
The MCC is intended to serve three key constituencies, Nyarko said: Practicing attorneys, particularly those in transactional or in-house roles, can use the corpus to benchmark language, identify market standards, and surface historical examples of key provisions. Empirical legal scholars can gain access to a rigorously organized dataset suitable for longitudinal studies of contract structure, negotiation trends, or the diffusion of boilerplate language. And computer scientists and legal technologists have a new training ground for developing AI applications tailored to legal language.
“I’m very excited to see how people use the MCC,” Adelson said. “I’ve been using it in some ongoing research projects, and it is a very rich resource.”
About Stanford Law School
Stanford Law School is one of the nation’s leading institutions for legal scholarship and education. Its alumni are among the most influential decision makers in law, politics, business, and high technology. Faculty members argue before the Supreme Court, testify before Congress, produce outstanding legal scholarship and empirical analysis, and contribute regularly to the nation’s press as legal and policy experts. Stanford Law School has established a model for legal education that provides rigorous interdisciplinary training, hands-on experience, global perspective and a focus on public service.