Could Document Assembly and Analytics offer a Beachhead for Computational Law in 2021?

This is the first of a series of blog posts from Meng Weng Wong, CodeX Fellow since 2017.

In 2020, a child of CodeX quietly sprouted in the basement of the Singapore Management University’s School of Law. Its five-year mission: to develop a domain-specific language for contracts and laws suitable for real-world adoption. This post surveys related initiatives, articulates how this new language – dubbed “L4” – differs from others, and sketches a possible strategy for deployment by industry.
The idea of computable contracts goes back at least to 1996, when Ian Grigg proposed the “Ricardian Contract”: where the data elements (names, dollar amounts, dates) of a contract float apart from its textual template into the digital realm to be ingested by computers. That same year, Nick Szabo illustrated the moving parts of such a digital “smart contract” – a term that predates by over a decade Ethereum’s billion-dollar ecosystem of tokens and currencies. This is in principle like the ISDA agreements long used in high finance; the difference is that digital contracts could be democratized for use across more genres and by all contracting counterparties, not just derivatives traders at a bank.

A quarter-century has passed since then. What has happened, and what does the landscape look like today?

Computer scientists have developed formal languages to represent the semantics of contracts and laws. Typically documented in Ph.D. theses dense with mathematical notation, these languages attempt to decompose legal writings into their underlying logics – first-order, higher-order, temporal, modal, deontic – and interpret them in terms of formalisms familiar to CS theoreticians: timed automata, Petri nets, the event calculus, the simply-typed lambda calculus, state transition systems subject to Hoare logic. Work in this tradition includes
• Peyton-Jones & Eber (2001);
• CL “Contract Language” (Schneider et al at Oslo 2007);
• CSL “Contract Specification Language” (Henglein and Hvitved at Copenhagen 2012);
• LPS “Logic Production Systems” (Kowalski & Sadri at Imperial, 2017);
• FCL “Formal Contract Language” (Farmer and Hu at McMaster 2018);
• Symboleo (Daniel Amyot et al, at uOttawa 2019); and
• Catala (Merigoux at Inria, 2020).

Some of these initiatives have evolved into tech startups. Eber’s paper on Composing Financial Contracts formed the basis for the FinTech company Lexifi. The Copenhagen work and LPS have each found commercial homes as well. At Stanford, the Worksheets project evolved into Symbium.

During the ICO frenzy of 2017, some blockchain entrepreneurs seized upon smart contract languages as the basis for the Next Big Thing. One such project, Tezos, raised US$232 million worth of BTC and ETH, partly on the promise that its language, Michelson, would be amenable to formal verification – hence fewer bugs and better security for all those valuable digital assets. Other initiatives with equally futuristic names appeared: Kadena, Adjoint, Corda, Hyperledger, Clause, Agrello, Rootstock, Symbiont. Today, such systems are used to trade and transfer dazzling sums in the cryptocurrency economy – but by and large, they have not displaced real-world contracts in old-fashioned commerce: leases, supply chain, employment contracts, sales and purchases.
Real World, Real Problems
To the lawyer in a law firm, to the manager in an SME responsible for legal work, these innovations sound like science fiction. Practitioners in the trenches, responsible for negotiating and signing and implementing contracts every day, have prosaic needs. Just as the birth of a child and the miracle of life soon fade into a daily routine of sleep deprivation and dirty diapers, the daily challenges of contract drafting often take the form of muttered cursing at Microsoft Word’s paragraph styles and hours spent fiddling with cross-references and tables of contents. Every junior lawyer quickly develops a facility with “Track Changes” and learns to send documents in both markup and clean. Microsoft Word fails to cooperate. Superstitions abound.

Even after a contract is signed, the work is not done. Where negotiation ends, compliance begins. And the situation is dire: every day of the year, in enterprises large and small, executives discover contracts that were due for renewal months ago. Some may even have expired without either side realizing; business just goes on as usual! (An expired contract – but no one’s noticed! 2012) Backdating is rife; side letters and poorly drafted amendments are the norm. Nobody really knows what’s going on.

This state of affairs is all the more astonishing when you realize that the document assembly and contract lifecycle management industries have for decades been promising better contracts through technology. Players such as Contract Express (now Thomson Reuters HighQ), HotDocs, and Exari (acquired by Coupa) have been around since the 1990s; the latest wave of document automators includes Avvoka, Clarilis, Contract Mill, Legito, Precisely, Prose, SpringCM, Radiant Law / ConRad, Woodpecker, Checkbox, Templafy, BamLegal, Syke, Juro … the list goes on. They promise to bring order to the world of contract generation, often using the language of “low-code / no-code”. While they may have streamlined a particular niche – high-volume, pre-approved enterprise sales with limited degrees of freedom – they have not vanquished Word. Indeed, Y Combinator’s SAFE, perhaps the most widely used seed-stage investment instrument with hundreds, if not thousands, of instances signed every year, continues to be distributed as a set of downloadable .docx files.

In the 21st century, in an age of distributed systems and API protocols, it feels natural to conceive of a company as the sum of its contracts. Yet those contracts are often filed away as PDFs in dusty network drives, impossible to search efficiently. Organizations manage them not with databases but by old-fashioned search – of Google Drive (on a good day), or an email inbox (on a bad day). Often the search results turn up files with names like “Jones Contract 2016 FINAL 2nd FINAL signedbyboth (Amended).pdf”. Ultimately eyeballs are needed.

In the timeless words of the late-night TV infomercial: there’s got to be a better way!
Enter AI
Document automation produces contracts. Legal analytics consumes them.

Legal analytics startups promise to help enterprises and law firms read incoming contracts (typically PDFs and Word Docs), assess content for peculiarities, and extract the key elements – names, addresses, dollar amounts, and deadlines. Such startups have proliferated in recent years, on the strength of advances in machine learning and natural language processing. In 2018 LawGeex proudly announced the results of its “human vs robot” NDA review competition. Today the field is crowded with literally dozens of entrants: RAVN, ThoughtRiver, Luminance, Kira Systems, eBrevia, Leverton, and Seal to name just a few.

These startups boast about their ability to correctly extract key information from unstructured input. And when they say “unstructured input”, they mean it: PDFs that might well have emerged from a printer, journeyed through the physical world in an envelope on an airplane, and re-entered the digital world through a scanner. These startups have to contend with questions like: is that a page number or a paragraph number? If a name is hyphenated across two lines, how should we un-hyphenate it? Did the OCR software misread a word? Programmers accustomed to JSON for data and HTTPS for transport would be aghast. Ian Grigg might remark: in these situations, at least, the Ricardian contract clearly remains a dream.
Is this an Internet Protocol Problem?
Confession: I’m a Unix geek. I’ve been on the Internet since 1992. I installed Slackware on my 25MHz PC off a stack of 1.44M floppy disks. I’ve written RFCs for the IETF and seen my code adopted by Microsoft and Google. I know my way around SMTP – the email protocol – the way plumbers know toilets. The engineering term for plumbing is “infrastructure”. (And I’m conscious of the baggage I’m carrying: when all you have is a plunger, everything looks like a blocked toilet.)

That being said: to me, the contract management problem looks like an infrastructure problem, typical of the early days of a two-sided platform, where all parties are just beginning their journey of “digital transformation”. Standards for data transfer are not established; proprietary walled gardens abound; open-source has little foothold.

Maybe there are so many startups doing the same thing because, without a protocol for cooperation, the thing they are trying to do is unnecessarily hard. The attempt to extract meaning from unstructured input falls into the same category as GPT-3’s imperfect performances. I’m happy to read a poem GPT-3 wrote, less happy to sign a contract GPT-3 drafted. Apophenia can be a problem: AI sees faces in clouds.

If AI is not a good fit for the problem, is the solution more AI? I have heard from managing partners that nothing short of a junior associate spending hours on review is acceptable, because any AI system that reports accuracy in terms of percentages is intrinsically insufficient. “What if it misses something? Isn’t it malpractice?”

Perhaps a different solution could emerge from a fresh look at first principles. This is part of what A16Z originally meant by “software is eating the world”: not necessarily AI software, but a “born digital” way of looking at the world that gives people a fundamentally better way to do things.
End-to-End Metadata
From an information-systems point of view, today’s “contracting protocols” are tricky.

Jack Shepherd, who works at iManage RAVN, remarks: “The closest people get […] are when they use tools like Juro to embed the data schema behind the contract. The problem comes when stuff gets rendered into PDFs all the time and that gets lost”.

What does he mean by that?

On the sender side, document assembly software starts with data elements that (we hope) have been through basic data validation: phone numbers have the right number of digits, months don’t go to 13. These parameters fill contract templates, which are then flattened to Word or PDF. The output might as well have been manually drafted in the first place: black letters on white paper, with all the intelligence gone. (Remember, when Mail Merge was invented, the goal was a printout.)
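To make the loss concrete, here is a minimal Python sketch of the sender side. Everything in it is an assumption for illustration: the `LeaseTerms` fields, the template wording, and the `flatten` helper are hypothetical and not drawn from any real document assembly product.

```python
from dataclasses import dataclass
from datetime import date
from string import Template


@dataclass(frozen=True)
class LeaseTerms:
    """Hypothetical structured input to a document assembly tool."""
    landlord: str
    tenant: str
    monthly_rent_usd: int
    start: date  # the date type itself rejects a 13th month

    def __post_init__(self):
        # Basic validation of the kind described above.
        if self.monthly_rent_usd <= 0:
            raise ValueError("rent must be positive")
        if not self.landlord or not self.tenant:
            raise ValueError("party names must not be empty")


# Hypothetical template text, invented for this example.
TEMPLATE = Template(
    "This Lease is made on $start between $landlord (Landlord) and "
    "$tenant (Tenant). The Tenant shall pay rent of USD $rent per month."
)


def flatten(terms: LeaseTerms) -> str:
    """Fill the template. The result is plain text: the types are gone."""
    return TEMPLATE.substitute(
        start=terms.start.strftime("%d %B %Y"),
        landlord=terms.landlord,
        tenant=terms.tenant,
        rent=f"{terms.monthly_rent_usd:,}",
    )


terms = LeaseTerms("Acme Pte Ltd", "Jane Doe", 2500, date(2021, 3, 1))
text = flatten(terms)
print(text)
```

Once `flatten` has run, the rent is no longer an integer and the start date is no longer a date. A receiver holding only `text` has to guess which substring is which.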

On the receiver side, legal analytics AIs paw through unstructured input, trying to put Humpty Dumpty together again.

If the structured input didn’t get lost, where would it go?

It would become metadata. When someone says “photograph”, we don’t think “negative”; we think “JPEG”. When we watch TV, we expect to be able to turn on subtitles in half a dozen languages. Metadata is everywhere. Every JPEG on your phone is a swipe away from date, time, place, lens, flash, aperture. Just as JPEGs contain EXIF, PDFs can contain XMP. And XMP can contain, in turn, a data structure holding the original information, invisible to humans, but readable by machines, which breathe a sigh of relief at not having to guess about hyphenation and page numbers.
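As a rough sketch of what such a payload could look like, the Python below wraps a JSON data structure in an XMP-style RDF/XML envelope and reads it back. The `adobe:ns:meta/` and RDF namespaces are the ones XMP uses; the `contract` namespace URI and the `payload` property name are purely hypothetical, since no agreed schema exists yet. Attaching the packet to an actual PDF would need a PDF library and is not shown.

```python
import json
import xml.etree.ElementTree as ET

NS_X = "adobe:ns:meta/"
NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace for contract metadata; not an existing standard.
NS_CONTRACT = "http://example.org/ns/contract/1.0/"

ET.register_namespace("x", NS_X)
ET.register_namespace("rdf", NS_RDF)
ET.register_namespace("contract", NS_CONTRACT)


def to_xmp(data: dict) -> str:
    """Serialize contract data as JSON inside an XMP-style RDF description."""
    xmpmeta = ET.Element(f"{{{NS_X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS_RDF}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{NS_RDF}}}Description", {f"{{{NS_RDF}}}about": ""}
    )
    payload = ET.SubElement(desc, f"{{{NS_CONTRACT}}}payload")
    payload.text = json.dumps(data, sort_keys=True)
    return ET.tostring(xmpmeta, encoding="unicode")


def from_xmp(xmp: str) -> dict:
    """Recover the original data structure; no OCR or guessing involved."""
    root = ET.fromstring(xmp)
    payload = root.find(f".//{{{NS_CONTRACT}}}payload")
    if payload is None or payload.text is None:
        raise ValueError("no contract payload found")
    return json.loads(payload.text)


original = {
    "parties": ["Acme Pte Ltd", "Jane Doe"],
    "value_usd": 2500000,
    "expires": "2021-04-30",
}
packet = to_xmp(original)
recovered = from_xmp(packet)
print(recovered == original)
```

Storing the payload as a single JSON string is only one possible design; a fuller treatment might map each field to its own RDF property so that generic XMP tools can read it.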
Show Me the Money
Suppose we somehow convince a critical mass of contracting parties to add metadata to their PDFs. Now a directory full of contracts can be searched, not just by filename, but by content: “show me all the contracts with a value of $2 million or more, which are expiring in the next 6 months.” “Show me all the contracts with any counterparty based in New York.” “Show me all the contracts that deal with a certain class of goods.”

Typically, one runs these kinds of queries against a database. SQL was invented to support queries like these. In our case, meeting the user where they are means there is no explicit database, whether SQL or NoSQL; all you have is the contract PDFs themselves. I assume that PDFs are here to stay; only the largest enterprises will maintain a separate contract management repository for contract metadata. In the common case, SMEs will continue to have PDFs scattered across a set of directories on shared drives and personal laptops; they will live in email inboxes and archives; they may even need to be restored from backup. (I have heard of a tech startup that makes a very good living simply searching through network drives for PDFs and determining whether or not each one is a contract.) A tool that reads the metadata in PDFs, and does the right thing with it, is an important first step.
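Assuming the metadata has already been read out of each PDF, the queries above reduce to ordinary filtering. The sketch below is illustrative only: the `ContractMeta` fields and file names are invented, and "6 months" is approximated as 183 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ContractMeta:
    """Hypothetical metadata record recovered from one contract PDF."""
    filename: str
    counterparty_city: str
    value_usd: int
    expires: date


def expiring_high_value(contracts, today, min_value_usd=2_000_000, within_days=183):
    """Contracts worth at least min_value_usd that expire within the window."""
    cutoff = today + timedelta(days=within_days)
    return [
        c for c in contracts
        if c.value_usd >= min_value_usd and today <= c.expires <= cutoff
    ]


folder = [
    ContractMeta("supply.pdf", "New York", 3_000_000, date(2021, 4, 30)),
    ContractMeta("lease.pdf", "Singapore", 500_000, date(2021, 3, 1)),
    ContractMeta("licence.pdf", "New York", 2_500_000, date(2022, 1, 1)),
    ContractMeta("old.pdf", "London", 2_000_000, date(2020, 12, 1)),
]

urgent = expiring_high_value(folder, today=date(2021, 1, 1))
in_new_york = [c for c in folder if c.counterparty_city == "New York"]
print([c.filename for c in urgent])
print([c.filename for c in in_new_york])
```

No database is involved: the list could be rebuilt on demand by walking a shared drive and reading each file's embedded metadata.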
Open Platforms and the Chicken-and-Egg Problem
So why hasn’t something like this already taken over the world? One answer: “It’s a problem of incentives. Every[one] who makes proprietary document assembly wants this metadata to stay in their walled garden,” said technologist lawyer Sol Irvine in a tweet.

Today’s document assembly and contract analytics solutions are largely closed. Their position on interoperability: “well, of course we can play nicely with everyone in the world; everyone in the world just needs to become our customer first.” It’s easy to misinterpret the success of Apple, Microsoft, and Facebook as a triumph of the walled garden, but where protocols are concerned, iPhones still talk to Android phones using WhatsApp, Telegram, and Signal – protocols that are open and platform-agnostic. Without HTTP there would be no Facebook. Without Linux there would be no AWS.

That’s why I’m putting my money on open standards for embedding contract metadata. The history of the Internet teaches that the arc of the technology universe is long, but it bends toward openness. At the Centre for Computational Law, our grant requires us to publish our software under an open-source license.

Who are the other open players in this space? In 2017, the Accord Project launched under the auspices of the Linux Foundation’s Hyperledger collaboration. It offers Cicero, a template system for contracts, together with Concerto, a data modeling language for contract schemas. Common Accord demonstrates “Prose Objects”: template generation through clause composition. Another project offers “an open repository of legal forms.” DocAssemble is a full-featured expert system using Python and YAML to develop wizards that ultimately produce docx and PDF.

In all these systems, data schemas can be found, explicit or implicit. Moving forward, we can extend these systems to embed metadata using an agreed schema developed using JSON-Schema, XML Schema, RDF, UML, or Typescript. The specifics don’t matter as much as the willingness of both sender and receiver sides to collaborate. I believe players like DocuSign and Adobe will embrace such an approach, as they move upstream from electronic signatures to contract construction. Even contracts that originate entirely “by hand” in Microsoft Word could lift themselves up to compatibility with this system, with the right Word plugins. In the future, perhaps law schools will expose students to these ideas in their contract drafting classes.
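To give a flavor of what an agreed schema might look like, here is a sketch in JSON Schema notation, written as a Python dictionary. The field names are hypothetical, and the small `check` function covers only the `required` and `type` keywords; it is a stand-in for illustration, not a JSON Schema implementation.

```python
import json

# Hypothetical schema for contract metadata; the field names are invented.
CONTRACT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "ContractMetadata",
    "type": "object",
    "required": ["parties", "value_usd", "expires"],
    "properties": {
        "parties": {"type": "array"},
        "value_usd": {"type": "integer"},
        "expires": {"type": "string"},
        "governing_law": {"type": "string"},
    },
}

_PY_TYPES = {"array": list, "integer": int, "string": str, "object": dict}


def check(instance: dict, schema: dict) -> list[str]:
    """Report missing required fields and mistyped values (subset of the spec)."""
    problems = []
    for key in schema["required"]:
        if key not in instance:
            problems.append(f"missing required field: {key}")
    for key, value in instance.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # unknown fields are ignored in this sketch
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so exclude it explicitly.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            problems.append(f"{key}: expected {spec['type']}")
    return problems


good = {"parties": ["Acme", "Jane Doe"], "value_usd": 2500000, "expires": "2021-04-30"}
bad = {"parties": ["Acme"], "value_usd": "2 million"}

print(json.dumps(CONTRACT_SCHEMA, indent=2))
print(check(good, CONTRACT_SCHEMA))
print(check(bad, CONTRACT_SCHEMA))
```

The sender would validate against the schema before embedding; the receiver would validate again after extracting, so that both sides agree on what the fields mean.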
Data Structures, Control Structures
We’ve discussed the value of embedding metadata; you can quickly query a folder full of contracts without spending hours eyeballing them individually.

But what about this query: “A certain event has happened; some contracts may consider that to be a qualifying event for the purposes of some trigger; so show me all the contracts which oblige me to send a notice to the counterparty pursuant to the qualifying event; in each case, tell me how much time I have to send the notice, and whether I need to send the notice by registered mail or if email will do.”

That goes beyond just a simple data query; for a machine to answer that question it needs to understand the logic of the contract – the moving parts, the definitions, the state of play as it changes over time.
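A toy Python model shows the shape of the problem. It assumes each contract's notice obligations have already been encoded as rules; the `NoticeRule` and `Contract` types, the trigger names, and the sample contracts are all hypothetical. This is far cruder than a real contract language, and it is not L4 syntax.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class NoticeRule:
    """Hypothetical encoding of one notice obligation."""
    trigger: str               # the kind of qualifying event
    days_to_notify: int        # time allowed after the event
    methods: tuple[str, ...]   # acceptable ways to deliver the notice


@dataclass(frozen=True)
class Contract:
    name: str
    rules: tuple[NoticeRule, ...]


def notices_due(contracts, event, event_date):
    """List (contract, deadline, methods) for each rule the event triggers."""
    due = []
    for contract in contracts:
        for rule in contract.rules:
            if rule.trigger == event:
                deadline = event_date + timedelta(days=rule.days_to_notify)
                due.append((contract.name, deadline, rule.methods))
    return sorted(due, key=lambda item: item[1])  # most urgent first


portfolio = [
    Contract("Supply Agreement", (
        NoticeRule("change_of_control", 30, ("registered_mail",)),
    )),
    Contract("Software Licence", (
        NoticeRule("change_of_control", 10, ("email", "registered_mail")),
        NoticeRule("data_breach", 3, ("email",)),
    )),
    Contract("Office Lease", ()),
]

for name, deadline, methods in notices_due(
    portfolio, "change_of_control", date(2021, 1, 15)
):
    print(name, deadline.isoformat(), ", ".join(methods))
```

Real contracts are harder than this: whether an event "qualifies" usually depends on definitions, conditions, and the state of play at the time, which is what a flat rule table cannot express.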

And that’s where (finally) digital contract languages come in. Our language, L4, is designed for this scenario. It does for the moving parts, the “control structure” of a contract, what the JSON schema does for the static parts, the “data structure”. Wouldn’t it be nice if the entire semantics of a contract could be embedded in the PDF, alongside the plain data elements? Then a contract would be truly computable. While our contract language has much in common (semantically) with many of the others developed in recent years, we are optimizing for adoption: syntax, tooling, IDE support, tutorials, and web REPLs. We mentioned SQL above for a reason: there are more developers alive today who know SQL than there are lawyers in practice. Our language is inspired by SQL, at least syntactically, and we hope that our language will have as many users in future as SQL does today.

Future posts in this series will discuss the progress of the metadata proposal, and explain the syntax and semantics of L4, using case studies to illustrate its real-world applicability.

About
SMU’s Research Programme in Computational Law intends to release a suite of open-source software based on open standards, designed for use by technologists, lawyers, and non-lawyers alike. This project continues a line of work begun at CodeX and joins the ranks of a handful of efforts around the world seeking to realize the vision of “spreadsheets for law” first articulated by Michael Genesereth.

This research is supported by the National Research Foundation (NRF), Singapore, under its Industry Alignment Fund – Pre-Positioning Programme, as the Research Programme in Computational Law. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.