LawProp: Using Quotations to Identify Legal Propositions in Judicial Opinions

Abstract

LawProp is an online tool[1] that extracts and displays the key “legal propositions” expressed in judicial opinions. These propositions are, quite simply, the passages of an opinion that are quoted directly by other, subsequently issued opinions. By analyzing these subsequent quotations, it is possible to create a database of legal propositions containing measures of importance and relatedness. Furthermore, it is possible to automatically generate and display useful “headnotes” for opinions without a manual editorial process.

Introduction

This project was the result of an attempt to address two problems. First, in the common law tradition, the legal system consists largely of legal precedent (i.e., “propositions”) established by judges and expressed in judicial opinions. These propositions are scattered across the vast corpus of American law, buried in otherwise fact-specific discussions, with no intrinsic markers of their precedential value. Consequently, it is difficult to systematically identify the propositions which pertain to a particular legal issue. To prepare a case for litigation, a lawyer must carefully research secondary sources and primary materials to gather and validate the propositions which apply with authority to his or her case. My first question was how I could apply relatively unsophisticated algorithms (embodying little or no legal domain knowledge) to the text of legal opinions in order to extract these propositions automatically, thereby simplifying the process of legal research.

The second problem is related to commercial considerations in the legal research industry. Currently, the legal research industry is dominated by the duopoly of WestLaw and Lexis, whose revenues enable them to employ large staffs capable of manually curating legal opinions and identifying the most significant propositions from each, which are annotated as “headnotes.” While a number of free or lower-cost alternatives to WestLaw and Lexis have been created in recent years, these startups may not have the resources to employ a large editorial staff. Consequently, in order to compete effectively in the legal research market, these companies are in need of automated solutions to provide useful annotations to the primary legal material, in particular headnotes.

My partial solution to both of these problems is to identify the key propositions in an opinion by identifying the passages which are directly quoted by other opinions.[2] This solution has the following advantages:

  1. It is easy to implement, requiring only relatively straightforward text processing.
  2. The propositions can be ranked using the number of times they are quoted as a measure of importance.
  3. Though the identification of the propositions is automated, they are effectively selected by judges who have determined that the quoted language is authoritative and useful in resolving a particular subsequent controversy.

 

Implementation 

The implementation of LawProp involved two steps: (A) parsing legal opinions to find the “outbound” quotes which referenced another opinion; and (B) for each referenced opinion, matching the “inbound” quotes to particular passages within the text.

Finding Outbound Quotes

The first step in the implementation involved collecting a large corpus of judicial opinions. Some of the difficulties involved are discussed later in this paper. Once a corpus was assembled, finding outbound quotations involved the following steps:

  1. Identify the citations.   Programmatically identifying a citation from one opinion to another, given only the raw text of the opinions, is a non-trivial task. In my case, I was aided by the presence of XML markup in the opinions which identified the primary (and in some cases secondary) reporter locations for each opinion (i.e., where the opinion was published) and most instances where that opinion cited to another. With this information in hand, it was only necessary to perform some basic normalization of the citations to accommodate divergent citation styles in order to accurately resolve the referent of each citation.
  2. Find the quotations. A relatively simple regular expression was employed to identify quotations. While quotations generally take the form of an opening mark (‘“’), followed by characters other than a closing mark, followed by a closing mark (‘”’), additional handling is necessary to behave gracefully in the case of multi-paragraph quotations (where each paragraph prior to the last contains an opening quotation mark but no closing quotation mark) and the rare case where an opinion mistakenly omits the closing quotation mark.
  3. Separate the opinion into sentences. To support the citation-attachment heuristic in the next step, the opinions were parsed into grammatical sentences. To accommodate abbreviations unique to legal text, I accumulated a library of over 700 common legal abbreviations to prevent my parser from erroneously splitting sentences at the periods terminating those abbreviations.
  4. Attach quotations to citations. Finally, for each quotation it was necessary to determine whether the quotation was associated with a citation to an opinion (and if so, which). To this end I developed a simple heuristic which prioritized (1) a citation in the same sentence which followed the quotation, (2) a citation in the same sentence preceding the quotation, and finally (3) a citation in the subsequent sentence, provided the latter appeared to be a “citation sentence.”
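
The steps above can be sketched in simplified form in Python. This is an illustration only: the quotation and citation regular expressions are reduced stand-ins (the actual implementation resolved citations from XML markup rather than a pattern), and the “citation sentence” check in step 4 is omitted here.

```python
import re

# Curly double quotes as used in published opinions, with straight
# quotes as a fallback; multi-paragraph handling (step 2) is omitted.
QUOTE_RE = re.compile(r'[“"]([^”"]+)[”"]')

# Stand-in citation pattern (volume / reporter / page, e.g. "410 U.S. 113");
# a stand-in for the XML-derived citation data described in step 1.
CITE_RE = re.compile(r'\b(\d+)\s+([A-Z][A-Za-z.\s]*?\.)\s*(\d+)\b')

def find_quotes(sentence):
    """Return (offset, quoted_text) pairs found in a sentence."""
    return [(m.start(), m.group(1)) for m in QUOTE_RE.finditer(sentence)]

def find_cites(sentence):
    """Return (offset, citation_text) pairs found in a sentence."""
    return [(m.start(), m.group(0)) for m in CITE_RE.finditer(sentence)]

def attach(sentences):
    """Pair each quotation with a citation, preferring (1) a citation
    after the quote in the same sentence, (2) one before it in the same
    sentence, then (3) one in the following sentence."""
    pairs = []
    for i, sent in enumerate(sentences):
        cites = find_cites(sent)
        nxt = find_cites(sentences[i + 1]) if i + 1 < len(sentences) else []
        for pos, quote in find_quotes(sent):
            after = [c for p, c in cites if p > pos]
            before = [c for p, c in cites if p < pos]
            if after:
                pairs.append((quote, after[0]))
            elif before:
                pairs.append((quote, before[-1]))
            elif nxt:
                pairs.append((quote, nxt[0][1]))
    return pairs
```

The heuristic's preference ordering mirrors step 4: a citation trailing the quote in the same sentence is the strongest signal of attribution, with the preceding-citation and next-sentence cases as fallbacks.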

Finding Propositions

With a database of outbound quotations assembled (together with my best guess as to their referent opinions), the next step was to attempt to match each quotation with specific language in the referent opinion. For each referent opinion in my corpus, this process entailed:

  1. Converting each inbound quote into a regular expression, in order to accommodate ellipses, brackets, and certain other common transformations (such as the conversion of double-quotes into single-quotes by the quoting text).
  2. Identifying a single “definite” match in the referenced text (i.e., the only possible matching passage) or, in many cases, multiple “indefinite” matches (i.e., the quoted language occurred in multiple places in the referenced opinion).
  3. Resolving indefinite matches by assuming that if one of the indefinite matches was also a definite match for a different inbound quotation, it was sufficiently likely that the indefinite quote was targeting the same passage as the definite quote.
  4. Normalizing propositions to contiguous passages. Because opinions frequently choose different portions of a sentence to quote, each match was normalized to the entire sentence or set of contiguous sentences that contained all overlapping inbound quotations.
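
Step 1 of this process can be illustrated with a minimal sketch in Python. It handles only the transformations named above (ellipses, bracketed alterations, quote-style conversion) and is not the exact implementation.

```python
import re

def quote_to_pattern(quote):
    """Convert an inbound quotation into a regular expression that
    tolerates ellipses, bracketed alterations (e.g. "[T]he"), and the
    conversion of internal double quotes to single quotes."""
    # Fragments between ellipses / bracketed edits must appear in order,
    # with anything (non-greedy) allowed in the gaps between them.
    fragments = re.split(r'\.\s?\.\s?\.|\[[^\]]*\]', quote)
    pieces = []
    for frag in fragments:
        words = frag.split()
        if not words:
            continue
        # Escape the literal text, allow whitespace runs to vary, and
        # let any quote mark match any other style of quote mark.
        esc = r'\s+'.join(re.escape(w) for w in words)
        esc = re.sub(r'[\'"“”‘’]', '[\'"“”‘’]', esc)
        pieces.append(esc)
    return re.compile('.*?'.join(pieces), re.DOTALL)
```

Searching the referent opinion with the compiled pattern then yields the definite or indefinite matches of step 2, depending on how many positions it matches.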

Presentation

The steps described above created a database containing, for each opinion in the corpus, every proposition in the opinion and the set of opinions referencing each proposition. In order to present this information in a usable format, I created LawProp.co, a website which allows a user to view the text of an opinion preceded by a list of the opinion’s propositions, sorted by importance (i.e., the number of times the proposition has been cited). Furthermore, the listed propositions are hyperlinked to their locations in the text of the opinion, and each proposition is highlighted in yellow where it occurs in the text. The resulting display is functionally similar to the headnote displays contained in WestLaw’s and Lexis’ online tools. Additionally, due to the highlighting, the result can be compared to reading a used law casebook, wherein the previous user has highlighted the key passages of each case. However, the highlighting in LawProp reflects the editorial views of judges, as opposed to law students.

Relatedness

LawProp.co contains several additional features, two of which I will describe. First, for any proposition, a user can view a list of “related” propositions. In order to create and sort a list of related propositions I used a form of co-citation proximity analysis, as described by Gipp and Beel.[3] For a given proposition P, I selected every other proposition Q which was cited in any opinion O in which P was cited, then assigned each Q a relatedness score of 1/n (summed over every such opinion O), where n was the number of characters between the reference to P and the reference to Q in O. As a consequence, propositions which are frequently cited together in the same opinions are found to be related, and this relatedness is enhanced where, for instance, the two propositions are frequently cited together in the same or adjacent paragraphs.
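
Under these definitions, the scoring can be sketched as follows. This is a Python illustration; the representation of references as (proposition, character-offset) pairs is an assumption of the sketch, not a description of LawProp's internals.

```python
from collections import defaultdict

def relatedness(refs):
    """Co-citation proximity scores for a single citing opinion.

    `refs` is a list of (proposition_id, char_offset) pairs, one entry
    per reference appearing in the opinion.  Each co-cited pair (P, Q)
    contributes 1/n, where n is the character distance between the two
    references; totals across a corpus are obtained by summing the
    dictionaries returned for each opinion.
    """
    scores = defaultdict(float)
    for i, (p, pos_p) in enumerate(refs):
        for q, pos_q in refs[i + 1:]:
            if p == q:
                continue
            n = abs(pos_p - pos_q) or 1  # guard against zero distance
            scores[(p, q)] += 1.0 / n
            scores[(q, p)] += 1.0 / n
    return dict(scores)
```

Because each contribution decays with distance, propositions cited in the same paragraph score higher than those cited pages apart, matching the behavior described above.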

Second, I applied the same relatedness algorithm to opinions. On LawProp.co, for a given opinion, a user can see a list of “related opinions,” sorted by frequency of co-citation and co-citation proximity. While this feature does not leverage propositions, it does provide a useful alternative to the typical legal research methodology of traversing the directed graph of “citing opinions” and “opinions cited.” Whereas those methods describe an ancestor/descendant relationship between opinions, co-citation proximity analysis describes a “sibling” relationship, where the sibling status is established by the manner in which judges use the opinions in conjunction with each other.

Results

No statistical method for empirically determining the effectiveness of the algorithms described above was developed for this project. However, the user is invited to explore LawProp.co to assess the effectiveness of this method. A few typical results are described below.

For Roe v. Wade, 410 U.S. 113 (1973), the top three propositions identified are:

  1. “Where certain “fundamental rights” are involved, the Court has held that regulation limiting these rights may be justified only by a “compelling state interest”. . .”
  2. “. . . Pregnancy provides a classic justification for a conclusion of nonmootness. It truly could be “capable of repetition, yet evading review.””
  3. “With respect to the State’s important and legitimate interest in the health of the mother, the “compelling” point, in the light of present medical knowledge, is at approximately the end of the first trimester. . .”

For the second proposition (regarding mootness), the following propositions are most related:

  1. “In a proceeding that may otherwise be deemed moot we have discretion to resolve an issue of continuing public interest that is likely to recur in other cases. . .” Daly v. Superior Court, 19 Cal. 3d 132 (1977).
  2. “Furthermore, the capable-of-repetition doctrine applies only in exceptional situations, and generally only where the named plaintiff can make a reasonable showing that he will again be subjected to the alleged illegality. . .” Los Angeles v. Lyons, 461 U.S. 95 (1983).
  3. “In Southern Pacific Terminal Co. v. ICC, 219 U.S. 498 (1911), where a challenged ICC order had expired, and in Moore v. Ogilvie, 394 U.S. 814 (1969), where petitioners sought to be certified as candidates in an election that had already been held, the Court expressed its concern that the defendants in those cases could be expected again to act contrary to the rights asserted by the particular named plaintiffs involved, and in each case the controversy was held not to be moot because the questions presented were “capable of repetition, yet evading review.”” Sosna v. Iowa, 95 S. Ct. 553 (1975).

Future Work

Below I describe several possible areas for future work building on the methodology described above.

Categorizing Propositions by Topic

It would be beneficial for a legal researcher to be able to look up the leading propositions in his or her jurisdiction relevant to a particular legal topic area. An automated solution to the problem of classifying propositions by topic would replicate the function of WestLaw’s Key Number classification without the cost of manual curation. It may be possible to cluster propositions by applying machine learning algorithms to the text of the propositions. Such an approach could include co-citation as an additional, weighted feature in the analysis. However, because most propositions are relatively short and may not bear the markers of their topic area (which are instead contained in the surrounding context), categorizing the propositions would likely benefit from first categorizing the topic area of the source (and referencing) opinions themselves. Again, a machine learning clustering or classification algorithm may be well suited for this purpose.

Accounting for the Recency and Subsequent Treatment of Propositions

Ideally, any tool that presents propositions would include a measure of each proposition’s authority, as modified by subsequent favorable or unfavorable treatment. My method simply counts the number of times that a given proposition has been cited. Although automating the “Shepardizing” of propositions will be challenging, a simple approach might take into account whether a proposition is being cited with increasing or decreasing frequency over time, as well as the relative importance of the citing opinions.
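
As a rough illustration of the frequency signal, one could fit a least-squares trend to per-year citation counts. This is a sketch only; the per-year counts, and the idea of using a regression slope, are assumptions of the sketch rather than part of the implementation described above.

```python
def citation_trend(counts_by_year):
    """Least-squares slope of citations per year for one proposition:
    a positive slope suggests the proposition is gaining currency,
    a negative slope that it is falling out of use."""
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        return 0.0  # not enough data to estimate a trend
    counts = [counts_by_year[y] for y in years]
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    den = sum((x - mean_x) ** 2 for x in years)
    return num / den
```

Such a slope could then be combined with the raw citation count, and with a weight for the importance of each citing opinion, to produce a treatment-aware authority score.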

Improved Access to Judicial Opinions

This topic merits (and has received) considerable attention in its own right, but it is worth noting that the type of research described in this paper is significantly inhibited by the commercial control of the distribution of legal opinions. While I was able to find a suitable corpus of legal materials for research purposes, the method described cannot be used for actual legal research without access to a comprehensive and current legal database. Furthermore, much of the time spent during implementation was devoted to parsing and resolving citations, work which unfortunately must be replicated by any researcher interested in analyzing the rich relationships between legal opinions encoded in the network of citation and co-citation. While projects such as CourtListener.com and the Free Law Project have made progress toward making this public-domain information freely accessible, much work remains to be done. A complete solution to the commercial control of legal opinions would satisfy the following requirements:

  1. Comprehensive, current coverage. So long as sites such as CourtListener.com have only partial coverage (measured both temporally and geographically), they cannot support the creation of innovative tools beyond the proof-of-concept stage.
  2. Bulk-downloadable. CourtListener.com is laudable for allowing bulk downloads of its growing corpus of legal opinions, a feature that distinguishes it from sites such as Google Scholar, which make opinions freely viewable but not available in bulk. Any solution to the general problem of public availability of legal data should allow complete programmatic access to the data to enable researchers and entrepreneurs to innovate on top of the dataset.
  3. Available for commercial and non-commercial use. Because the most useful innovations in legal research tools are likely to be motivated and backed by profit-seeking entrepreneurs and investors, these contributions will be accelerated by reducing the obstacles to entry for startups.
  4. Includes citation information. Finally, in order to leverage the connections between legal opinions, which are integral to the common law system, it must be possible to resolve these links using publicly available information. Currently, many courts publish their opinions online, but because those opinions are not annotated with their reporter locations, they cannot be analyzed as part of a broader citation network. The straightforward solution to this problem, going forward, is to abandon the system of citing to hard-bound “reporters,” publish each opinion with a Uniform Resource Identifier (URI), and cite to the same. Just as the text of legal opinions is in the public domain, so are the connections between those opinions. The latter cannot be unlocked for the public, however, so long as the method of identifying opinions is mediated through a proprietary publishing system.

Conclusion

Analyzing the pattern of quotations from one opinion to another provides, at a minimum, an easy-to-automate method of identifying the key passages contained within an opinion. This result may provide the basis for further innovation in tools for legal researchers, as well as a cost-effective replacement for manually curated headnotes for legal research startups. The methods described in this paper are simple and were implemented in a few weeks by an amateur programmer. To the extent they are novel, they highlight how much innovation remains to be done in the area of legal research – innovation which would be vastly accelerated by providing researchers and entrepreneurs with unrestricted access to the law.

 

[1] http://www.lawprop.co/. This site is publicly available as of June 2015, but may be taken down in the future.

[2] This idea was inspired by Pablo Arredondo’s project at http://www.wellsettled.com/.

[3] B. Gipp and J. Beel. Citation Proximity Analysis (CPA) – A new approach for identifying related work based on Co-Citation Analysis. In B. Larsen and J. Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09), volume 2, pages 571–575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics. ISSN 2175-1935.