Elasticity Rebooted: Observations from Pilot Testing
by Dr. Megan Ma, Assistant Director, CodeX; and Klaudia Galka, CodeX Affiliate and VP, Underwriting, Swiss Re
In our prior work, we set out to develop a diagnostic tool for elasticity. We argued that elasticity, otherwise understood as strategic or intentional vagueness, offers contractual parties the linguistic space to manage the unknown. Furthermore, we suggested that elasticity is a marker of the implicit relationship between parties. Embedded in the language, then, is the DNA of the parties' trust and their inherent negotiating power. Therefore, by understanding the role of vagueness in contracts, and in effect making the implicit explicit, we may be able to offer insights that encourage more intentional drafting and greater certainty at the contract performance stage.
Our research revealed that linguistic clues, known as epistemic stretchers, implicitly signal the preferences of contractual parties. Notably, these stretchers can reveal intention and strategic uses of vagueness. We developed an abstract framework and assigned a qualitative score to components of contractual clauses. We described this as a structured knowledge model, serving as the semantic fingerprint behind contractual language.
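To make this concrete, the minimal sketch below shows one way such a structured knowledge model might be represented in code. The component texts, stretcher lists, three-point scale, and aggregation rule are illustrative assumptions, not the framework itself.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative elasticity levels; the actual framework assigns a qualitative
# score, not necessarily on this three-point scale.
class Elasticity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class ClauseComponent:
    """A span of clause text plus the epistemic stretchers found in it."""
    text: str
    stretchers: list[str]   # e.g., "reasonable", "material", "may"
    score: Elasticity       # qualitative score assigned to this component

def semantic_fingerprint(components: list[ClauseComponent]) -> dict:
    """Aggregate component-level scores into a clause-level profile.

    A hypothetical aggregation: count components at each elasticity level.
    """
    profile = {level.name: 0 for level in Elasticity}
    for component in components:
        profile[component.score.name] += 1
    return profile

# Example: fragments of an access-to-records style clause.
clause = [
    ClauseComponent("upon reasonable notice", ["reasonable"], Elasticity.MEDIUM),
    ClauseComponent("during normal business hours", [], Elasticity.LOW),
    ClauseComponent("such records as may be material", ["may", "material"], Elasticity.HIGH),
]
print(semantic_fingerprint(clause))  # {'LOW': 1, 'MEDIUM': 1, 'HIGH': 1}
```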
With our then Computable Contracts Developer, William O’Hanley, we constructed an early prototype of the tool, known as the Stretch Factor.
We envisioned that the diagnostic tool would not only act as a mechanism for identifying areas of intentional and strategic vagueness, but would also highlight areas where context might be better formalized to facilitate the implementation of other symbolic approaches, such as computable contracts. Moreover, we argued that, by articulating the reasoning behind elasticity and contractual wording, contract clauses would in the longer term not only be more precise, but also capable of acting as a lens into party behavior.
As we discussed what the next phase of this project might be, the advent of large language models (LLMs) began to play a particularly significant role. That is, the benefits of their scalability and user-friendliness could not be overlooked. In effect, we envisioned that LLMs might be the right testing ground to evaluate the impact of measuring elasticity. We hypothesized that results and observations from testing on LLMs could inform us not only about the generalizability of our framework, but also about its significance relative to risk. Moreover, we might be able to gather deeper insights into contractual drafting processes. For example, in what circumstances could we imagine drafting a low-elasticity clause? Is there a tendency to prefer medium elasticity as the default drafting standard? More specifically, does the use of elasticity differ between clauses and subjects that are well established and new, emerging risks that are simply not yet tested (e.g., Cyber Exclusions)?
And so, we did exactly that, and then took it a step further. We decided to construct the next iteration of the Stretch Factor and launch an early pilot, testing its practical use for reinsurance lawyers. Leveraging Poe, a platform developed by Quora, we were able to build chatbots on top of three different LLMs: (1) OpenAI’s GPT-4; (2) Anthropic’s Claude-2 100K; and (3) Meta’s open-source model Llama-2 70b.
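As a rough illustration of what sits underneath such a chatbot, the sketch below sends a clause to one of the underlying models (GPT-4, via OpenAI's Python client) with a condensed elasticity rubric in the system prompt. The prompt wording and the analyze_clause helper are hypothetical; the pilot bots themselves were configured through Poe rather than through direct API calls.

```python
# Illustrative only: a prompt-driven elasticity analysis against one of the
# underlying models, assuming the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

# Hypothetical condensed rubric; the actual Semantic Fingerprint is richer.
SYSTEM_PROMPT = """You are the Stretch Factor, an elasticity analyst for (re)insurance clauses.
For the clause provided, identify epistemic stretchers (e.g., 'reasonable', 'material'),
assign each clause component a qualitative elasticity score (low / medium / high),
and return a short report: overall elasticity, component breakdown, and drafting notes."""

def analyze_clause(clause_text: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": clause_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(analyze_clause(
        "This policy excludes any loss arising directly or indirectly "
        "from a cyber incident, unless otherwise agreed."
    ))
```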
The first stage of this pilot had several goals. First, we wanted to experiment with how different LLMs react to our Semantic Fingerprint. Second, we wanted to extend beyond more standard clauses (i.e., “Access to Records”) and determine whether our framework could be applied to more complex clauses; in this case, we used the Cyber Exclusion. Third, we wanted the tool to be highly usable, such that any clause could be taken from an existing contract and/or policy and an elasticity analysis, based on our Semantic Fingerprint, would be produced. Unlike our earlier prototype, we wanted to adapt the tool’s output from a simple markup to a report, which we thought would be more informative for the user.
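To illustrate the difference between the two output styles, the hypothetical helpers below contrast a markup, which simply tags stretchers inline, with a short report that summarizes component-level scores. The function names, clause text, and wording are assumptions for illustration only.

```python
# Hypothetical helpers contrasting the two output styles discussed above.

def as_markup(clause_text: str, stretchers: list[str]) -> str:
    """Earlier-prototype style: tag each stretcher inline in the clause text."""
    marked = clause_text
    for stretcher in stretchers:
        marked = marked.replace(stretcher, f"[{stretcher}]")
    return marked

def as_report(clause_text: str, scores: dict[str, str]) -> str:
    """Pilot style: a short report of component-level elasticity."""
    lines = ["Elasticity report", f"Clause: {clause_text}", ""]
    for component, level in scores.items():
        lines.append(f"- '{component}': {level} elasticity")
    return "\n".join(lines)

clause = "The reinsurer shall have access to records upon reasonable notice."
print(as_markup(clause, ["reasonable"]))
print(as_report(clause, {"upon reasonable notice": "medium"}))
```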
We then engaged in a phase of alpha testing across a small sample of senior underwriting and contract experts. We provided them with web access to our three Stretch Factor chatbots and recorded observations of their interactions via user interviews. We determined that our ideal end user, at this stage, is a reinsurance lawyer with a minimum of five years of industry experience. Our reasoning behind this decision was twofold: (1) we wanted to provide a benchmark for the performance of the tool; and (2) we wanted to further establish the tool as a helpful metric that encourages human-machine collaboration.
[Sample tests from each of the chatbots]
Below we enumerate a few early observations:
- There was broad consensus that the tool is considerably effective. Most of our users felt it was not only useful, but genuinely impressive. As they neared the period when the bulk of their client contracts needed to be renegotiated and renewed, the Stretch Factor could provide a valuable second pair of (machine) “eyes” to help assess whether there were any risks that were not intended to be part of the policy coverage.
- Some users were fascinated by how this tool could clarify the extent of “stretchiness” in their clauses. That is, they felt they were able to better understand the linguistic elasticity of their words and to have better control over just how elastic they would like a clause to be. In fact, some confirmed that reinsurance clauses necessarily require some level of elasticity, but that they would prefer it to be consistent with underwriting standards. In effect, an elasticity score provides a metric that could help inform underwriters whether they have drafted a clause within their company’s requirements.
- One particularly enthusiastic user was excited about the prospects of this tool for creating risk hypotheticals, making it possible to conceptualize risk at the contract level.
While these observations were generally positive and highly encouraging for this phase of our pilot, we remain cautious given that the sample size of our user testing was rather small. More importantly, the user interviews did not clarify much about the differences between the various chatbots. Such observations could be rather informative, given that the chatbots were built on three different LLMs. In particular, one of the chatbots was built with Llama-2 70b, with the goal of exploring the potential of open-source models and reducing vendor dependence.
Therefore, in the next round of user testing, we anticipate providing more focused questions to our users and extending the pilot to more underwriting and contract experts across the insurance value chain.