AI Data Stewardship Framework

Data is the lifeblood of generative AI applications. And while these applications depend on access to enormous amounts of it, volume alone is not enough. Data quality and quantity are equally important; scarcity in one inevitably destabilizes the other. The dynamics of this relationship become most evident in the performance of these applications: they are ultimately only as good as the data they train on. There are various ways of defining the optimal data set, one that exhibits sufficient quality and quantity; here I call it simply “high quality” data. It is therefore critical to have and maintain policies and procedures that are specifically designed to ensure high quality data is continuously provided. I refer to this overall effort as the AI Data Stewardship Framework (AI-DSF).

To remain relevant, the AI-DSF is periodically updated. The Use Cases for the AI-DSF section offers examples of when reference to the framework makes sense. Finally, the Notes and Discussion section is reserved for taking a closer look at the various controls and at practical ways in which they can come into play.

Who is this for? The AI-DSF can be used by data brokers, data consumers, companies that build generative AI applications (and their boards), AI auditors, regulators (e.g., the FTC), courts, and lawmakers. It is not limited to data sourced from the public domain and can be a valuable reference for enterprise applications that, for example, use generative AI.

The AI-DSF structure is composed of: Basic Controls, Foundational Controls, and Organizational Controls. All of these controls are internal (the organization) and external (supply chain) facing. Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies.

Basic Controls

  • Data Governance – Policies, processes, procedures, and practices comply with relevant legal and regulatory requirements as well as relevant standards, best practices, and industry guidelines; data licensing requirements are routinely reviewed, updated, and enforced throughout relevant functions of the organization; all Basic and Foundational Control functions are coordinated and aligned with legal, regulatory, and contractual requirements; application design, development, and life cycle are aligned with data dimension principles (“data dimension” is discussed in more detail below); relevant principles of the AI Life Cycle Core Principles are integrated into organizational policies, processes, procedures, and practices.
  • Data Inventory – Includes controls for data type – public, expert, or synthetic; Sets purpose, scope, roles, and responsibilities for data collection, creation, use, and retention; Identifies: (i) categories of external data sources (e.g., service providers, partners, customers); (ii) categories of internal data sources (e.g., employees, prospective employees, contractors); (iii) how data set diversity and sufficiency are established and monitored; and (iv) the data storage environment – geographic location, internal, cloud, third party; Data supply chain management practices are periodically reviewed and updated to comply with relevant laws, regulations, best practices, and organizational risk tolerance; Addresses the data life cycle, including compliance with legal and other variables; Data licensing requirements are reviewed and complied with prior to dataset ingestion.
  • Continuous Data Vulnerability Management – Data quality baseline is set and regularly monitored; Proven anomaly detection tools are continuously used; Detected data anomalies are analyzed to determine cause, vector, and assess operational and legal impact (necessary for the Data Incident Response Plan); Corrective action is implemented and documented within established reasonable timeframes; Data deletion and data unlearning methodologies are readily available and implementable; Implements policies and procedures to regularly test for alignment with enabling explainability (XAI); Policies and procedures are designed with reference to well-accepted standards.
  • Data Incident Response Plan – A formal structure for planning and implementing an effective response to a detected data vulnerability.
  • Secure Configuration for Data – Aligns with internal (e.g., IT policy) and external (e.g., supply chain management) policies and procedures; Cybersecurity environment is maintained in accordance with well-defined and accepted standards.
  • Maintenance, Monitoring, and Analysis of Data – Roles and responsibilities are identified and formalized for human oversight; Acceptable Use policies are in place and enforced; A lessons-learned approach is implemented for analyzing vulnerabilities, threats, and incidents of compromise; Periodic use of Data Protection Impact Assessment (DPIA).
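The Continuous Data Vulnerability Management control calls for setting a data quality baseline and continuously monitoring for anomalies. As a minimal illustration of what that can look like in practice (the metric, function names, and threshold below are hypothetical, not part of the framework), a batch-level quality metric such as a null-value rate can be baselined from history, with incoming batches flagged when they deviate:

```python
import statistics

def build_baseline(null_rates):
    """Compute a simple data-quality baseline (mean, std dev)
    from historical per-batch null-value rates."""
    return statistics.mean(null_rates), statistics.stdev(null_rates)

def is_anomalous(rate, baseline, threshold=3.0):
    """Flag a batch whose null rate deviates more than
    `threshold` standard deviations from the baseline."""
    mean, std = baseline
    if std == 0:
        return rate != mean
    return abs(rate - mean) / std > threshold

# Historical null rates for six ingested batches (illustrative values).
history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
baseline = build_baseline(history)

print(is_anomalous(0.011, baseline))  # → False (in line with history)
print(is_anomalous(0.200, baseline))  # → True (candidate data incident)
```

A flagged batch would then feed the analysis step the control requires — determining cause and vector and assessing operational and legal impact — and, where warranted, trigger the Data Incident Response Plan.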

Foundational Controls

  • Data Storage Protections – Physical environment monitoring is consistent with well established best practices and/or standards, such as NIST SP 800-53.
  • Data Threat Defenses – Implements and maintains processes and procedures to identify and mitigate internal (e.g., employees) and external (e.g., hackers) threats; Incorporates threat and vulnerability information from information sharing sources; References ISO 27001 and is periodically updated against it.
  • Data Provenance Protections – Employs blockchain-based security mechanisms; Where alternative or compensating security methods are used, sufficient documentation is available to demonstrate reasonableness of selection; Implements guardrails to protect against use of unlicensed, unverified, and unintended data sets; Implements protection against use of unlicensed output data; Implements protection against using outdated data; Implements safeguards against data poisoning; Data licensing practices are regularly reviewed to ensure alignment with intended use.
  • Secure Configuration for all Data Sources – All computing and data storage assets are aligned with security configurations in accordance with the organization’s IT policy.
  • Data Sources Boundary Defense – Data flow is continuously monitored; Only pre-approved external data sources are permitted access and use.
  • Controlled Access to Data Sources – Access is restricted to pre-approved data sources.
  • Audit and Control – Internal and external (supply chain) periodic audit of all Foundational Controls; findings are regularly provided to senior management and board of directors; Identified vulnerabilities are promptly dealt with and documented.
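Blockchain is one mechanism named under Data Provenance Protections, but the underlying idea — record a content fingerprint when a dataset is ingested and verify it before use — can be sketched without one. A minimal illustration (the in-memory ledger and dataset identifiers are hypothetical; a production system would use an append-only, tamper-evident store):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 digest recorded in the provenance ledger at ingestion time."""
    return hashlib.sha256(data).hexdigest()

def verify_provenance(data: bytes, ledger: dict, dataset_id: str) -> bool:
    """True only if the dataset is registered and its content is unaltered."""
    expected = ledger.get(dataset_id)
    return expected is not None and fingerprint(data) == expected

# Register a licensed dataset at ingestion.
ledger = {}
raw = b"licensed training corpus v1"
ledger["corpus-v1"] = fingerprint(raw)

# Verify before training: tampered or unregistered data is rejected.
print(verify_provenance(raw, ledger, "corpus-v1"))          # → True
print(verify_provenance(b"tampered", ledger, "corpus-v1"))  # → False
print(verify_provenance(raw, ledger, "unknown-set"))        # → False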

Organizational Controls

  • Implement Data Stewardship Program – The program has formal, documented support from senior management and is subject to periodic review and audit by the board of directors and outside auditors.
  • Data Incident Response Management – Data incidents are managed through a formal, documented process that is reviewed by relevant members of senior management.
  • Fuzzing Tests and other Red Team Exercises – Vulnerabilities in the Basic and Foundational Controls are periodically tested, documented, and reported to senior management.

Purpose

The AI-DSF aims to formalize the generation, supply, protection, and use of high quality data for training algorithms. Doing something once or a handful of times will not yield a meaningful, positive result, especially for a complex undertaking like the one detailed in this framework. Hence the term “formalize”: every stewardship task must be rendered repeatable and measurable. Underlying all of this effort is effective governance. Senior leadership buy-in, including the board of directors, is essential and a central characteristic of the framework’s formal nature.

“High quality” is a complex attribute in that it contains numerous variables, or “dimensions.” MIT researchers Richard Wang and Lisa Guarascio list 20 data dimensions: believability, value added, relevancy, accuracy, interpretability, ease of understanding, accessibility, objectivity, timeliness, completeness, traceability, reputation, representational consistency, cost effectiveness, ease of operation, variety of data & data sources, conciseness, access security, appropriate amount of data, and flexibility. (See Dimensions of Data Quality: Toward Quality Data by Design, 1991.) I don’t consider this an exhaustive list, and it may even benefit from other changes, such as adding “integrity,” “resilience,” and “hygiene.” But it is an important reference tool for determining what exactly we need to pay attention to in this framework.

Use Cases for the AI-DSF

  1. The AI-DSF can be used by developers in the development gating process. Among the inherent benefits here is that it can help simplify compliance with some of the developer’s contractual obligations. For instance, a typical provision requires that the developer represent and warrant that the application is in compliance with all applicable law and regulations. While this is a very broad requirement, the AI-DSF can help meet part of that obligation insofar as it relates to the qualities of the data that impact privacy, copyright, etc.
  2. Service level agreements (SLAs): An application developer could use the AI-DSF to better draft the data supply agreement and the SLA with the data broker. This helps ensure the data delivery is relevant and responsive to the needs of the algorithm. Why is this important? Consider, for example, a scenario where the data broker delivered data that is accurate, concise, and free of data poisoning, but it was not delivered on time. That is a serious problem: training an algorithm on outdated data can be devastating, so appropriate contractual provisions are vital to mitigate that risk.
  3. Legal rights: Licensees typically rely on a representation and warranty by the licensor as to the quality of the data: the rights in the data (e.g., having an appropriate license), provenance, etc. The AI-DSF provides a useful framework for ensuring the data provided is high quality, not tainted with legal problems such as infringement or other undesirable variables such as offensive, inaccurate, or controversial data.
  4. Guide service agreement negotiations: Subscribers in a service agreement can use the AI-DSF to contractually obligate the provider to maintain select policies and procedures that ensure data integrity. For example, add a requirement to implement and maintain protection against using outdated data.
  5. Compliance with privacy law: The privacy landscape is increasingly complicated. The absence of a federal law in the U.S. has driven more and more states to put privacy laws on the books. This is also a hot topic for regulatory bodies (e.g., the FTC), which remain vigilant and periodically sharpen their guidance and enforcement efforts. On the international front, regulations such as the General Data Protection Regulation (GDPR) create additional layers of complexity that must be complied with. In this setting, adhering to the AI-DSF becomes a useful guide for helping mitigate the risk inherent in using data that might contain private information.
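The timeliness concern in use case 2 can be made contractually testable: express the SLA's freshness requirement as a check that runs at delivery. A minimal sketch (the 30-day window is a hypothetical SLA term, not a recommendation):

```python
from datetime import datetime, timedelta, timezone

def within_sla(delivered_at, collected_at, max_age=timedelta(days=30)):
    """Check that delivered data has not aged past the contractual window."""
    return delivered_at - collected_at <= max_age

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = datetime(2024, 5, 20, tzinfo=timezone.utc)  # collected 12 days ago
stale = datetime(2024, 3, 1, tzinfo=timezone.utc)   # collected 3 months ago

print(within_sla(now, fresh))  # → True: accept the delivery
print(within_sla(now, stale))  # → False: reject or escalate per the SLA
```

A failed check gives the licensee a concrete, logged basis for invoking the SLA's remedies rather than discovering staleness after training.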

Notes and Discussion

  1. The AI-DSF design references: NIST Cybersecurity Framework; NIST Privacy Framework; NIST AI Risk Management Framework; CIS-20 Cybersecurity Controls; Information Commissioner’s Office AI Auditing Framework (draft guidance for consultation).
  2. Data Governance – There are two driving principles here: (1) enable and (2) demonstrate compliance with applicable legal, regulatory, and best practices. The latter can be of particular importance in the organization’s routine operations, such as responding to an RFP or complying with investor and other stakeholder demands.
  3. Blockchain-based security mechanisms are one way to secure data provenance. Where an organization chooses a different method to secure data provenance, it will likely be important to ensure that there is sufficient documentation that explains the rationale for its use. Ultimately, the question will be whether the selected method is reasonable considering all the surrounding circumstances that led to its selection and implementation.
  4. A DPIA is an essential tool for demonstrating that the organization is using legally reasonable means for risk assessment of data used by AI.
  5. The OECD Framework for the Classification of AI Systems addresses data and input. It has “Collection” criteria and “Rights & Identifiability” criteria. Missing from this framework is reference to the Basic Controls of the AI-DSF.
  6. The Data Incident Response Plan guides the response so as to minimize the potential for degraded data materially impacting the AI application(s). A data incident can be defined in a number of ways, for example: any unauthorized instance in which there is an attempt, whether successful or not, to access the data. Limiting the inquiry to whether there was access, as opposed to whether data was tampered with, is but one response option; selecting it may be reasonable depending on the incident’s surrounding circumstances and the severity level that is assigned. Severity levels are typically assigned from one to three. Level one is the least severe, used where the incident can be dealt with internally and has no impact on the organization’s operation. Level three is the most severe: regulatory agencies and/or law enforcement need to be involved, multiple stakeholders are affected, and there is a material risk to the ongoing normal operation of the business. A Data Incident Response Plan can be a subset of the organization’s overall incident response plan or a separate document. However it is implemented, it is important to confirm there are no gaps with other relevant policies, such as the IT policy.
  7. Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies. This requires identifying specific requirements in the contract with the supplier. For example, if the organization uses SHA-256 for hashing, the supplier must either use the same or a stronger algorithm.
  8. Pre-approved data sources are those that are subject to contractual requirements with the data provider.
  9. Timely detecting data vulnerability plays an essential part in properly managing the organization’s risk. For example, demonstrating that your organization regularly looks for and effectively responds to indications of compromise can be valuable for minimizing/eliminating liability in the event of a regulatory inquiry.
  10. Data Provenance Protections – A key function is protection against use of unlicensed output data, which is where an LLM is trained on the output of another LLM. While this practice may have some engineering upside in the form of cutting the amount of training time, the potential legal downside may be significant. From a copyright perspective, as long as such training falls under fair use, the question of liability for infringement can be set aside. But things get murkier from a contractual perspective, where the output data is protected by terms of use. In this setting, training on output data without permission can trigger a breach of contract claim.
  11. OpenAI has until April 30 to comply with the Italian regulator’s data privacy requirements. It is uncertain (to say the least) that OpenAI will be able to comply. France, Germany, Ireland, and Canada are also looking into OpenAI’s data collection and use practices. And there’s more trouble brewing with the European Data Protection Board (EDPB) also setting its sights on OpenAI. All developers of generative AI apps need to pay close attention to these developments. An important takeaway here is for developers to make sure they comply with the AI Data Stewardship Framework or something similar. It may be the only way to satisfy data privacy legal requirements. Update 5-2-2023: TechCrunch reports that ChatGPT is back in Italy. OpenAI made changes to the way it presents its service to Italian users, including an age verification pop-up and alerting the user to a privacy policy with links to a help center. Will the Italy experience guide the way out of other EU/EDPB challenges to generative AI? Maybe. What is also of interest is how this all ties in with another topic I first wrote about in 2011 under the title Maximizing Representative Efficacy – Part I and with its Part II, which came out shortly thereafter. Part I sets the stage, building on an excellent law review article by Richard Craswell, Taking Information Seriously: Misrepresentation and Nondisclosure in Contract Law and Elsewhere. Part II is particularly relevant here because it describes the role of AI apps in helping users understand terms of service. Considering the latest seismic changes in public awareness and apprehension of AI, the role of these apps (the good ones anyway) will be vital for making these services more accessible and compliant with what will likely be rigorous regulatory and enforcement efforts.
  12. Data Incident Response Plan – How does the organization respond and contain an external-based attack vector that pollutes its LLM data set? The manner of response is dependent and guided by the degree of model adherence to AI Life Cycle Core Principle variables such as Resilience. For example, in a model with a high degree of Resilience, the organization may have more time to alert its end user base of the incident and minimize the degree of harm. This can be an important way to maximize the ability to comply with service level agreements and other contractual obligations.
  13. Techniques such as Word Prediction Accuracy (WPA) provide an example of operationalizing part of the Maintenance, Monitoring, and Analysis of Data function. See https://lilt.com/research.
  14. The Continuous Data Vulnerability Management function calls for the availability and implementation of reliable data deletion methodology such as is provided by Machine Unlearning (MU). For more on MU, see Machine Unlearning: its nature, scope, and importance for a “delete culture.” Within this function there are also processes that deal with ongoing validation at various checkpoints during the application’s development and, if relevant, post-deployment.
  15. Data Provenance Control tackles challenges such as copyright infringement. When it comes to training foundation models, there are two primary focal points: (i) whether it is permissible to use the data for training and (ii) whether the output infringes. Training is permissible when it is on data that is not subject to copyright, when it is carried out pursuant to a license (or an implied license, such as the absence of robots.txt or other restriction), or when the use of the data set qualifies as fair use. As for the output, it does not infringe if it qualifies as fair use. In this note we consider what infringement guardrails look like in the context of fair use, focusing on the transformative factor. The transformative variable may be heavily dependent on the quality of the end user’s prompt. Suppose it is true that a sophisticated prompt (e.g., one with significant semantic texture) is more likely to generate output that qualifies as transformative than a simple prompt (lacking significant semantic texture). If so, we can use the prompt’s quality as a factor in scoring the probability of infringing output. This is accomplished by training an algorithm to measure and score the quality of a given prompt. The prompt’s score is the product of comparing the prompt to others that have similar characteristics, by reference to an ontology of prompts. Again, presuming that sophisticated prompts are more likely to yield non-infringing output, the prompt scoring approach can help bring us closer to spotting generative output that more likely passes the transformative test in fair use. Model providers could then put in place prompt guardrails that help the end user lower the probability of generating output with a low transformative score. This would include protection against prompts that are known or likely to generate infringing output. Finally, model providers that opt not to use this prompt scoring approach would not benefit from a safe harbor (if and once something like that is set).
  16. Data Provenance Protections contains a requirement for implementing guardrails to protect against use of unintended data sets. Part of the focus here is to identify whether the AI model is unintentionally using personal data. If it is, appropriate actions need to be taken to prevent use of the model until there is assurance that the personal data has been removed. That the use of personal data is unintentional does not excuse the obligation to comply with applicable laws and regulations. The GDPR, for example, has a number of articles that apply (e.g., Articles 4, 5, 6, 7, 22). Being unaware of their applicability creates unmitigated risk.
  17. Dataset diversity is a key feature of the Data Inventory Controls. It is important for enhancing model alignment with the Fairness life cycle core principle, a principle which aims to manage against unintended disparate treatment and reduce unexpected outcomes. As such, insufficient or lack of dataset diversity can be regarded as a data quality deficiency and risk which can have undesirable downstream consequences. This makes dataset diversity a key contracting checkpoint for end users. In this setting, contractually obligating the system provider to exercise best practices in controlling for dataset diversity helps the end user better manage its risk, internally (e.g., for its employees) and externally (e.g., its licensees).
  18. How were the humans selected for the Reinforcement Learning from Human Feedback (RLHF) task? This is an important question because RLHF plays an important role in teaching the AI agent how to improve its decision-making capabilities. Accordingly, this question should be asked during either the vendor selection process (RFP) or at the contract negotiation phase. A developer that can disclose its RLHF methodology is more effectively aligned with the Transparency core principle. Additionally, this inquiry also aligns both parties, the developer and end user, with the Ethics life cycle core principle, which calls for (in relevant part) the implementation of practices that manifest socially beneficial conduct.
  19. The AI-DSF enables most of the AI life cycle core principles. For example: Accountability, Accuracy, XAI, Fidelity, Governance, Permit, Privacy, Relevant, Security, Transparency, and Trustworthy. Adhering to the AI-DSF, or a similar data stewardship framework, therefore, can help developers and end users reduce their risk in using AI.
  20. The Federal Trade Commission (FTC) has at its disposal the remedy of algorithmic disgorgement to enforce deletion of algorithms created using data for which the developer had no rights or which is in violation of law. Such an outcome can have significantly expensive, even devastating consequences for the developer, and it can also negatively impact the developer’s end users that were using the “tainted” application, potentially even further magnifying the developer’s liability exposure. The AI-DSF helps solve this problem. Properly implemented, the AI-DSF helps the developer ensure it is not using tainted data. And apart from the benefit of not being subject to algorithmic disgorgement, the developer can adopt licensee-friendly terms and conditions with respect to infringement indemnification and defense, which can help differentiate its services from those of its competitors.
  21. Limiting training to datasets comprising: (i) what has been licensed by the organization; (ii) its own creations; (iii) works in the public domain; and (iv) moderated generative AI content helps ensure the organization is not infringing the copyrighted works of others. The benefits of this type of practice also extend beyond the organization. In the case of AI service providers, the benefit of this risk reduction approach can also be flowed down to end users in the form of licensee-friendly, liability-mitigating terms and conditions, namely the infringement indemnification.
  22. The cybersecurity requirement in the Secure Configuration for Data foundational control calls for a regime designed in accordance with standards. The term “standards” is loosely used. It includes ISO 27001 (one of the best known standards), NIST Cybersecurity Framework, CIS Critical Security Controls, PCI DSS, etc. The intent is to ensure that the organization is using a widely accepted data security methodology and not a one-off, informal approach. The more grounded the cybersecurity regime is, the better it aligns with various AI Life Cycle Core Principles, such as: Accountability, Accuracy, Big Data, Ethics, XAI, Fairness, Human Centered, Privacy, Relevant, Reliability, Resilience, Robust, Safety, and Security.
  23. Developers that follow the data licensing guideline (see Data Governance) are in a good position to offer their licensees substantive infringement indemnification and defense protections. This practice is valuable. It can help differentiate developers from competitors that do not follow this process and rely on sweeping liability disclaimers.
  24. Governance and board oversight are critical for ensuring proper implementation of the AI-DSF. For such oversight to be effective, senior management and the relevant board members should have sufficient understanding of the various controls and functions of the AI-DSF, receive periodic reports (as required by the controls), and routinely confirm that the organization has sufficient resources to maintain the AI-DSF. Additionally, a hallmark of effective Data Governance is an operational environment that only uses data that is relevant to the authorized activity. An “authorized activity” derives from a formal agreement (contract) with another party. This is not limited to a B2B agreement and includes legally-valid consent obtained from an individual.
  25. Within the Data Provenance Protections controls there is an emphasis on avoiding challenging, potentially expensive, even devastating scenarios that arise from regulatory or court ordered model disgorgement. It is important to keep in mind that the destructive effect of model disgorgement can extend to the developer’s end users (subscribers). If that occurs, additional liability can pop up due to a breach of contract, subscriber loss, and reputational damage. All said, the risk of failing to adhere to the Data Provenance Protections is so clear and significant that it should be a subject that is regularly inquired upon by the developing company’s senior leadership. (See also Notes 20 and 24.)
  26. A key part of the Continuous Data Vulnerability Management control is dealing with data anomalies. Toxic data is considered a data anomaly. When it occurs, the Data Incident Response Plan should be referenced in the effort to determine cause and vector and to assess operational and legal impact. In addition, corrective measures should be taken to prevent recurrence.
  27. The AI-DSF is designed to help maintain the developer’s alignment with multiple AI Life Cycle Core Principles. It supports, for example, compliance with the Ethics principle by protecting against misinformation and disinformation. Protecting against both of these types of harms is part of the effort to focus on developing and maintaining AI applications that are socially beneficial.
  28. The absence of effective contracting practices with data supply chain members can degrade compliance with the requirements under the Data Inventory Controls. Take, for example, the use of poorly drafted service level agreements. This could be considered as a sign of misalignment with data life cycle management best practices. If that practice is identified (ideally through the periodic review requirements under this control group), a thorough investigation should be conducted into all data supply management practices (not just those relating to contracting).
  29. Statistical exploration of data used/to be used is part of the Continuous Data Vulnerability Management control. Policies and procedures under this control should reference well-accepted standards such as ISO/IEC 42001.
  30. Requiring end users to routinely and manually check generative AI output is important. (This can also be performed via red teaming.) It is part of the Acceptable Use policy which is required by the Maintenance, Monitoring, and Analysis of Data control.
  31. The AI-DSF aligns with the Fair Information Principles (FIPS). FIPS came out in the early 1970s and remain monumentally important for maintaining data privacy. They are built around seven principles required of entities that collect and process personal information: (1) placing limits on information use; (2) formalizing data minimization; (3) limiting disclosure of personal information; (4) collecting and using only information that is accurate, relevant, and up-to-date; (5) enabling individuals with notice, access, and correction rights; (6) building transparent data processing systems; and (7) providing security for personal information.
  32. In the NIST Privacy Framework, reference to the Data Governance control can (also) be found in the CT.PO-P4 subcategory of the “Control” function. Instead of using “govern,” NIST refers to this function as “data life cycle manage[ment].” The “life cycle” here can be seen as cautionary, steering organizations away from adopting a one-off approach to the tasks required under Data Governance.
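The prompt-scoring approach described in Note 15 could be prototyped in many ways. One deliberately simple sketch (nearest-neighbor token overlap; the ontology entries and their scores are invented for illustration, and a real system would use trained semantic models) assigns a prompt the transformative-likelihood score of its closest reference prompt:

```python
def prompt_score(prompt, reference_prompts):
    """Score a prompt's 'semantic texture' by nearest-neighbor token
    overlap against an ontology of reference prompts with known scores."""
    tokens = set(prompt.lower().split())
    best_score, best_overlap = 0.0, 0.0
    for ref_text, ref_score in reference_prompts:
        ref_tokens = set(ref_text.lower().split())
        # Jaccard similarity between the prompt and the reference prompt.
        overlap = len(tokens & ref_tokens) / len(tokens | ref_tokens)
        if overlap > best_overlap:
            best_overlap, best_score = overlap, ref_score
    return best_score

# Hypothetical ontology: (reference prompt, transformative-likelihood score).
ontology = [
    ("draw a mouse", 0.2),  # simple prompt, low transformative likelihood
    ("draw a cubist mouse in the style of a 1920s woodcut, "
     "emphasizing negative space", 0.8),  # semantically rich prompt
]

print(prompt_score("draw a mouse please", ontology))  # → 0.2
```

A guardrail built on such a score could, per Note 15, warn or block when a prompt falls below a provider-set threshold, nudging end users toward prompts more likely to yield transformative output.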