AI Data Stewardship Framework

Data is the lifeblood of generative AI applications. And while these applications depend on access to enormous amounts of it, quantity alone is not enough. Data quality and quantity are equally important; a shortfall in one inevitably undermines the value of the other. The dynamics of this relationship are most evident in the performance of these applications: they are ultimately only as good as the data they train on. There are various ways of defining the optimal data set, one that exhibits sufficient quality and quantity; here I call it simply “high quality” data. It follows that having and maintaining policies and procedures specifically designed to ensure a continuous supply of high quality data is critical. I refer to this overall effort as the AI Data Stewardship Framework (AI-DSF).

To remain relevant, the AI-DSF is periodically updated. The Use Cases for the AI-DSF section offers examples of when reference to the framework makes sense. Finally, the Notes and Discussion section is reserved for taking a closer look at the various controls and at practical ways in which they can come into play.

Who is this for? The AI-DSF can be used by data brokers, data consumers, companies that build generative AI applications (and their boards), AI auditors, regulators (e.g., the FTC), courts, and lawmakers. It is not limited to data sourced from the public domain and can be a valuable reference for enterprise applications that use generative AI.

The AI-DSF structure is composed of three control groups: Basic Controls, Foundational Controls, and Organizational Controls. Each group faces both inward (the organization) and outward (the supply chain). Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies.

Basic Controls

  • Data Governance – Policies, processes, procedures, and practices comply with relevant legal and regulatory requirements as well as relevant standards, best practices, and industry guidelines; data licensing requirements are routinely reviewed, updated, and enforced throughout relevant functions of the organization; all Basic and Foundational Control functions are coordinated and aligned with legal, regulatory, and contractual requirements; application design, development, and life cycle are aligned with data dimension principles (“data dimension” is discussed in more detail below); relevant principles of the AI Life Cycle Core Principles are integrated into organizational policies, processes, procedures, and practices.
  • Data Inventory – Includes control for data type – public, expert, or synthetic; Sets purpose, scope, roles, and responsibilities for data collection, creation, use, and retention; Identifies: (i) categories of external data sources (e.g., service providers, partners, customers); (ii) categories of internal data sources (e.g., employees, prospective employees, contractors); (iii) how data set diversity and sufficiency are established and monitored; and (iv) the data storage environment – geographic location, internal, cloud, third party; Data supply chain management practices are periodically reviewed and updated to comply with relevant laws, regulations, best practices, and organizational risk tolerance; Addresses data life cycle, including compliance with legal and other variables; Data licensing requirements are reviewed and complied with prior to dataset ingestion.
  • Continuous Data Vulnerability Management – Data quality baseline is set and regularly monitored (a minimal baseline-monitoring sketch appears after this list); Proven anomaly detection tools are continuously used; Detected data anomalies are analyzed to determine cause and vector and to assess operational and legal impact (necessary for the Data Incident Response Plan); Corrective action is implemented and documented within established reasonable timeframes; Data deletion and data unlearning methodologies are readily available and implementable; Implements policies and procedures to regularly test for alignment with enabling explainability (XAI); Policies and procedures are designed with reference to well-accepted standards.
  • Data Incident Response Plan – A formal structure for planning and implementing an effective response to a detected data vulnerability.
  • Secure Configuration for Data – Aligns with internal (e.g., IT policy) and external (e.g., supply chain management) policies and procedures; Cybersecurity environment is maintained in accordance with well-defined and accepted standards.
  • Maintenance, Monitoring, and Analysis of Data – Roles and responsibilities are identified and formalized for human oversight; Acceptable Use policies are in place and enforced; A lessons-learned approach is implemented for analyzing vulnerabilities, threats, and incidents of compromise; Periodic use of Data Protection Impact Assessment (DPIA).
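
To make the “baseline and monitor” requirement in Continuous Data Vulnerability Management concrete, here is a minimal Python sketch of one way a data quality baseline could be set and checked per ingested batch. The metric names, thresholds, and helpers (QualityBaseline, profile_batch, detect_anomalies) are hypothetical; in practice, proven anomaly detection tooling would replace the hand-rolled checks.

```python
# Minimal sketch (not a prescribed implementation): set a data quality baseline
# and flag batches that drift from it. Metric names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class QualityBaseline:
    max_null_rate: float = 0.02       # tolerated share of missing values
    max_duplicate_rate: float = 0.01  # tolerated share of exact duplicate records
    min_batch_size: int = 1_000       # batches smaller than this warrant review

def profile_batch(records: list[dict]) -> dict:
    """Compute simple quality metrics for one ingested batch."""
    total_fields = sum(len(r) for r in records) or 1
    nulls = sum(1 for r in records for v in r.values() if v in (None, ""))
    unique = len({tuple(sorted(r.items())) for r in records})
    return {
        "size": len(records),
        "null_rate": nulls / total_fields,
        "duplicate_rate": 1 - unique / max(len(records), 1),
    }

def detect_anomalies(metrics: dict, baseline: QualityBaseline) -> list[str]:
    """Return findings; a non-empty list should feed the Data Incident Response Plan."""
    findings = []
    if metrics["size"] < baseline.min_batch_size:
        findings.append(f"batch too small: {metrics['size']} records")
    if metrics["null_rate"] > baseline.max_null_rate:
        findings.append(f"null rate {metrics['null_rate']:.2%} exceeds baseline")
    if metrics["duplicate_rate"] > baseline.max_duplicate_rate:
        findings.append(f"duplicate rate {metrics['duplicate_rate']:.2%} exceeds baseline")
    return findings
```

Any findings would then flow into the cause-and-vector analysis and the Data Incident Response Plan described above.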

Foundational Controls

  • Data Storage Protections – Physical environment monitoring is consistent with well-established best practices and/or standards, such as NIST SP 800-53.
  • Data Threat Defenses – Implements and maintains processes and procedures to identify and mitigate internal (e.g., employees) and external (e.g., hackers) threats; Incorporates threat and vulnerability information from information sharing sources; References ISO 27001 and is periodically updated against it.
  • Data Provenance Protections – Employs blockchain-based security mechanisms (a minimal hash-chain sketch appears after this list); Where alternative or compensating security methods are used, sufficient documentation is available to demonstrate reasonableness of selection; Implements guardrails to protect against use of unlicensed, unverified, and unintended data sets; Implements protection against use of unlicensed output data; Implements protection against using outdated data; Implements safeguards against data poisoning; Data licensing practices are regularly reviewed to ensure alignment with intended use.
  • Secure Configuration for all Data Sources – All computing and data storage assets are aligned with security configurations in accordance with the organization’s IT policy.
  • Data Sources Boundary Defense – Data flow is continuously monitored; Only pre-approved external data sources are permitted access and use.
  • Controlled Access to Data Sources – Access is restricted to pre-approved data sources.
  • Audit and Control – Internal and external (supply chain) periodic audit of all Foundational Controls; findings are regularly provided to senior management and board of directors; Identified vulnerabilities are promptly dealt with and documented.
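
The “blockchain-based security mechanisms” called for by Data Provenance Protections can be understood through the simpler idea underneath them: a tamper-evident, append-only log of dataset provenance records. The Python sketch below is illustrative only; the record fields and helper names are assumptions rather than features of any particular product, and an alternative or compensating method would need the documentation described above.

```python
# Minimal sketch of a tamper-evident provenance log for ingested datasets.
# Each entry hashes its own metadata plus the previous entry's hash, so any later
# alteration (or quiet insertion of an unlicensed dataset) breaks the chain.
import hashlib
import json

def entry_hash(metadata: dict, prev_hash: str) -> str:
    payload = json.dumps(metadata, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(chain: list[dict], metadata: dict) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"metadata": metadata, "prev": prev, "hash": entry_hash(metadata, prev)})

def verify_chain(chain: list[dict]) -> bool:
    prev = "0" * 64
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(entry["metadata"], prev):
            return False
        prev = entry["hash"]
    return True

# Usage: record license terms at ingestion time; verify before each training run.
log: list[dict] = []
append_entry(log, {"dataset": "corpus-a", "license": "CC-BY-4.0", "source": "vendor-x"})
assert verify_chain(log)
```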

Organizational Controls

  • Implement Data Stewardship Program – The program has formal, documented support from senior management and is subject to periodic review and audit by the board of directors and outside auditors.
  • Data Incident Response Management – Data incidents are managed through a formal, documented process that is reviewed by relevant members of senior management.
  • Fuzzing Tests and other Red Team Exercises – Vulnerabilities in the Basic and Foundational Controls are periodically tested, documented, and reported to senior management.

Purpose

The AI-DSF aims to formalize the generation, supply, protection, and use of high quality data for training algorithms. The term “formalize” is deliberate: doing something once or a handful of times will not yield a meaningful, positive result, especially for an undertaking as complex as the one detailed in this framework. Every stewardship task must be rendered repeatable and measurable. Underlying all of this effort is effective governance; buy-in from senior leadership, including the board of directors, is essential and is a central characteristic of the framework’s formal nature.

“High quality” is a complex attribute in that it contains numerous variables, or “dimensions.” MIT researchers Richard Wang and Lisa Guarascio list 20 data dimensions: Believability, value added, relevancy, accuracy, interpretability, ease of understanding, accessibility, objectivity, timeliness, completeness, traceability, reputation, representational consistency, cost effectiveness, ease of operation, variety of data & data sources, concise, access security, appropriate amount of data, and flexibility. (See Dimensions of Data Quality: Toward Quality Data by Design, 1991.) I don’t consider this an exhaustive list, and it may even benefit from other changes, such as adding “integrity,” “resilience,” and “hygiene.” But it is an important reference tool for determining what exactly we need to pay attention to in this framework.
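
One practical way to use such a list is to treat selected dimensions as a scoring rubric applied to each candidate dataset. The Python sketch below is a hypothetical illustration: the dimension names come from the list above, while the weights and the acceptance threshold are assumptions an organization would set for itself under Data Governance.

```python
# Hypothetical scoring rubric over a subset of the data quality dimensions listed above.
# Weights and the acceptance threshold are illustrative, not prescribed by the AI-DSF.
DIMENSION_WEIGHTS = {
    "accuracy": 0.25,
    "timeliness": 0.20,
    "completeness": 0.20,
    "traceability": 0.15,
    "access_security": 0.10,
    "appropriate_amount_of_data": 0.10,
}
REQUIRED_SCORE = 0.8  # datasets scoring below this trigger review under Data Governance

def dataset_quality_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each expressed in [0, 1]."""
    return sum(weight * scores.get(dim, 0.0) for dim, weight in DIMENSION_WEIGHTS.items())

example = {"accuracy": 0.9, "timeliness": 1.0, "completeness": 0.8,
           "traceability": 0.7, "access_security": 1.0, "appropriate_amount_of_data": 0.9}
print(dataset_quality_score(example) >= REQUIRED_SCORE)  # True (score is 0.88)
```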

Use Cases for the AI-DSF

  1. The AI-DSF can be used by developers in the development gating process. Among the inherent benefits here is that it can help simplify compliance with some of the developer’s contractual obligations. For instance, a typical provision requires that the developer represent and warrant that the application is in compliance with all applicable laws and regulations. While this is a very broad requirement, the AI-DSF can help meet part of that obligation insofar as it relates to the qualities of the data that impact privacy, copyright, etc.
  2. Service level agreements (SLA): An application developer could employ it for determining how to better draft the data supply agreement and the SLA with the data broker. This helps ensure the data delivery is relevant and responsive to the needs of the algorithm. Why is this important? Consider, for example, a scenario where the data broker delivered data that is accurate, concise, and free of data poisoning, but it was not timely delivered. That’s a huge problem. Training an algorithm on data that is outdated can be devastating and so appropriate contractual provisions are vital to mitigate that risk.
  3. Legal rights: Licensees typically rely on a representation and warranty by the licensor as to the quality of the data: the rights in the data (e.g., having an appropriate license), provenance, etc. The AI-DSF provides a useful framework for ensuring the data provided is high quality, not tainted by legal problems such as infringement or by other undesirable attributes such as offensive, inaccurate, or controversial content.
  4. Guide service agreement negotiations: Subscribers in a service agreement can use the AI-DSF to contractually obligate the provider to maintain select policies and procedures that ensure data integrity. For example, add a requirement to implement and maintain protection against using outdated data.
  5. Compliance with privacy law: The privacy landscape is increasingly complicated. The absence of a federal law in the U.S. has driven more and more states to put privacy laws on the books. This is also a hot topic for regulatory bodies (e.g., the FTC), which remain vigilant and periodically sharpen their guidance and enforcement efforts. On the international front, regulations such as the General Data Protection Regulation (GDPR) create additional layers of complexity that must be complied with. In this setting, adhering to the AI-DSF becomes a useful guide for helping mitigate the risk inherent in using data that might contain private information.

Notes and Discussion

  1. The AI-DSF design references: NIST Cybersecurity Framework; NIST Privacy Framework; NIST AI Risk Management Framework; CIS-20 Cybersecurity Controls; Information Commissioner’s Office AI Auditing Framework (draft guidance for consultation).
  2. Data Governance – There are two driving principles here: (1) enable and (2) demonstrate compliance with applicable legal and regulatory requirements and best practices. The latter can be of particular importance in the organization’s routine operations, such as responding to an RFP or complying with investor and other stakeholder demands.
  3. Blockchain-based security mechanisms are one way to secure data provenance. Where an organization chooses a different method to secure data provenance, it will likely be important to ensure that there is sufficient documentation that explains the rationale for its use. Ultimately, the question will be whether the selected method is reasonable considering all the surrounding circumstances that led to its selection and implementation.
  4. A DPIA is an essential tool for demonstrating that the organization is using legally reasonable means for risk assessment of data used by AI.
  5. The OECD Framework for the Classification of AI Systems addresses data and input. It has “Collection” criteria and “Rights & Identifiability” criteria. Missing from this framework is reference to the Basic Controls of the AI-DSF.
  6. The Data Incident Response Plan guides the response so as to minimize the potential for degraded data materially impacting the AI application(s). A data incident can be defined in a number of ways. For example: any unauthorized instance in which there is an attempt, whether successful or not, to access the data. Limiting the inquiry to whether or not there was access, as opposed to whether data was tampered with, is but one response option. Selecting it may be reasonable depending on the incident’s surrounding circumstances and the severity level that is assigned. Severity levels are typically assigned from level one to three. Level one is the least severe and is used where the incident can be dealt with internally and has no impact on the organization’s operation. Level three is the most severe: regulatory agencies and/or law enforcement need to be involved, multiple stakeholders are affected, and there is a material risk to the ongoing normal operation of the business. A Data Incident Response Plan can be a subset of the organization’s overall incident response plan or be a separate document. However it is implemented, it is important to confirm there are no gaps with other relevant policies, such as the IT policy.
  7. Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies. This requires identifying specific requirements in the contract with the supplier. For example, if the organization hashes stored data with SHA-256, the supplier must use the same or a stronger algorithm.
  8. Pre-approved data sources are those that are subject to contractual requirements with the data provider.
  9. Timely detection of data vulnerabilities plays an essential part in properly managing the organization’s risk. For example, demonstrating that your organization regularly looks for and effectively responds to indications of compromise can be valuable for minimizing or eliminating liability in the event of a regulatory inquiry.
  10. Data Provenance Protections – A key function is protection against use of unlicensed output data, which is where an LLM is trained on the output of another LLM. While this practice may have some engineering upside in the form of cutting the amount of training time, the potential legal downside may be significant. From a copyright perspective, as long as such training falls under fair use, the question of liability for infringement can be set aside. But things get murkier from a contractual perspective, where the output data is protected by terms of use. In this setting, training on output data without permission can trigger a breach of contract claim.
  11. OpenAI has until April 30 to comply with the Italian regulator’s data privacy requirements. It is uncertain (to say the least) whether OpenAI will be able to comply. France, Germany, Ireland, and Canada are also looking into OpenAI’s data collection and use practices. And there’s more trouble brewing, with the European Data Protection Board (EDPB) also setting its sights on OpenAI. All developers of generative AI apps need to pay close attention to these developments. An important takeaway here is for developers to make sure they comply with the AI Data Stewardship Framework or something similar. It may be the only way to satisfy data privacy legal requirements. Update 5-2-2023: TechCrunch reports that ChatGPT is back in Italy. OpenAI made changes to the way it presents its service to Italian users, including an age verification pop-up and alerting the user to a privacy policy with links to a help center. Will the Italy experience guide the way out of other EU/EDPB challenges to generative AI? Maybe. What is also of interest is how this all ties in with another topic I first wrote about in 2011 under the title Maximizing Representative Efficacy – Part I and its Part II, which came out shortly thereafter. Part I sets the stage, building on an excellent law review article by Richard Craswell, Taking Information Seriously: Misrepresentation and Nondisclosure in Contract Law and Elsewhere. Part II is particularly relevant here because it describes the role of AI apps in helping users understand terms of service. Considering the latest seismic changes in public awareness and apprehension of AI, the role of these apps (the good ones anyway) will be vital for making these services more accessible and compliant with what will likely be rigorous regulatory and enforcement efforts.
  12. Data Incident Response Plan – How does the organization respond to and contain an external attack vector that pollutes its LLM data set? The manner of response depends on, and is guided by, the degree of model adherence to AI Life Cycle Core Principle variables such as Resilience. For example, in a model with a high degree of Resilience, the organization may have more time to alert its end user base of the incident and minimize the degree of harm. This can be an important way to maximize the ability to comply with service level agreements and other contractual obligations.
  13. Techniques such as Word Prediction Accuracy (WPA) provide an example of operationalizing part of the Maintenance, Monitoring, and Analysis of Data function. See https://lilt.com/research.
  14. The Continuous Data Vulnerability Management function calls for the availability and implementation of reliable data deletion methodology such as is provided by Machine Unlearning (MU). For more on MU, see Machine Unlearning: its nature, scope, and importance for a “delete culture.” Within this function there are also processes that deal with ongoing validation at various checkpoints during the application’s development and, if relevant, post-deployment.
  15. Data Provenance Control tackles challenges such as copyright infringement. When it comes to training foundation models, there are two primary focal points here: (i) whether it is permissible to use the data for training and (ii) whether the output infringes. Permissible training occurs when it is on data that is not subject to copyright, when the training is carried out pursuant to a license (or an implied license, such as the absence of robots.txt or other restriction), or when the use of the data set qualifies as fair use. As for the output, it does not infringe if it qualifies as fair use. In this note we consider what infringement guardrails look like in the context of fair use, and for this we focus on the transformative factor of fair use. We can see that the transformative variable may be much too dependent on the quality of the end user’s prompt. Now, suppose that it’s true that a sophisticated prompt (e.g., one with significant semantic texture) is more likely to generate output that qualifies as transformative than a simple prompt (lacking significant semantic texture). If that is the case, we can use the prompt’s quality as a factor in scoring the probability of infringing output. This is accomplished by training an algorithm to measure and score the quality of a given prompt. The prompt’s score is the product of comparing the prompt to others that have similar characteristics, and this is done by reference to an ontology of prompts (a minimal sketch of this comparison appears at the end of these notes). Again, presuming it is true that sophisticated prompts are more likely to yield non-infringing output, the prompt scoring approach can help bring us closer to spotting generative output that more likely passes the transformative test in fair use. Model providers could then put in place prompt guardrails that help the end user lower the probability of generating output with a low transformative score. This would include protection against prompts that are known or are likely to generate infringing output. Finally, model providers that opt not to use this prompt scoring approach would not benefit from a safe harbor (if and once something like that is set).
  16. Data Provenance Protections contains a requirement for implementing guardrails to protect against use of unintended data sets. Part of the focus here is to identify whether the AI model is using personal data unintentionally. If it is, appropriate actions need to be taken to prevent use of this model until there is assurance that the personal data has been removed. Whether the use of personal data is intentional is irrelevant to the obligation to comply with applicable laws and regulations. The GDPR, for example, has a number of articles that apply (e.g., Articles 4, 5, 6, 7, 22). Being unaware of their applicability creates unmitigated risk.
  17. Dataset diversity is a key feature of the Data Inventory Controls. It is important for enhancing model alignment with the Fairness life cycle core principle, a principle which aims to manage against unintended disparate treatment and reduce unexpected outcomes. As such, insufficient or lack of dataset diversity can be regarded as a data quality deficiency and risk which can have undesirable downstream consequences. This makes dataset diversity a key contracting checkpoint for end users. In this setting, contractually obligating the system provider to exercise best practices in controlling for dataset diversity helps the end user better manage its risk, internally (e.g., for its employees) and externally (e.g., its licensees).
  18. How were the humans selected for the Reinforcement Learning from Human Feedback (RLHF) task? This is an important question because RLHF plays an important role in teaching the AI agent how to improve its decision-making capabilities. Accordingly, this question should be asked during either the vendor selection process (RFP) or at the contract negotiation phase. A developer that can disclose its RLHF methodology is more effectively aligned with the Transparency core principle. Additionally, this inquiry also aligns both parties, the developer and end user, with the Ethics life cycle core principle, which calls for (in relevant part) the implementation of practices that manifest socially beneficial conduct.
  19. The AI-DSF enables most of the AI life cycle core principles. For example: Accountability, Accuracy, XAI, Fidelity, Governance, Permit, Privacy, Relevant, Security, Transparency, and Trustworthy. Adhering to the AI-DSF, or a similar data stewardship framework, therefore, can help developers and end users reduce their risk in using AI.
  20. The Federal Trade Commission (FTC) has at its disposal the remedy of algorithmic disgorgement to enforce deletion of algorithms created using data for which the developer had no rights or which is in violation of law. Such an outcome can have significantly expensive, even devastating consequences for the developer, and it can also negatively impact the developer’s end users that were using the “tainted” application, potentially even further magnifying the developer’s liability exposure. The AI-DSF helps solve this problem. Properly implemented, the AI-DSF helps the developer ensure it is not using tainted data. And apart from the benefit of not being subject to algorithmic disgorgement, the developer can adopt licensee-friendly terms and conditions with respect to infringement indemnification and defense, which can help differentiate its services from those of its competitors.
  21. Limiting training to datasets composed of: (i) what has been licensed by the organization; (ii) its own creations; (iii) works in the public domain; and (iv) moderated generative AI content helps ensure the organization is not infringing the copyrighted works of others. The benefits of this type of practice also extend beyond the organization. In the case of AI service providers, the benefit of this risk reduction approach can also be flowed down to end users in the form of licensee-friendly liability mitigating terms and conditions, namely the infringement indemnification.
  22. The cybersecurity requirement in the Secure Configuration for Data foundational control calls for a regime designed in accordance with standards. The term “standards” is loosely used. It includes ISO 27001 (one of the best known standards), NIST Cybersecurity Framework, CIS Critical Security Controls, PCI DSS, etc. The intent is to ensure that the organization is using a widely accepted data security methodology and not a one-off, informal approach. The more grounded the cybersecurity regime is, the better it aligns with various AI Life Cycle Core Principles, such as: Accountability, Accuracy, Big Data, Ethics, XAI, Fairness, Human Centered, Privacy, Relevant, Reliability, Resilience, Robust, Safety, and Security.
  23. Developers that follow the data licensing guideline (see Data Governance) are in a good position to offer their licensees substantive infringement indemnification and defense protections. This practice is valuable. It can help differentiate developers from competitors that do not follow this process and rely on sweeping liability disclaimers.
  24. Governance and board oversight are critical for ensuring proper implementation of the AI-DSF. For such oversight to be effective, senior management and the relevant board members should have sufficient understanding of the various controls and functions of the AI-DSF, receive periodic reports (as required by the controls), and routinely confirm that the organization has sufficient resources to maintain the AI-DSF. Additionally, a hallmark of effective Data Governance is an operational environment that only uses data that is relevant to the authorized activity. An “authorized activity” derives from a formal agreement (contract) with another party. This is not limited to a B2B agreement and includes legally-valid consent obtained from an individual.
  25. Within the Data Provenance Protections controls there is an emphasis on avoiding challenging, potentially expensive, even devastating scenarios that arise from regulatory or court-ordered model disgorgement. It is important to keep in mind that the destructive effect of model disgorgement can extend to the developer’s end users (subscribers). If that occurs, additional liability can arise from breach of contract, subscriber loss, and reputational damage. All said, the risk of failing to adhere to the Data Provenance Protections is so clear and significant that it should be a regular subject of inquiry for the developing company’s senior leadership. (See also Notes 20 and 24.)
  26. A key part of the Continuous Data Vulnerability Management control is dealing with data anomalies. Toxic data is considered a data anomaly. When it occurs, the Data Incident Response Plan should be referenced in the effort to determine cause and vector and to assess operational and legal impact. In addition, corrective measures should be taken to prevent recurrence.
  27. The AI-DSF is designed to help maintain the developer’s alignment with multiple AI Life Cycle Core Principles. It supports, for example, compliance with the Ethics principle by protecting against misinformation and disinformation. Protecting against both of these types of harms is part of the effort to focus on developing and maintaining AI applications that are socially beneficial.
  28. The absence of effective contracting practices with data supply chain members can degrade compliance with the requirements under the Data Inventory Controls. Take, for example, the use of poorly drafted service level agreements. This could be considered as a sign of misalignment with data life cycle management best practices. If that practice is identified (ideally through the periodic review requirements under this control group), a thorough investigation should be conducted into all data supply management practices (not just those relating to contracting).
  29. Statistical exploration of data used/to be used is part of the Continuous Data Vulnerability Management control. Policies and procedures under this control should reference well-accepted standards such as ISO/IEC 42001.
  30. Requiring end users to routinely and manually check generative AI output is important. (This can also be performed via red teaming.) It is part of the Acceptable Use policy which is required by the Maintenance, Monitoring, and Analysis of Data control.
  31. The AI-DSF aligns with the Fair Information Principles (FIPS). FIPS came out in the early 1970s and remains monumentally important for maintaining data privacy. It is built around seven principles that are required of entities that collect and process personal information: (1) placing limits on information use; (2) formalizing data minimization; (3) limiting disclosure of personal information; (4) collecting and using only information that is accurate, relevant, and up-to-date; (5) providing individuals with notice, access, and correction rights; (6) building transparent data processing systems; and (7) providing security for personal information.
  32. In the NIST Privacy Framework, reference to the Data Governance control can (also) be found in the CT.PO-P4 subcategory of the “Control” function. Instead of using “govern,” NIST refers to this function as “data life cycle manage[ment].” The “life cycle” here can be seen as cautionary, steering organizations away from adopting a one-off approach to the tasks required under Data Governance.
  33. Satisfying the Data Sources Boundary Defense and the Controlled Access to Data Sources controls requires (among other things) effective data supply chain management practices. For example, a Network Access Agreement or similar contract should be in place before any access is granted. The term of the agreement should be carefully considered, with the default choice being a shorter duration and renewal of the agreement conditioned on the data supplier satisfying all security requirements. This approach helps prevent scenarios where suppliers have outdated or otherwise poorly managed access to data assets.