AI Data Stewardship Framework

Data is the lifeblood of generative AI applications; these apps are ultimately only as good as the data they are trained on. It follows that maintaining policies and procedures specifically designed to ensure a continuous supply of high-quality data is critical. I refer to this overall effort as the AI Data Stewardship Framework (AI-DSF).

To remain relevant, the AI-DSF is periodically updated. As its name implies, the Use Cases for the AI-DSF section offers examples of when reference to the framework makes sense. Finally, the Notes and Discussion section (below) is reserved for taking a closer look at the various controls and at practical ways in which they can come into play.

Who is this for? The AI-DSF can be used by data brokers, data consumers, companies that build generative AI applications, AI auditors, regulators (e.g., the FTC), courts, and lawmakers. It is not limited to data sourced from the public domain and can be a valuable reference for enterprise applications that, for example, use generative AI.

The AI-DSF is composed of three groups of controls: Basic Controls, Foundational Controls, and Organizational Controls. All of these controls are both internal (the organization) and external (supply chain) facing. Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies.
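
The meet-or-exceed standard can be checked mechanically in audit or onboarding tooling. The sketch below is a minimal illustration, not a definitive implementation: the strength ordering and the hashing scenario are assumptions for illustration, and an organization would substitute rankings drawn from its own security policy.

```python
# Hypothetical ranking of hash algorithms from weakest to strongest;
# the actual ordering and approved set would come from the
# organization's own security policy.
HASH_STRENGTH = ["md5", "sha1", "sha256", "sha384", "sha512"]

def meets_or_exceeds(org_algorithm, supplier_algorithm):
    """True if the supplier's declared algorithm is at least as strong
    as the one the organization's policy mandates."""
    try:
        return (HASH_STRENGTH.index(supplier_algorithm.lower())
                >= HASH_STRENGTH.index(org_algorithm.lower()))
    except ValueError:
        return False  # unknown algorithm: fail closed

print(meets_or_exceeds("sha256", "sha512"))  # True: stronger is acceptable
print(meets_or_exceeds("sha256", "sha1"))    # False: weaker fails
```

The same meet-or-exceed pattern generalizes to other contractual controls, such as minimum audit frequency or retention limits.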

Basic Controls

  • Data Governance – Captures relevant legal and regulatory requirements; All control functions are coordinated and aligned with legal and regulatory requirements; Promotes consistent alignment with data dimension principles (“data dimension” is discussed in more detail below); Relevant principles of the AI Life Cycle Core Principles are integrated into organizational policies, processes, procedures, and practices.
  • Data Inventory Controls – Includes control for data type – public, expert, or synthetic; Sets purpose, scope, roles, and responsibilities; Identifies: (i) Categories of external data sources (e.g., service providers, partners, customers); (ii) Categories of internal data sources (e.g., employees, prospective employees, contractors); (iii) The data storage environment – geographic location, internal, cloud, third party; Data supply chain management practices are periodically reviewed and updated to comply with relevant laws, regulations, best practices, and organizational risk tolerance; Addresses data lifecycle, including legal and other determinants.
  • Continuous Data Vulnerability Management (ties in with data observability practices) – Data quality baseline is set; Proven anomaly detection tools are continuously used; Detected data anomalies are analyzed to determine cause, vector, and assess operational and legal impact (necessary for the Data Incident Response Plan); Corrective action is implemented within reasonable timeframes; Data deletion methodologies are readily available and implementable.
  • Data Incident Response Plan – A formal structure for planning and implementing an effective response to a detected data vulnerability.
  • Secure Configuration for Data – Aligns with internal (e.g., IT policy) and external (e.g., supply chain management) policies and procedures.
  • Maintenance, Monitoring, and Analysis of Data – Roles and responsibilities are identified and formalized for human oversight; A lessons-learned approach is implemented for analyzing vulnerabilities, threats, and incidents of compromise; Periodic use of Data Protection Impact Assessment (DPIA).
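
The Continuous Data Vulnerability Management control above calls for a data quality baseline and continuous anomaly detection. A minimal sketch of that idea, assuming a single scalar quality metric (e.g., a feed's daily null-rate) and a simple z-score test; production deployments would use dedicated data observability tooling rather than this toy check:

```python
import statistics

def build_baseline(metric_history):
    """Establish a data quality baseline from historical metric values
    (e.g., daily null-rate or duplicate-rate for a training data feed)."""
    return statistics.mean(metric_history), statistics.stdev(metric_history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a new observation that deviates more than `threshold`
    standard deviations from the baseline (a simple z-score test)."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Example: historical daily null-rates for a data feed, then a spike.
history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012, 0.010]
baseline = build_baseline(history)

print(is_anomalous(0.011, baseline))  # False: in-range observation
print(is_anomalous(0.080, baseline))  # True: spike worth investigating
```

A flagged observation would then feed the analysis step the control describes: determine cause and vector, assess operational and legal impact, and hand off to the Data Incident Response Plan where warranted.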

Foundational Controls

  • Data Storage Protections – Physical environment monitoring is consistent with established best practices and/or standards, such as NIST SP 800-53.
  • Data Threat Defenses – Implements and maintains processes and procedures to identify and mitigate internal (e.g., employees) and external (e.g., hackers) threats; Incorporates threat and vulnerability information from information sharing sources; Periodically reviewed and updated with reference to ISO/IEC 27001.
  • Data Provenance Protections – Employs blockchain-based security mechanisms; Where alternative security methods are used, sufficient documentation is available to demonstrate reasonableness of selection; Implements protection against use of unlicensed output data.
  • Secure Configuration for all Data Sources – All computing and data storage assets are aligned with security configurations in accordance with the organization’s IT policy.
  • Data Sources Boundary Defense – Data flow is continuously monitored; Only pre-approved external data sources are permitted access and use.
  • Controlled Access to Data Sources – Access is restricted to pre-approved data sources.
  • Audit and Control – Internal and external (supply chain) periodic audit of all Foundational Controls; Findings are regularly provided to senior management; Identified vulnerabilities are promptly dealt with and documented.
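
The Data Provenance Protections control above names blockchain-based mechanisms as one option. The core tamper-evidence property can be illustrated with a plain hash chain, stripped of the distributed-ledger machinery. This is an assumption-laden sketch (record fields and API are invented for illustration), not a recommended production design:

```python
import hashlib
import json
import time

def record_provenance(chain, dataset_id, source, action):
    """Append a tamper-evident provenance record. Each entry hashes its
    own content together with the previous entry's hash, so altering any
    earlier record invalidates every hash that follows."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {
        "dataset_id": dataset_id,
        "source": source,
        "action": action,
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(record)
    return record

def verify_chain(chain):
    """Recompute every hash; returns False if any record was altered."""
    prev_hash = "0" * 64
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev_hash:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True

chain = []
record_provenance(chain, "ds-001", "vendor-a", "ingested")
record_provenance(chain, "ds-001", "internal", "deduplicated")
assert verify_chain(chain)

chain[0]["source"] = "vendor-b"   # tampering with history...
assert not verify_chain(chain)    # ...is detected downstream
```

Where an organization selects an alternative mechanism, the documentation requirement in the control is what matters: the record should show why the chosen method is reasonable for the data at hand.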

Organizational Controls

  • Implement Data Stewardship Program – The program has formal, documented support from senior management and is subject to periodic review and audit.
  • Data Incident Response Management – Data incidents are managed through a formal, documented process that is reviewed by relevant members of senior management.
  • Fuzzing Tests and Other Red Team Exercises – Vulnerabilities in the Basic and Foundational Controls are periodically tested, documented, and reported to senior management.

The AI-DSF is designed to formalize the generation and protection of the high-quality data used for training algorithms. As a practical matter, the meaning of “high quality” varies with the intended end use; it means one thing in the context of a medical application used on humans and another in a marketing application designed to increase sales.

From a governance framework perspective, the term means that the data exhibits all, or at least a relevant set, of data qualities, or, as MIT researchers Richard Wang and Lisa Guarascio put it, “data dimensions.” Wang and Guarascio identified 20 data dimensions: Believability, value added, relevancy, accuracy, interpretability, ease of understanding, accessibility, objectivity, timeliness, completeness, traceability, reputation, representational consistency, cost effectiveness, ease of operation, variety of data & data sources, concise, access security, appropriate amount of data, and flexibility. (See Dimensions of Data Quality: Toward Quality Data by Design, 1991.) I don’t consider this an exhaustive list, and it may even benefit from other changes, such as adding “integrity,” “resilient,” and “hygienic.” But it is an important reference tool for determining what exactly we need to pay attention to in this framework.

Use Cases for the AI-DSF:

  • Service level agreements (SLA): An application developer could employ it to better draft the data supply agreement and the SLA with the data broker. This helps ensure the data delivery is relevant and responsive to the needs of the algorithm. Why is this important? Consider, for example, a scenario where the data broker delivered data that is accurate, concise, and free of data poisoning, but was not delivered on time. That’s a huge problem. Training an algorithm on outdated data can be devastating, so appropriate contractual provisions are vital to mitigate that risk.
  • Legal rights: Licensees typically rely on a representation and warranty by the licensor as to the latter’s rights in the data. The AI-DSF provides a better framework for ensuring the data provided is not tainted with legal problems such as infringement.
  • Compliance with privacy law: The privacy landscape is increasingly complicated. The absence of a federal law in the U.S. has driven more and more states to put privacy laws on the books. This is also a hot topic for regulatory bodies (e.g., the FTC), which remain vigilant and periodically sharpen their guidance and enforcement efforts. On the international front, regulations such as the General Data Protection Regulation (GDPR) create additional layers of complexity that must be complied with. In this setting, adhering to the AI-DSF becomes a useful method for helping mitigate the risk inherent in using data that might contain private information.

***Notes and Discussion***

  1. The AI-DSF design references: NIST Cybersecurity Framework; NIST Privacy Framework; NIST AI Risk Management Framework; CIS-20 Cybersecurity Controls; Information Commissioner’s Office AI Auditing Framework (draft guidance for consultation).
  2. Data Governance – There are two driving principles here: enable and demonstrate compliance with applicable legal, regulatory, and best practices. The latter can be of particular importance in the organization’s routine operations, such as responding to an RFP, complying with investor and other stakeholder demands. The legal and regulatory requirements vary depending on, for example, the company, the application, and the end-use.
  3. Blockchain-based security mechanisms are one way to secure data provenance. Where an organization chooses a different method to secure data provenance, it will likely be important to ensure that there is sufficient documentation that explains the rationale for its use. Ultimately, the question will be whether the selected method is reasonable considering all the surrounding circumstances that led to its selection and implementation.
  4. A DPIA is an essential tool for demonstrating that the organization is using legally reasonable means for risk assessment of data used by AI.
  5. The OECD Framework for the Classification of AI Systems addresses data and input. It has “Collection” criteria and “Rights & Identifiability” criteria. Missing from this framework is reference to the Basic Controls of the AI-DSF.
  6. The Data Incident Response Plan guides the response so as to minimize the potential for degraded data materially impacting the AI application(s). A data incident can be defined in a number of ways, for example: any unauthorized instance in which there is an attempt, whether successful or not, to access the data. Limiting the inquiry to whether or not there was access, as opposed to whether data was tampered with, is but one response option. Selecting it may be reasonable depending on the incident’s surrounding circumstances and the severity level that is assigned. Severity levels are typically assigned from level one to three. Level one is the most minor and is used where the incident can be dealt with internally and has no impact on the organization’s operation. Level three is the most severe: regulatory agencies and/or law enforcement need to be involved, multiple stakeholders are affected, and there is a material risk to the ongoing normal operation of the business.
  7. Supply chain members are subject to a meet-or-exceed standard that corresponds to the organization’s policies. This requires identifying specific requirements in the contract with the supplier. For example, if the organization uses SHA-256 for hashing, the supplier must use either the same or a stronger algorithm.
  8. Pre-approved data sources are those that are subject to contractual requirements with the data provider.
  9. Timely detecting data vulnerability plays an essential part in properly managing the organization’s risk. For example, demonstrating that your organization regularly looks for and effectively responds to indications of compromise can be valuable for minimizing/eliminating liability in the event of a regulatory inquiry.
  10. Data Provenance Protections – A key function is protection against use of unlicensed output data, which arises where an LLM is trained on the output of another LLM. While this practice may have some engineering upside in the form of reduced training time, the potential legal downside may be significant. From a copyright perspective, as long as such training falls under fair use, the question of liability for infringement can be set aside. But things get murkier from a contractual perspective, where the output data is protected by terms of use. In this setting, training on output data without permission can trigger a breach of contract claim.
  11. OpenAI has until April 30 to comply with the Italian regulator’s data privacy requirements. It is uncertain (to say the least) that OpenAI will be able to comply. France, Germany, Ireland, and Canada are also looking into OpenAI’s data collection and use practices. And there’s more trouble brewing, with the European Data Protection Board (EDPB) also setting its sights on OpenAI. All developers of generative AI apps need to pay close attention to these developments. An important takeaway here is for developers to make sure they comply with the AI Data Stewardship Framework or something similar. It may be the only way to satisfy data privacy legal requirements. Update 5-2-2023: TechCrunch reports that ChatGPT is back in Italy. OpenAI made changes to the way it presents its service to Italian users, including an age verification pop-up and alerting the user to a privacy policy with links to a help center. Will the Italy experience guide the way out of other EU/EDPB challenges to generative AI? Maybe. What is also of interest is how this all ties in with another topic I first wrote about in 2011 under the title Maximizing Representative Efficacy – Part I, and with its Part II, which came out shortly thereafter. Part II is relatively more relevant to this discussion, but Part I sets the stage, building on an excellent law review article by Richard Craswell, Taking Information Seriously: Misrepresentation and Nondisclosure in Contract Law and Elsewhere. Part II is particularly relevant here because it describes the role of AI apps in helping users understand terms of service. Considering the latest seismic changes in public awareness and apprehension of AI, the role of these apps (the good ones anyway) will be vital for making these services more accessible and compliant with what will likely be rigorous regulatory and enforcement efforts.
  12. Data Incident Response Plan – How does the organization respond to and contain an external attack vector that pollutes its LLM data set? The manner of response depends on, and is guided by, the degree of model adherence to AI Life Cycle Core Principle variables such as Resilience. For example, in a model with a high degree of Resilience, the organization may have more time to alert its end user base of the incident.
  13. Techniques such as Word Prediction Accuracy (WPA) provide an example of operationalizing part of the Maintenance, Monitoring, and Analysis of Data function.
  14. The Continuous Data Vulnerability Management function calls for the availability and implementation of a reliable data deletion methodology, such as that provided by Machine Unlearning (MU). For more on MU, see Machine Unlearning: its nature, scope, and importance for a “delete culture.”
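
The WPA technique mentioned in note 13 is, at its core, just hits over attempts: the fraction of next words a model predicts correctly over a reference text. A minimal sketch, assuming a callable model interface invented for illustration:

```python
def word_prediction_accuracy(model_predict, text):
    """Fraction of next-word predictions the model gets right over a
    reference text. `model_predict` is any callable mapping a list of
    context words to a single predicted next word (this interface is an
    assumption for illustration; WPA itself is just hits / attempts)."""
    words = text.split()
    attempts = len(words) - 1
    hits = 0
    for i in range(attempts):
        if model_predict(words[:i + 1]) == words[i + 1]:
            hits += 1
    return hits / attempts if attempts > 0 else 0.0

# Toy "model" that always predicts the word "the".
always_the = lambda context: "the"
print(word_prediction_accuracy(always_the, "over the hills and the sea"))
# 2 of the 5 next words are "the", so the score is 0.4
```

Tracked over time against a fixed reference set, a drop in a metric like this can signal that the underlying training data has degraded, which is the monitoring role the function contemplates.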