A Data Stewardship Framework for Generative AI
Data is the lifeblood of generative AI applications: these apps are ultimately only as good as the data they are trained on. It is therefore critical to have and maintain policies and procedures specifically designed to ensure that high-quality data is continuously provided. I refer to this overall effort as the AI Data Stewardship Framework (AI-DSF).
Below is a rough draft of what this framework looks like along with examples that describe how or why certain actions are relevant or important. This is currently a work in progress. I will periodically update this post, gradually building more detail into it.
Who is this for? The AI-DSF can be used by data brokers, data consumers, companies that build generative AI applications, and AI auditors.
The AI-DSF is composed of three groups of controls: Basic Controls, Foundational Controls, and Organizational Controls.
Basic Controls
- Data Governance – Captures relevant legal and regulatory requirements; coordinates and aligns all control functions; promotes consistent alignment with data dimension principles. (The term “data dimension” is discussed in more detail below.)
- Data Inventory Controls – Includes controls for data type, such as public, expert, or synthetic. Sets purpose, scope, roles, and responsibilities. Identifies: (i) categories of external data sources (e.g., service providers, partners, customers); (ii) categories of internal data sources (e.g., employees, prospective employees, contractors); (iii) the data storage environment (geographic location, internal, cloud, third party). Addresses the data lifecycle, including legal and other determinants. (See the inventory record sketch after this list.)
- Continuous Data Vulnerability Management (ties in with data observability practices) – A data quality baseline is set; proven anomaly detection tools are used continuously; detected data anomalies are analyzed to determine cause, vector, and operational impact (necessary for the Data Incident Response Plan); corrective action is implemented within reasonable timeframes. (See the anomaly-detection sketch after this list.)
- Data Incident Response Plan – A formal structure for planning and implementing an effective response to a detected data vulnerability.
- Secure Configuration for Data – Aligns with internal (e.g., IT policy) and external (e.g., supply chain management) policies and procedures.
- Maintenance, Monitoring, and Analysis of Data –
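To make the Data Inventory Controls item more concrete, here is a rough sketch of what a single inventory record might capture. The field names and category values are my own illustrative assumptions, not part of any standard:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative inventory record for a single training-data source.
# Field names and category values are hypothetical; adapt them to
# your own governance vocabulary.
@dataclass
class DataSourceRecord:
    name: str                      # e.g., "customer-support-transcripts"
    data_type: str                 # "public", "expert", or "synthetic"
    origin: str                    # "external" or "internal"
    source_category: str           # e.g., "service provider", "employees"
    storage_environment: str       # e.g., "cloud", "internal", "third party"
    geographic_location: str       # where the data physically resides
    lifecycle_stage: str           # e.g., "collected", "curated", "retired"
    legal_constraints: List[str] = field(default_factory=list)

record = DataSourceRecord(
    name="customer-support-transcripts",
    data_type="expert",
    origin="external",
    source_category="service provider",
    storage_environment="cloud",
    geographic_location="EU",
    lifecycle_stage="curated",
    legal_constraints=["GDPR"],
)
```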
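Similarly, here is a minimal sketch of the Continuous Data Vulnerability Management idea: establish a data quality baseline, then flag batches that deviate from it. I assume a simple z-score test on a per-batch null-value rate purely for illustration; in practice this is the job of proven data observability tooling:

```python
import statistics
from typing import List

# Hypothetical baseline: per-batch null-value rates observed during normal operation.
baseline_null_rates = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008]

def is_anomalous(batch_null_rate: float, baseline: List[float], threshold: float = 3.0) -> bool:
    """Flag a batch whose null-value rate deviates from the baseline by more
    than `threshold` standard deviations (a simple z-score check)."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return batch_null_rate != mean
    return abs(batch_null_rate - mean) / stdev > threshold

# A batch with a 6% null rate would be flagged and routed to the
# Data Incident Response Plan for cause and impact analysis.
print(is_anomalous(0.06, baseline_null_rates))  # True
```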
Foundational Controls
- Data Storage Protections – Physical environment monitoring consistent with established best practices and/or standards, such as NIST SP 800-53.
- Data Threat Defenses – Implements and maintains processes and procedures to identify and mitigate threats from internal sources (e.g., employees) and external sources (e.g., hackers); incorporates threat and vulnerability information from information-sharing sources; is periodically reviewed and updated with reference to ISO 27001.
- Data Provenance Protections
- Secure Configuration for all Data Sources
- Data Sources Boundary Defense
- Controlled Access to Data Sources
- Audit and Control (for the above)
Organizational Controls
- Implement Data Stewardship Program
- Data Incident Response Management
- Fuzzing Tests and Other Red Team Exercises (see the sketch below)
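As a rough illustration of a data-focused fuzzing exercise, the idea is to throw randomly corrupted records at the ingestion validation layer and confirm that malformed input is rejected rather than silently absorbed into the training set. The `validate_record` and `mutate` functions below are hypothetical stand-ins for whatever checks an organization actually runs:

```python
import random
import string

def validate_record(record: dict) -> bool:
    """Hypothetical ingestion check: a record must have a non-empty string
    'text' field and a numeric 'timestamp' field."""
    return (
        isinstance(record.get("text"), str)
        and record["text"].strip() != ""
        and isinstance(record.get("timestamp"), (int, float))
    )

def mutate(record: dict) -> dict:
    """Randomly corrupt one field to simulate malformed or hostile input."""
    corrupted = dict(record)
    key = random.choice(list(corrupted))
    corrupted[key] = random.choice(
        [None, "", 1e18, "".join(random.choices(string.printable, k=20))]
    )
    return corrupted

seed_record = {"text": "example training sentence", "timestamp": 1_700_000_000}
rejected = sum(not validate_record(mutate(seed_record)) for _ in range(1_000))
print(f"{rejected}/1000 mutated records rejected")
```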
The AI-DSF is designed to formalize the generation and protection of the high-quality data used to train algorithms. As a practical matter, the term “high quality” varies depending on the intended end use; it means one thing in the context of a medical application used on humans and another in a marketing application designed to increase sales.
From a governance framework perspective, the term means that the data exhibits all, or at least a relevant set, of data qualities, or, as MIT researchers Richard Wang and Lisa Guarascio put it, “data dimensions.” Wang and Guarascio identified 20: believability, value added, relevancy, accuracy, interpretability, ease of understanding, accessibility, objectivity, timeliness, completeness, traceability, reputation, representational consistency, cost effectiveness, ease of operation, variety of data & data sources, concise, access security, appropriate amount of data, and flexibility. (See Dimensions of Data Quality: Toward Quality Data by Design, 1991.) I don’t consider this an exhaustive list, and it may even benefit from other changes, such as adding “integrity,” “resilient,” and “hygienic.” But it is an important reference tool for determining what exactly we need to pay attention to in this framework.
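One way to turn that list into a working reference tool is a simple scorecard that records, per dataset, which dimensions have been assessed and how they scored. A minimal sketch follows; the dimension names track Wang and Guarascio, while the 1-5 scale is my own assumption:

```python
from typing import Dict, Optional

# The twenty data dimensions identified by Wang and Guarascio, plus the
# candidate additions mentioned above.
DIMENSIONS = [
    "believability", "value added", "relevancy", "accuracy", "interpretability",
    "ease of understanding", "accessibility", "objectivity", "timeliness",
    "completeness", "traceability", "reputation", "representational consistency",
    "cost effectiveness", "ease of operation", "variety of data & data sources",
    "concise", "access security", "appropriate amount of data", "flexibility",
    # Candidate additions discussed above:
    "integrity", "resilient", "hygienic",
]

def dimension_scorecard(scores: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Return a full scorecard over all dimensions, marking any dimension
    that has not yet been assessed as None. A 1-5 scale is assumed here;
    use whatever scale fits the intended end use."""
    return {dim: scores.get(dim) for dim in DIMENSIONS}

card = dimension_scorecard({"accuracy": 4, "timeliness": 2, "completeness": 5})
unassessed = [dim for dim, score in card.items() if score is None]
print(f"{len(unassessed)} of {len(DIMENSIONS)} dimensions not yet assessed")
```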
The uses for the AI-DSF are varied:
- In a supply chain management setting: An application developer could employ it to better draft the data supply agreement and the service level agreement (SLA) with the data broker. This helps ensure that data delivery is relevant and responsive to the needs of the algorithm. Why is this important? Consider, for example, a scenario in which the data broker delivered data that is accurate, concise, and free of data poisoning, but it was not delivered on time. That is a huge problem. Training an algorithm on outdated data can be devastating, so appropriate contractual provisions are vital to mitigate that risk. (See the sketch below.)
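To make the timeliness point concrete, a data consumer could automate a check of each delivery against the freshness and deadline terms negotiated in the SLA. The sketch below assumes hypothetical window values (24 hours for delivery, 30 days for record age), not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA terms: data must be delivered within 24 hours of the
# agreed schedule, and no record may be older than 30 days.
DELIVERY_WINDOW = timedelta(hours=24)
MAX_RECORD_AGE = timedelta(days=30)

def delivery_meets_sla(scheduled: datetime, delivered: datetime, oldest_record: datetime) -> bool:
    """Check a delivery against the timeliness terms of the SLA."""
    on_time = delivered - scheduled <= DELIVERY_WINDOW
    fresh = delivered - oldest_record <= MAX_RECORD_AGE
    return on_time and fresh

now = datetime.now(timezone.utc)
print(delivery_meets_sla(
    scheduled=now - timedelta(hours=30),      # delivery was due 30 hours ago
    delivered=now,                            # it arrived just now: late
    oldest_record=now - timedelta(days=10),   # the records themselves are fresh
))  # False: the late delivery breaches the SLA even though the data is fresh
```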
***Notes and Discussion***
- The AI-DSF design references the NIST Cybersecurity Framework and the CIS-20 Cybersecurity Controls.
- Data Governance – The legal and regulatory requirements vary depending on, for example, the company, the application, and the end-use.
- The OECD Framework for the Classification of AI Systems addresses data and input. It has “Collection” criteria and “Rights & Identifiability” criteria. Missing from this framework is reference to the Basic Controls of the AI-DSF.
- The Data Incident Response Plan guides the response so as to minimize the potential for degraded data materially impacting the AI application(s). A data incident can be defined in a number of ways, for example: any unauthorized instance in which there is an attempt, whether successful or not, to access the data. (A triage sketch based on this definition appears below.)
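Under that example definition, even a failed access attempt counts as an incident. Here is a minimal sketch of how an access-log event might be triaged against it; the event fields are hypothetical:

```python
def is_data_incident(event: dict) -> bool:
    """Classify an access-log event under the example definition above:
    any unauthorized access attempt, successful or not, is an incident."""
    return event.get("authorized") is False  # success or failure is irrelevant

events = [
    {"actor": "etl-service", "authorized": True,  "succeeded": True},
    {"actor": "unknown-ip",  "authorized": False, "succeeded": False},  # failed attempt: still an incident
    {"actor": "ex-employee", "authorized": False, "succeeded": True},
]
incidents = [e for e in events if is_data_incident(e)]
print(f"{len(incidents)} incidents detected")  # 2
```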