Why AI Cannot Forget Genomic Data

The right to be forgotten is becoming increasingly difficult to enforce for genomic data used to train Artificial Intelligence (AI) models. Once genomic data is used for training, it becomes embedded in the model’s parameters. As these systems learn, they can enable the re-identification of individuals and the inference of health risks.[1] Even where valid consent exists, a model’s future inferences may exceed the original scope of consent or persist after the data subject’s lifetime.

Genomic data is inherently familial, connecting an individual to a vast network of biological relatives. The inferential capabilities of AI enable the identification not only of data subjects but also, potentially, of family members who have neither been informed nor given consent. AI thus reframes the privacy concern from unauthorized access to ongoing predictive inference about individuals and their relatives. This means that traditional data protection law, built around individual control over retrievable records that can be located and erased, is mismatched with how AI actually processes genomic data. At the same time, machine unlearning, which aims to remove the influence of particular data points from a trained model, remains technically challenging.

The Legal Disconnect

The right to be forgotten was established by the Court of Justice of the European Union (CJEU) in its 2014 Google Spain v AEPD decision.[2] The right was later codified in Article 17 of the European Union General Data Protection Regulation (GDPR), which authorizes data subjects to request the deletion of their personal data under specific circumstances.[3] By contrast, the United States lacks a federal equivalent.[4] Instead, deletion rights are governed by a patchwork of state laws, such as the California Consumer Privacy Act (CCPA).[5]

The right to be forgotten turns on the threshold concept of personal data. Under the GDPR, this is defined broadly as any information relating to an identified or identifiable person, including genetic and pseudonymized data.[6] The CCPA adopts a similarly broad definition and extends to “abstract digital formats,” a category that captures AI systems capable of outputting personal information.[7] The CCPA further recognizes “unique identifiers” that identify a consumer or family, as well as “probabilistic identifiers” that link a consumer or device using specific data categories.[8] However, these frameworks primarily address outputs or stored records, not the internal architecture of AI models. It remains unclear how the right to be forgotten should apply to internal model parameters that encode the learned patterns enabling inference. For genomic AI, a model does not require access to a data subject’s stored DNA sequence; it only needs to have learned the relevant patterns to infer traits about the data subject or their family.

A recent case further illustrates this disconnect. In its 2025 EDPS v SRB decision, the CJEU held that data stripped of direct identifiers remains protected as personal data if re-identification is “reasonably likely,” a determination that depends on the technical, organizational, and legal measures available to the specific recipient.[9] However, this reasoning presumes a static dataset capable of re-identification, not a probabilistic model that generates inferences from learned patterns. Moreover, the right to be forgotten is framed as an individual right, yet genomic data is inherently shared. Consider siblings who share genomic data: if one consents to sharing while the other requests deletion, whose right prevails? Does one sibling’s erasure undermine the other’s autonomy? The problem compounds across the entire network of relatives connected to a single data point. This tension suggests the need to reconsider privacy rights at a collective or familial level. Because genomic data is a shared resource, a model trained on it functions as a population-wide asset, and individual removal may affect the broader data environment.

The Technical Reality: Machine Unlearning

A comprehensive solution for machine unlearning remains technically elusive. Existing methods face significant scalability limitations and are not yet suitable for routine deployment in large-scale production systems. The most direct approach, deleting the specific data and retraining the model from scratch, is computationally expensive. Ideally, an effective “unlearning” standard would balance three competing objectives: speed and cost, privacy, and fairness. Speed requires unlearning to be performed quickly and efficiently. Privacy demands erasure of the targeted data’s influence on model parameters. Fairness ensures that removing data from one group does not degrade predictions or utility for other groups.[10] Fairness is particularly challenging because models typically learn from groups and relationships rather than isolated individuals. Removing a single data point can distort the comparisons and fairness constraints in which that point played a role, so deleting one record may require updating parameters at the group or population level to restore balance within the model. These adjustments could introduce new biases or reduce overall model performance.[11]
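To make the speed-and-cost objective concrete, the sketch below contrasts full retraining with shard-based exact unlearning in the spirit of the SISA approach (training isolated models on disjoint shards so that a deletion requires retraining only one shard). Everything in it, from the synthetic dataset to the choice of logistic regression, is an illustrative assumption, not a description of any deployed genomic system.

```python
# Minimal sketch of shard-based exact unlearning (SISA-style).
# Assumptions: a synthetic dataset and scikit-learn's LogisticRegression
# stand in for a real genomic model; all numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_SHARDS = 4

# Toy data: 400 samples, 20 features, binary label.
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Partition the training set into disjoint shards; train one model per shard.
shards = np.array_split(rng.permutation(len(X)), N_SHARDS)
models = [LogisticRegression(max_iter=1000).fit(X[i], y[i]) for i in shards]

def predict(x):
    """Aggregate the per-shard models by majority vote."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(np.mean(votes) >= 0.5)

def unlearn(sample_idx):
    """Erase one record exactly: retrain only the shard that held it,
    rather than retraining on the full dataset."""
    for s, idx in enumerate(shards):
        if sample_idx in idx:
            shards[s] = idx[idx != sample_idx]
            models[s] = LogisticRegression(max_iter=1000).fit(
                X[shards[s]], y[shards[s]])
            return s
    raise ValueError("sample not found")

affected = unlearn(42)
print(f"retrained shard {affected}: {len(shards[affected])} records, "
      f"not the full {sum(len(s) for s in shards) + 1}")
print("prediction for a fresh sample:", predict(X[0]))
```

Even in this idealized setting, only the speed objective clearly improves: the deleted record’s influence is provably gone from its shard, but the vote of the remaining shards can shift predictions unevenly across groups, which is precisely the fairness tension flagged in the unlearning literature.[10]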

In the context of genomic data, even if an individual’s record were surgically removed, embedded familial patterns can persist within the model’s learned parameters. This challenge is further complicated by increasingly interdependent algorithmic supply chains. Major tech companies, such as Amazon, Microsoft, and Google, offer integrated “all-in-one” packages in which foundation models and downstream applications are tightly coupled.[12] In this setting, unlearning a single record does not guarantee the erasure of its patterns across the entire supply chain’s architecture. These technical limitations directly challenge the GDPR’s obligation to erase data “without undue delay” and the CCPA’s 45-to-90-day timeframe.[13]
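The familial persistence point lends itself to a toy simulation. In the hedged sketch below, one sibling in each pair is “erased” in the strongest possible sense, by retraining from scratch without their records, yet the model still predicts the erased siblings’ traits, and can even infer them from the remaining relatives’ genotypes. The kinship structure, effect sizes, and logistic model are all assumptions chosen for clarity, not a claim about any real genomic system.

```python
# Minimal sketch: familial signal survives even perfect unlearning.
# Assumptions: simulated sibling pairs with correlated genotypes and a
# simple additive trait model; all parameters are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_pairs, n_snps = 500, 50
effects = rng.normal(size=n_snps)  # per-SNP effect sizes

# Siblings inherit from the same parents, so their genotypes correlate.
parents = rng.binomial(2, 0.5, size=(n_pairs, n_snps))
sib_a = np.clip(parents + rng.integers(-1, 2, size=parents.shape), 0, 2)
sib_b = np.clip(parents + rng.integers(-1, 2, size=parents.shape), 0, 2)

def trait(genotypes):
    """Binary trait from an additive genetic score plus noise."""
    score = genotypes @ effects + rng.normal(scale=2.0, size=len(genotypes))
    return (score > effects.sum()).astype(int)

y_a, y_b = trait(sib_a), trait(sib_b)

# "Erase" every sibling A in the strongest sense: train from scratch on
# siblings B alone, so no trace of A's records ever enters the model.
model = LogisticRegression(max_iter=1000).fit(sib_b, y_b)

# The model still predicts the erased siblings' traits from their DNA...
print(f"accuracy on fully erased siblings: {model.score(sib_a, y_a):.2f}")

# ...and their traits can even be inferred from the remaining relatives,
# because correlated genotypes carry the same statistical signal.
kin_acc = (model.predict(sib_b) == y_a).mean()
print(f"inferring erased siblings' traits via relatives: {kin_acc:.2f}")
```

The point is not the specific numbers but the mechanism: once population-level genotype-to-trait patterns are learned from relatives, deleting any single record, however thoroughly, cannot revoke what the model can infer about that person.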

Looking Ahead

The convergence of genomic data and AI has exposed the limits of the right to be forgotten in both the legal framework and the technical reality. This highlights the need for a modernized legal framework that prioritizes collective rights and algorithmic accountability over purely individual control. As genomic AI moves toward breakthroughs that could save millions of lives, it sharpens the tension between the individual’s right to be forgotten and the collective right to health. The solution likely lies not in absolute erasure, but in governing how AI remembers, ensuring that the persistence of data serves the public good.

References

[1] Giacomo Nebbia et al., Re-identification of Patients from Imaging Features Extracted by Foundation Models (2025), https://www.nature.com/articles/s41746-025-01801-0; Artem Shmatko et al., Learning the Natural History of Human Disease with Generative Transformers (2025), https://www.nature.com/articles/s41586-025-09529-3; Rose Orenbuch et al., Proteome-wide Model for Human Disease Genetics (2025), https://www.nature.com/articles/s41588-025-02400-1.

[2] European Union, Right to be Forgotten on the Internet (2022), https://eur-lex.europa.eu/EN/legal-content/summary/right-to-be-forgotten-on-the-internet.html (ruling that a search engine operator has a responsibility to remove links to personal information from search results in specific circumstances). See also Case C-131/12, Google Spain SL and Google Inc. v Agencia Española de Protección de Datos (AEPD) and Mario Costeja González, Judgment of the Court (Grand Chamber), 13 May 2014.

[3] EU GDPR, Article 17.1(a)-(f) and Article 17.3 (stating the conditions for erasure under Article 17.1(a)-(f), such as where the data subject withdraws consent or the personal data have been unlawfully processed, and setting out circumstances in which the right to erasure does not apply because processing is necessary, such as for exercising the right of freedom of expression and information).

[4] A comprehensive federal privacy framework has not been adopted in the United States. Data protection is governed by sector-specific laws, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Genetic Information Nondiscrimination Act (GINA).

[5] CCPA, Section 1798.105.

[6] EU GDPR, Articles 4(1), 4(5), and 4(13).

[7] CCPA, Section 1798.140(v)(4)(C) (providing that personal information can exist in abstract digital formats, including compressed or encrypted files, metadata, or artificial intelligence systems that are capable of outputting personal information); see also California Assembly Bill No. 1008 (2024).

[8] CCPA, Sections 1798.140(x) and 1798.140(aj).

[9] European Data Protection Supervisor v Single Resolution Board (EDPS v SRB), Appeal, Case C-413/23 P (Sept. 4, 2025), https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:62023CJ0413.

[10] Alex Oesterling et al., Fair Machine Unlearning: Data Removal While Mitigating Disparities (2024), https://proceedings.mlr.press/v238/oesterling24a/oesterling24a.pdf.

[11] Id.; see also George-Octavian Barbulescu and Francois Buet-Golfouse, Unfair Unlearning? Accounting for Fairness in Machine Unlearning (2025), https://kdd2025.kdd.org/wp-content/uploads/2025/07/paper_23.pdf.

[12] Jennifer Cobbe, Understanding Accountability in Algorithmic Supply Chains (2023), https://arxiv.org/abs/2304.14749.

[13] Ben Wolford, Everything You Need to Know About the “Right to be Forgotten”, https://gdpr.eu/right-to-be-forgotten/ (stating that “undue delay” is generally understood to mean about a month); CCPA, Section 1798.130(a)(2)(A).