Privacy and Synthetic Datasets


Sharing is a virtue, instilled in us from childhood. Unfortunately, when it comes to big data—i.e., databases possessing the potential to usher in a whole new world of scientific progress—the legal landscape is either too greedy or too laissez-faire. Either all identifiers must be stripped from the data, rendering it useless, or one-step-removed personally identifiable information may be shared freely, thereby freely sharing secrets. In part, this is a result of the historic solution to database privacy, anonymization, a subtractive technique incurring not only poor privacy results but also lackluster utility. In anonymization's stead, differential privacy arose; it provides better, near-perfect privacy, but is nonetheless subtractive in terms of utility.

Today, another solution is coming to the fore: synthetic data. Using the magic of machine learning, synthetic data offers a generative, additive approach—the creation of almost-but-not-quite replica data. In fact, as we recommend, synthetic data may be combined with differential privacy to achieve a best-of-both-worlds scenario. After unpacking the technical nuances of synthetic data, we analyze its legal implications, finding the familiar ambiguity—privacy statutes either overweight (i.e., inappropriately exclude data sharing) or downplay (i.e., inappropriately permit data sharing) the potential for synthetic data to leak secrets. We conclude by finding that synthetic data is a valid, privacy-conscious alternative to raw data, but not a cure-all. In the end, computer science progress must be met with sound policy in order to move the area of useful data dissemination forward.


Stanford University, Stanford, California
22 Stan. Tech. L. Rev. 1 (2019)