Machine Learning & EU Data Sharing Practices


Publish Date:
March 24, 2020
Publication Title:
Transatlantic Antitrust and IPR Developments
Stanford Law School
Newspaper/Magazine Article Issue 1/2020
  • Mauritz Kop, Machine Learning & EU Data Sharing Practices, Transatlantic Antitrust and IPR Developments, March 24, 2020.


This article connects the dots between intellectual property (IP) on data, data ownership and data protection (GDPR and FFD) in an easy-to-understand manner. It also provides AI & Data policy and regulatory recommendations to the EU legislature.

Data sharing is a prerequisite for a successful Transatlantic AI ecosystem. Hand-labelled, annotated training datasets (corpora) are a sine qua non for supervised machine learning. But what about intellectual property (IP) and data protection?

Data that represent IP subject matter are protected by IP rights. Augmented machine learning training datasets are awarded either a database right or a sui generis database right in Europe. Unlicensed (or uncleared) use of machine learning input data can result in an avalanche of copyright (reproduction right) and database right (extraction right) infringements.

The article offers three solutions that address the input (training) data copyright clearance problem and create breathing room for AI developers: the implementation of a broadly scoped, mandatory text and data mining (TDM) exception covering all types of data (including news media) in Europe, the Fair Learning principle in the United States, and the establishment of an online clearinghouse for machine learning training datasets. A right to machine legibility that drastically improves access to data would greatly benefit the growth of an AI ecosystem.

Introducing an absolute data property right or a (neighboring) data producer right for annotated machine learning training datasets or other classes of data is not opportune. Legislative gaps concerning ownership of data can be remedied by contracts. Implementing a sui generis system of protection for AI-generated Creations & Inventions is, in most industrial sectors, not necessary, since machines do not need incentives to create or invent. Where incentives are needed, IP alternatives exist.

Autonomously generated non-personal data should fall into the public domain. It should be open data, excluded from protection by the Database Directive (DD), the Copyright Directive (CDSM) and the Trade Secrets Directive (TSD).

Just as legal uncertainty about the patentability of AI systems is causing a shift towards trade secrets, legal uncertainty about the protection and exclusive use of machine generated databases is causing a similar shift. This general reliance on trade secrets to preserve competitive advantages creates a disincentive to disclose information and impedes data sharing. In an era of exponential innovation, it is urgent and opportune that the EU Commission reform the TSD, the CDSM and the DD with the data-driven economy in mind.

Informed IP policy seeks to compose a regime that balances underprotection and overprotection of IP rights per economic sector. Freedom of expression and information are core democratic values that should be internalized in our IP framework. The article argues that strengthening and articulating competition law is more opportune than extending IP rights.

More and more datasets consist of both personal and non-personal machine generated data. Both the General Data Protection Regulation (GDPR) and the Regulation on the free flow of non-personal data (FFD) apply to these ‘mixed datasets’. Based on these two Regulations, data can move freely within the European Union. The article contends that in some cases the GDPR creates market barriers for early-stage AI startups (SMEs). The GDPR also has some important advantages for European SMEs, since it is now the international data protection standard.

Besides the legal dimensions, the article describes the technical dimensions of data in machine learning. Most AI models need centralized data. Federated learning, in contrast, trains algorithms by bringing the code to the data, instead of bringing the data to the code. Data sharing is not required.
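The idea of bringing the code to the data can be sketched in a few lines of Python. What follows is only an illustrative toy under stated assumptions, not the method described in the article: it uses a simple weighted-averaging scheme (often called federated averaging) on a two-parameter linear model, and the names `Client`, `local_update`, `federated_average` and `train`, as well as the sample numbers, are hypothetical. Each client trains on its own records and returns only model weights and a sample count; the raw data never leave the client.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy model: y = slope * x + intercept (an assumption made for illustration).
Weights = Tuple[float, float]


@dataclass
class Client:
    """A data holder. Its xs and ys stay local and are never sent anywhere."""
    xs: List[float]
    ys: List[float]

    def local_update(self, weights: Weights, lr: float = 0.05,
                     steps: int = 20) -> Tuple[Weights, int]:
        """Run a few gradient-descent steps on local data, starting from the
        shared weights. Only the new weights and the sample count are returned."""
        w, b = weights
        n = len(self.xs)
        for _ in range(steps):
            # Gradients of the mean squared error on this client's data.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(self.xs, self.ys)) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in zip(self.xs, self.ys)) / n
            w -= lr * grad_w
            b -= lr * grad_b
        return (w, b), n


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine client weights, weighting each client by its number of samples."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)


def train(clients: List[Client], rounds: int = 100) -> Weights:
    """Coordinator loop: send weights out, collect updates, average them."""
    weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [client.local_update(weights) for client in clients]
        weights = federated_average(updates)
    return weights


# Two hypothetical data holders whose records follow y = 2x + 1.
clients = [
    Client(xs=[0.0, 1.0, 2.0], ys=[1.0, 3.0, 5.0]),
    Client(xs=[3.0, 4.0], ys=[7.0, 9.0]),
]
final_w, final_b = train(clients)
print(f"slope={final_w:.3f}, intercept={final_b:.3f}")
```

In this sketch the coordinator only ever sees model parameters, which is why the approach fits a setting where the underlying datasets cannot be pooled. Real deployments typically add safeguards such as secure aggregation, since model updates can themselves leak information about the data.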

Both data sharing practices and AI regulation are high on the EU Commission’s agenda. The article discusses, inter alia, the EC’s ‘White Paper On Artificial Intelligence – A European approach to excellence and trust’ and the ‘EU Data Strategy’.

Important European initiatives in the field of open data and data sharing are: the Support Centre for Data Sharing (focused on data sharing practices), the European Data Portal (EDP, data pooling per industry, i.e. sharing open datasets from the public sector), the Open Data Europe Portal (ODP, sharing data from European institutions) and the EU Blockchain Observatory and Forum.

Transformative technology is not a zero-sum game, but a win-win strategy that creates new value. When developing inclusive policies for transformative technology, the goal should be a Pareto optimum and, if possible, a Pareto improvement that increases overall prosperity.

Society should actively shape technology for good. The alternative is that other societies, with perhaps different social norms and democratic standards, impose their values on us through the design of their technology. With built-in public values, including Privacy by Design that safeguards data protection, data security and data access rights, the federated learning model is consistent with Human-Centered AI and the European Trustworthy AI paradigm.