
Synthetic data – is AI’s golden child failing to live up to its promise?

This article is part of our Biotech Review of the Year - Issue 11 publication

The creation and use of synthetic data holds great potential for organisations in the life sciences sector.

It promises availability of datasets where there is a natural shortage of real world data, easier access to datasets which can catalyse research, and enhanced protection of patient confidentiality and privacy.1 The hype has coincided with the proliferation of generative AI, which makes generating high fidelity synthetic data more achievable than it was previously.

Despite this, the uptake of synthetic data in the sector has been slower than many expected. A key reason for this (and a surprise to some) is concerns about data protection. A lack of clear guidance from data protection regulators in both the UK and EU has left organisations to their own devices when it comes to assessing whether synthetic data can be treated as anonymised in any given scenario. In this article, we review the current data protection landscape and consider what organisations can do to try to mitigate any privacy risks of using synthetic data.

Generating artificial patients

While the term ‘synthetic data’ does not have a universally accepted definition, it is generally recognised to encompass artificial data generated through a broad range of methods and technologies, ranging from manual production to iterative manipulation of real world data. The current focus tends to be on artificial data generated from real world datasets using an AI model that replicates the trends, patterns and data points of those datasets. For example, you may have come across a site that uses generative adversarial networks to generate a very believable image of the face of a person who is not real.

In a health context, the input dataset will generally contain patients’ personal data. The output data, on the other hand, is designed to relate to a fictitious patient, that is, to be ‘synthetic’. For example, Google has proposed a specific generative modelling framework called EHR-Safe. This proposes to produce synthetic data from patient electronic health records. 
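To make the idea concrete, here is a deliberately simplified sketch of model-based synthetic data generation. It is not Google's EHR-Safe (whose architecture is far more sophisticated); it simply fits a basic statistical model to a toy "real" patient dataset and samples artificial patients that follow the same trends. All values are illustrative assumptions.

```python
# Illustrative only: fit a multivariate normal to a toy "real" dataset,
# then sample artificial "patients" that mimic its trends and correlations.
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy "real" dataset: age and systolic blood pressure for 5 patients.
real = np.array([
    [54, 132],
    [61, 141],
    [47, 125],
    [70, 150],
    [58, 138],
], dtype=float)

# Capture the means and the correlation structure of the real data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample 100 synthetic "patients" from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=100)

print(synthetic.shape)         # (100, 2)
print(synthetic.mean(axis=0))  # close to the real-data means
```

Real systems use far richer generative models (GANs, diffusion models, transformers), but the core move is the same: learn the distribution of the real data, then sample fictitious records from it rather than copying real ones.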

From training AI to treating rare diseases

The benefits of using synthetic data go beyond simply taking the dataset outside the scope of data protection laws. In the life sciences sector, they include:

1. Developing AI training datasets:
Large synthetic datasets can be generated from smaller real world datasets by generating additional data points matching real world trends. Larger datasets are beneficial when looking to train and fine-tune new AI healthcare tools, since massive amounts of training data are required to ensure statistical accuracy and precision.

2. Treating rare diseases: 
In some circumstances, there is a natural shortage of real world data which can make obtaining statistically significant research findings difficult. For example, in the development of medicines used to treat rare diseases there is often a very limited pool of patients that can be enrolled in a clinical trial. To increase the number of patients, synthetic data could be used to effectively enrol artificial patients into the trial. This could cut down the costs associated with enrolling real world patients.

3. Populating control arm groups: 
In some limited instances, a synthetic control arm can be used in clinical trials where the current treatment is well established and the progression of the disease is relatively predictable. In these cases, there is often a wealth of real world data that can be taken from previous trials and medical records and used to generate a synthetic dataset for the control arm. This approach has been recognised and considered by the FDA in a draft guidance document published in February 2023 in the context of designing externally controlled clinical trials.3 Again, there is a cost saving benefit here, particularly where the current treatment is expensive.

It’s not personal – or is it?

In the UK, the Information Commissioner’s Office (ICO) identifies the use of synthetic data as a privacy enhancing technology (PET) in its recently published guidance on PETs. As the ICO explains, by using PETs, an organisation can help ensure compliance with the data minimisation principle and demonstrate data protection by design and default. The use of synthetic data therefore appears to be an attractive proposal for life sciences organisations looking to work with large amounts of patient data. One of the promised benefits of synthetic data was that it would take the dataset outside the scope of data protection laws. However, the extent to which this is the case is currently unsettled.

The key question is whether the synthetic data relates to an identified or identifiable individual. If it does not, it sits outside the scope of data protection laws and the dataset can be used without having to consider a lawful basis for processing, data subject transparency, implementing data processing provisions in contracts or international transfer provisions where the research involves cross-border teams based outside the EEA. It also allows researchers to avoid using the Data Security and Protection Toolkit and complying with the Caldicott principles, which are all further factors that need addressing when dealing with NHS Data.

The ICO sees synthetic data serving as a proxy for the original real world data as a key risk. The closer the synthetic data mimics the real world dataset, the higher the risk of an attack, such as a model inversion attack, attempting to re-identify the original patient’s personal data. Where the synthetic data mimics the real world dataset too closely and outlying data points can be inferred to have originated from specific patients, the synthetic data will likely be considered personal data. 
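One heuristic sometimes used to probe the "proxy" risk the ICO describes is to measure how close each synthetic record sits to its nearest real record: synthetic points that near-duplicate a real patient's record are the ones most likely to leak information. The sketch below is a hedged illustration; the closeness threshold is an assumption for demonstration, not a regulatory standard or an ICO-endorsed test.

```python
# Illustrative closeness check: for each synthetic record, find the
# Euclidean distance to the nearest real record and flag near-duplicates.
import numpy as np

def min_distances(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """For each synthetic row, distance to the closest real row."""
    # Pairwise distances via broadcasting: shape (n_synthetic, n_real).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

real = np.array([[54.0, 132.0], [61.0, 141.0], [70.0, 150.0]])
synthetic = np.array([
    [54.0, 132.0],   # exact copy of a real patient's record: risky
    [40.0, 118.0],   # far from every real record
])

d = min_distances(synthetic, real)
flagged = d < 1.0  # illustrative threshold, not a legal standard
print(flagged)     # [ True False]
```

Such checks only address one attack vector; they do not by themselves establish that a dataset is anonymised.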

The greatest challenge is to pinpoint the exact threshold at which the risk of re-identification of a dataset reaches a sufficiently low level to argue that it has been effectively anonymised (this is something we have covered in depth previously; see our article on anonymisation of genetic data here and our detailed paper co-authored with Privitar here). The ICO’s draft guidance on anonymisation and pseudonymisation, first published in 2022, contains some help on how to go about assessing re-identifiability, but it remains high level and too legalistic to offer practical advice to life sciences organisations. At the time of writing, the ICO has paused its review of the guidance until 2024,4 leaving organisations in limbo on the extent to which the draft guidance can be relied on. What organisations really need from the guidance is technical advice setting out the practical measures they can take to ensure synthetic data remains anonymous. 

The EU is taking a similar hands-off approach. The European Data Protection Board (EDPB)’s most recent guidance on the topic was published in 20145 and, while publishing updated guidance has been on the EDPB’s workplan since 2021-2022, we are yet to see any draft guidance. We understand that the outcome of the appeal by the European Data Protection Supervisor (EDPS, a regulatory body that monitors EU institutions’ compliance with data protection laws) in case T-557/20 (a helpful and pragmatic decision on assessing identifiability; see our summary of the decision here) may be the reason for the delay in the publication of the EDPB’s updated guidance.6 It is to be hoped that the fact that the case has been appealed by the EDPS does not mean that the EDPB intends to take a stricter approach than the pragmatic one adopted by the judges in that case. 

Absent any express remit for data protection regulators to do more than provide guidance for interested parties, there are few affordable sources of the technical expertise needed to assess re-identifiability, particularly for organisations with small budgets such as small biotech companies. The UK Anonymisation Network, established through ICO funding to advance anonymisation best practices, does offer consultancy services to organisations to assess the risk associated with a dataset. However, at the time of writing, the indicative price range for a risk assessment was £6,000-£20,000. The Office for National Statistics has also published a working paper which considers re-identification risk and analytic value, but the guidance lacks the detail needed to allow organisations to accurately determine where their synthetic data sits on the spectrum.

What about ‘accidental matches’?

To the extent that synthetic data is generated based on real world data, there is a statistical probability (albeit a very low one) that a synthetic record accidentally ‘matches’ the personal data of a real individual whose data was not part of the real world input dataset. We think this risk is a red herring, resting as it does on faulty logic. Films, for example, sometimes carry a disclaimer to the effect that any resemblance to a real person is coincidental and not intended by the filmmakers. 

Advice provided by the ICO Innovation Hub in response to a question from a public body in the health sector suggests that it is possible that the ICO might also consider accidentally matched synthetic data as personal data, unless the number of matching individuals was so high that it was not possible to identify one individual. We do not think this approach is correct, given that such information does not relate to an individual. It would also be very challenging for any organisation to assess and mitigate this risk. 

Perfection should not be the enemy of the good

From a pure data protection law perspective, there is of course nothing to stop organisations in the sector from moving forward with their synthetic data plans whilst guidance and case law remain unsettled, unsatisfactory though that state of affairs undoubtedly is. There are a number of safeguards that organisations can implement to reduce the privacy risks in the absence of clear guidance. 

These come, conceptually speaking, in the usual three forms – technical, governance and contractual measures. An example of the first would be to introduce noise to the dataset, to make it more challenging to re-identify a patient in cases where the originating dataset includes real patients. The challenge is usually to preserve enough utility in the noisy data that the exercise is not self-defeating. An example encompassing all three measures, which we and some others would in many cases consider overkill but which would be viewed favourably from a data protection perspective, would be to allow access only via a secure data environment. In a secure data environment, synthetic data is made available only to approved organisations, only on a secure platform (sometimes referred to as a trusted research environment) and without the ability to download the data. Such environments also involve vetting of the party seeking access to further reduce the risk of a bad actor obtaining access.
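A minimal sketch of the "introduce noise" measure, in the style of the Laplace mechanism from differential privacy, is shown below. The sensitivity and epsilon values are illustrative assumptions, not recommendations: choosing them is precisely the utility-versus-privacy trade-off described above.

```python
# Illustrative noise injection: perturb each numeric value with Laplace
# noise scaled to sensitivity/epsilon (the differential-privacy convention).
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_noise(values: np.ndarray, sensitivity: float,
                  epsilon: float) -> np.ndarray:
    """Add Laplace noise with scale = sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)

ages = np.array([54.0, 61.0, 47.0, 70.0, 58.0])
noisy = laplace_noise(ages, sensitivity=1.0, epsilon=0.5)

# Utility check: individual values are perturbed, but the aggregate
# trend (here, the mean age) broadly survives.
print(round(ages.mean(), 1), round(noisy.mean(), 1))
```

Smaller epsilon means more noise and stronger privacy protection, but lower utility; tuning it for a real dataset is a task for technical specialists, which is why clearer regulatory guidance on acceptable thresholds would be so valuable.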

The excitement around synthetic data and its potential uses has outlasted the initial hype. Clear guidance from data protection regulators around its use would obviously be welcome, but that seems unlikely in the near future, for the reasons we have described. Instead, there is an opportunity for the sector to take the initiative and make the case as to how we can all come to live with synthetic data.

