This article is the fourth in our "Trends in digital health" series.
Synthetic data itself is not a new concept (on some analyses, 2023 is its 30th birthday), but in the life sciences sector the last three years have seen rapid developments, suggesting that synthetic data will spur significant changes in how the industry undertakes core processes.
During the COVID-19 pandemic, when the process of running a clinical trial was turned on its head, the MHRA produced two synthetic datasets based on anonymised primary care data to support the response. More recently, Google has proposed a generative modelling framework called ‘EHR-Safe’, which is designed to produce synthetic data from electronic health records.
But how much traction does the implied promise of faster and better trials, greater development of international research projects, and enhanced pharmacoepidemiology actually have?
Below, we explore the facts behind the hype.
What is synthetic data?
Synthetic data can be described as ‘artificial’ data, generated from a real world dataset using a machine learning model that replicates the trends, patterns and statistical properties of that real world dataset. While the real world dataset in a health context will inevitably contain patients’ personal data, the equivalent synthetic data is designed to relate to an ‘artificial’ patient rather than an identifiable natural person, with the aim that it sits outside the scope of data protection laws.
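To make this definition concrete, the following is a minimal Python sketch of the idea rather than a production method. It assumes a deliberately simple model (a two-variable Gaussian) in place of the machine learning models used in practice, and the records, the choice of fields (age and systolic blood pressure) and the function names are all invented for illustration.

```python
import math
import random
import statistics

# Invented toy "real world" records: (age, systolic blood pressure).
real_records = [(34, 118), (45, 125), (52, 131), (61, 140),
                (29, 112), (70, 148), (48, 127), (57, 135)]

def fit_bivariate_gaussian(records):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then Pearson correlation.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in records) / (len(records) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "rho": cov / (sd_x * sd_y)}

def sample_synthetic(model, n, rng):
    """Draw n artificial (x, y) records from the fitted model."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))  # guard against rounding error
    synthetic = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        # Mixing z1 into y reproduces the correlation seen in the real data.
        y = model["mean_y"] + model["sd_y"] * (rho * z1 + residual * z2)
        synthetic.append((x, y))
    return synthetic

model = fit_bivariate_gaussian(real_records)
synthetic = sample_synthetic(model, n=1000, rng=random.Random(0))
print(f"real records: {len(real_records)}, synthetic records: {len(synthetic)}")
```

Because the sketch samples from the fitted model rather than copying rows, it can produce more records than the source dataset contains. A model this simple captures only the averages, the spread and a single linear relationship, so it illustrates the principle only and says nothing about how well a real generator would perform.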
The ICO, in Chapter 5 of its draft anonymisation, pseudonymisation and privacy enhancing technologies guidance, identifies the use of synthetic data as a privacy enhancing technology (PET). As the ICO explains, by using PETs, an organisation can help ensure compliance with the data minimisation principle and demonstrate data protection by design and default. The use of synthetic data therefore appears to be an attractive proposition for companies in the health tech space looking to work with large amounts of data, in particular companies developing AI systems which require vast amounts of training data.
What are the benefits?
There are a number of potential benefits to using synthetic data in place of real world data. These include:
- Reducing regulatory restrictions. It is often a challenge to pinpoint the exact stage at which the risk of re-identification of a dataset becomes low enough to argue that it has been effectively anonymised (this is something we have covered in depth previously; see our article on anonymisation of genetic data here and our detailed paper co-authored with Privitar here). To the extent that synthetic data relates to an artificial patient and not an identifiable natural person, it logically follows that such a dataset will be considered anonymised. This means that the dataset can be used without having to consider restrictions imposed through patient consents and the requirements of data protection law. It also allows researchers to avoid using the Data Security and Protection Toolkit and complying with the Caldicott principles, both of which are further obstacles to overcome when dealing with NHS data.
- Increasing the size of databases. Large synthetic datasets can be generated from smaller real world datasets by increasing the number of data points matching the real world trend. This has benefits in a number of areas, such as training AI healthcare tools, which require massive amounts of training data to ensure statistical accuracy and precision, and the development of orphan drugs, where there is a natural shortage of real world data. It also saves costs when compared with collecting real world data of the same magnitude. After all, running clinical trials isn’t cheap.
- Maintaining data utility. Anonymisation often involves removing certain useful data points in an attempt to avoid the risk of re-identification. It is also often a costly and time-intensive task and, at present, it is challenging to be certain that the threshold for successful anonymisation under data protection laws has been met. Synthetic data, on the other hand, allows key data points from the dataset to be preserved, meaning you are left with a far richer dataset than an equivalent anonymised real world dataset (but see our note below on re-identification risk).
What is holding the technology back at present?
- Personal data required for initial production of synthetic data. In order to produce a synthetic dataset, a real world dataset containing personal health data will have to be processed. Importantly, therefore, controllers cannot avoid their obligations to establish a valid legal basis, provide notice to patients, and meet the other relevant requirements under data protection law when producing the synthetic data.
- Potential for bias. The utility of the synthetic data is also reliant on the quality of the real world data and model used to replicate it. Use of a non-representative or biased real world dataset will inevitably result in a non-representative synthetic dataset. Likewise, the model used should be sufficiently trained and tested to ensure that it does not introduce biases. As the saying goes, you get out what you put in.
- Risk of re-identification. The more closely the synthetic data mimics the real world dataset, the higher the risk that an attack, such as a model inversion attack, might be used to re-identify the original patients. Where the synthetic data mimics the real world dataset too closely and outlying data points can be inferred to have originated from specific patients, the synthetic data might even be considered personal data. Chapter 5 of the ICO’s draft guidance highlights the risk of such attacks and suggests that “[u]sing differential privacy with synthetic data can protect any outlier records from linkage attacks”, but of course this comes at the expense of dataset utility.
- Further development required. Despite the hype, synthetic data is not yet a mature technology. While modelling frameworks such as EHR-Safe have been proposed, there is no tried and tested model available which organisations can rely on to produce a good output dataset. Until models have been further tested, the risk of bias and re-identification remains a concern for many, particularly in the public sector.
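The trade-off between re-identification risk and utility noted in the list above can be illustrated with a short Python sketch. It assumes one simple way of combining differential privacy with synthetic data: adding Laplace noise to a histogram of predefined age bands and then sampling synthetic records from the noisy counts. The age bands, counts, epsilon values and function names are hypothetical, each patient is assumed to contribute exactly one record, and the sketch is not a hardened implementation.

```python
import random

# Bands are fixed in advance rather than derived from the data (an assumption
# of this sketch; data-dependent bands would themselves leak information).
AGE_BANDS = ["18-39", "40-59", "60-79", "80+"]

# Invented source data: one record per patient, with a single outlier in "80+".
real_bands = ["18-39"] * 40 + ["40-59"] * 35 + ["60-79"] * 24 + ["80+"] * 1

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_histogram(records, epsilon, rng):
    """Count records per band and add Laplace noise with scale 1 / epsilon.

    Adding or removing one patient changes one count by 1, so the
    sensitivity is taken to be 1. Negative noisy counts are clamped to zero.
    """
    counts = {band: 0 for band in AGE_BANDS}
    for band in records:
        counts[band] += 1
    scale = 1.0 / epsilon
    return {band: max(0.0, count + laplace_noise(scale, rng))
            for band, count in counts.items()}

def sample_synthetic(noisy_counts, n, rng):
    """Sample n synthetic band labels in proportion to the noisy counts."""
    bands = list(noisy_counts)
    weights = [noisy_counts[band] for band in bands]
    if sum(weights) == 0:
        weights = [1.0] * len(bands)  # fall back to a uniform draw
    return rng.choices(bands, weights=weights, k=n)

rng = random.Random(42)
noisy = noisy_histogram(real_bands, epsilon=1.0, rng=rng)
synthetic_bands = sample_synthetic(noisy, n=200, rng=rng)

# Smaller epsilon means more noise: stronger protection, lower fidelity.
for eps in (0.1, 1.0, 10.0):
    counts = noisy_histogram(real_bands, epsilon=eps, rng=rng)
    print(eps, {band: round(value, 1) for band, value in counts.items()})
```

With a small epsilon, the noise is large relative to the single record in the oldest band, so the synthetic output reveals little about whether that patient was present, but the overall age distribution is also reproduced less faithfully. With a large epsilon the reverse holds.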
A thought to leave you with
The question people are asking about synthetic data is when it will take off, rather than if. In our view, an initial surge in its use in the next couple of years is plausible in the health tech sector, and it is certainly “one to watch” for the future. There are still a number of challenges to overcome before the technology matures, however, so don’t go tearing up the Record of Processing and Privacy Notice just yet…
 In the context of privacy-preserving statistical analysis, the idea of synthetic data was proposed by Rubin in 1993 (see Rubin, D.B. (1993), Discussion: Statistical Disclosure Limitation, Journal of Official Statistics, 9(2), 461–468).
See the ICO’s enforcement action against the Royal Free for an example of where the use of synthetic data in the development and testing of a digital health app would have removed the need to comply with GDPR obligations.