De-identifying data – balancing utility vs privacy



In Aotearoa New Zealand, organisations have clear obligations when collecting, storing, using and disclosing personal information. Privacy is paramount – and that’s certainly the case for organisations in the health sector.


Data is also essential for making decisions that lead to the best outcomes for patients, though. So how can data be put to the best use while protecting the identity of the person it relates to?


That’s where an important process called ‘de-identification’ can help. When data is de-identified, it means that a person’s identity is no longer apparent or cannot be reasonably found out from the data in question. 


The flip side, however, is that de-identification can introduce bias if important data such as ethnicity is removed. De-identification therefore needs to be carried out in a way that preserves the information that can help deliver the best outcomes for all patients.


This was the goal behind a Precision Driven Health (PDH) research project: to create a tool that de-identifies text data in Electronic Health Records (EHRs) so it can be used for research, and to understand the implications of de-identification for the use of the resulting data.


Free-text sections of medical data – sometimes called unstructured data – are hard to de-identify because they often contain identifying information mixed in with other words and abbreviations.


This often means the data can’t be used to improve healthcare or for research purposes as it’s difficult to de-identify. Being able to take out identifiers means that data can be used to help benefit patients without risking their privacy.
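As a rough illustration of how identifiers can be stripped from free text, the sketch below uses simple rule-based pattern matching. The patterns are illustrative assumptions only, not the rules used by the PDH tool: an NHI-like token (three letters followed by four digits, a simplified form of New Zealand's National Health Index number) and a dd/mm/yyyy date.

```python
import re

# Simplified, illustrative patterns -- not the PDH tool's actual rules.
# Each match is replaced with a labelled placeholder so the surrounding
# clinical text stays usable for analysis.
PATTERNS = {
    "NHI": re.compile(r"\b[A-Z]{3}\d{4}\b"),     # e.g. ABC1234 (assumed format)
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def redact(note: str) -> str:
    """Replace matched identifiers with labelled placeholders."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

note = "Patient ABC1234 seen on 03/05/2021 for follow-up."
print(redact(note))  # Patient [NHI] seen on [DATE] for follow-up.
```

Real de-identification tools combine rules like these with machine-learned models, because names and other identifiers in free text rarely follow fixed patterns.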


While there are some software tools available for de-identification, they are developed overseas and do not specifically consider the uniqueness of New Zealand’s population.


Developing de-identification tools within New Zealand, using suitably diverse New Zealand data, ensures that the resulting models and rules are more relevant and accurate for our population with less data suppressed or undetected.


The tool developed through PDH research, De-identifier, guides users through the entire safe data process of taking 'raw' data and turning it into regulation-compliant, de-identified data, balancing utility and privacy to support better outcomes for patients.


Data de-identification requires a balancing act between utility and privacy. In the New Zealand context, the Māori population is large enough that ethnicity data, as usually recorded, doesn't need to be 'masked' – where selected information is concealed or encrypted to protect individual identities – to remain safe to use for analysis purposes.


Care needs to be taken when health providers and researchers want to analyse data from patients with factors that could make the data more identifiable, like rare conditions or less common ethnicities.  


Similarly, iwi-level data can make the data more identifiable due to smaller numbers in the groups defined by iwi variables, especially combinations of iwi identifiers.


Data from these smaller groups may become more easily re-identifiable, so may need to be aggregated or grouped during an analysis to preserve individuals’ privacy.
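One common way to do this grouping is a small-cell suppression rule in the style of k-anonymity: any category with fewer records than a minimum threshold is pooled into an "Other" group so that no group is small enough to single out individuals. The threshold and category labels below are illustrative assumptions, not rules taken from the PDH tool.

```python
from collections import Counter

def aggregate_small_groups(values, min_count=5):
    """Pool categories with fewer than min_count records into 'Other'.

    This is a k-anonymity-style sketch: the threshold of 5 is a common
    illustrative choice, not a prescribed standard.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

# Hypothetical ethnicity/iwi-style categories with one small group ("B").
records = ["A"] * 8 + ["B"] * 2 + ["C"] * 6
print(Counter(aggregate_small_groups(records)))
# Counter({'A': 8, 'C': 6, 'Other': 2})
```

The trade-off discussed above is visible here: the pooled "Other" group protects the two individuals in the small category, but the analysis loses the ability to report on that category specifically.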


Andrew Sporle – who teaches in Statistics at the University of Auckland, is a founding member of Te Mana Raraunga, the Māori Data Sovereignty Network, and a member of PDH's Independent Advisory Group – says that confidentiality is a key concern.

“De-identification work is a really key part of the picture for when we start to look at health data, and especially pathways of patients through health care. We’re beginning to get data systems that can do that, and Waitematā and the PDH partnership are leading that in New Zealand.”

Andrew Sporle