Automated De-identification


Principal Researchers

  • Junjae Lee, Orion Health
  • Dominic Yuen, Orion Health
  • Professor Thomas Lumley, University of Auckland

Timeline

October 2017 – January 2019

Separating patients from their records

The idea that a patient’s Electronic Health Record (EHR) has a secondary value, beyond the immediate treatment of a single individual, is becoming more prevalent as medical research incorporates machine learning (ML) practices. The information an EHR contains, such as symptoms, treatment plans and response to medication, can be extracted using ML techniques to create a fuller – and therefore more useful – knowledge base about a particular health condition. As analytical and ML-based research starts to produce valuable predictive models that offer deeper insights into patient healthcare, the need for a wider range and a greater volume of raw data increases.

While EHRs have been around for decades, a major barrier to their secondary use is gaining consent to use the data, especially from large numbers of patients. Gaining consent takes considerable time and effort, which is why the idea of not having to gain it is so attractive. The only way to remove the necessity for consent is to ensure that data is de-identified in such a way that it can’t be associated with an individual. The popular de-identification method in use today is HIPAA Safe Harbor [45 CFR 164.514(b)(2)], but it is considered a ‘one size fits all’ approach, and there is some discussion about whether the resulting data is ‘de-identified enough’ and whether the method produces useful data in all cases.
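
To make the ‘one size fits all’ point concrete, the sketch below applies a few Safe Harbor-style rules to a single structured record: direct identifiers are dropped, a date is reduced to its year, a ZIP code is cut to its first three digits and ages over 89 are grouped. It is only an illustration, not the researchers’ tool. The field names and the function name are hypothetical, and only a handful of the 18 identifier categories in the rule are covered.

```python
from datetime import date

# Hypothetical field names; a real EHR schema will differ.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "medical_record_number"}


def safe_harbor_style(record: dict) -> dict:
    """Apply a few Safe Harbor-style rules to one structured record."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # direct identifiers are removed outright
        if field == "admission_date":
            out["admission_year"] = value.year  # keep only the year of a date
        elif field == "zip_code":
            # Keep the first three digits. The rule's special case for
            # sparsely populated areas is omitted from this sketch.
            out["zip3"] = value[:3]
        elif field == "age":
            out["age"] = "90+" if value >= 90 else value  # group ages over 89
        else:
            out[field] = value  # everything else passes through unchanged
    return out


record = {
    "name": "Jane Example",
    "zip_code": "90210",
    "admission_date": date(2018, 3, 14),
    "age": 93,
    "diagnosis": "asthma",
}
deidentified = safe_harbor_style(record)
```

The same fixed rules are applied whoever receives the data and whatever they plan to do with it, which is the limitation the discussion above refers to.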

The era of precision-driven health requires a ‘precision-driven de-identification application’, and that is exactly what Principal Researchers Junjae Lee, Dominic Yuen and Professor Thomas Lumley are working on. It will be a framework that guides users through the entire de-identification process, starting with the raw data, to produce the optimal set of usable data for the context of the data recipient. Both those in charge of health data and those who want access to it will benefit from this application. It will assist data managers, holders and suppliers who aren’t familiar with de-identification techniques. It will also show researchers that by following privacy rules and strengthening their security practices, they can get access to more data, which in turn allows them to build more accurate predictive models. The application may also make it possible for datasets to be re-used for other research projects, subject to the appropriate ethics approvals.

Once the baseline de-identification tool is established for ‘structured data’ in EHRs, the researchers will look to other areas such as ‘HL7 messaging’ – the international standards for the transfer of clinical and administrative data between software applications. Good anonymisation has often been sought for HL7 messaging but has not yet been achieved.