Survey of machine learning-based approaches to de-identification of medical documents

Protecting patient confidentiality while using medical data from electronic health records (EHRs) for research is crucial. De-identification of EHR data provides the opportunity to use the data for research without the risk of breaching patient privacy, and avoids the need for individual patient consent.

Summer of Research

Project by Vithya Yogarajan, supervised by Michael Mayo, University of Waikato.

De-identification problems can range from identifying a patient’s name in a discharge summary to identifying a patient’s city, occupation and even number of children. Medical text de-identification involves removing identifiable information about a patient from free-text documents. Because of the complexity of this task, automated de-identification is often not used, and organisations instead de-identify documents manually. This is a very time-consuming and expensive exercise.
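To make the task concrete, here is a minimal sketch of what removing identifiers from free text can look like. It uses a few hand-written regular expressions, not the machine learning systems surveyed in this project, and the pattern set, the placeholder labels and the `deidentify` function are all hypothetical names chosen for illustration. Real clinical notes need far broader coverage than this.

```python
import re

# Hypothetical patterns for illustration only; real systems cover many
# more PHI categories and far messier text than these rules handle.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"),
}


def deidentify(text: str) -> str:
    """Replace each matched identifier with a placeholder such as [NAME]."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Mr. Smith was discharged on 12/03/2014. Contact: 555-123-4567."
print(deidentify(note))
# prints: [NAME] was discharged on [DATE]. Contact: [PHONE].
```

Rules like these miss identifiers that have no fixed shape, such as an occupation or the number of children a patient has, which is one reason the problem is hard to automate.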

This project focused on a literature review of the most recent developments (since 2012) in automatic de-identification of long clinical narratives, specifically developments around the 2014 i2b2 UTHealth and 2016 CEGS N-GRID shared tasks on de-identification of longitudinal clinical narratives.

De-identification competitions such as those mentioned above provide great opportunities for researchers to advance work in this area. One of the major reasons these competitions are so widely recognised is the open availability of the data. Because the two datasets differ in form and content, both competitions led to very interesting developments in this area of research.

In general, teams found the 2016 dataset more difficult to de-identify than the 2014 dataset. In addition, the HIPAA PHI categories were easier to de-identify than the i2b2 PHI categories (for the 2014 data) or the N-GRID PHI categories (for the 2016 data).

One of the main concerns with these de-identification systems is how well they will perform on other datasets. Unfortunately, EHR data differs between organisations, and sometimes even between departments. Even so, the competition systems are a good starting point for building models that can be tuned to new data, and with more competitions and new data, systems can be retrained and better accuracy might be achieved.

These competitions have provided an excellent platform for developing and improving de-identification systems, but there is still much more that can be done in the de-identification space. The era of precision health requires a ‘precision driven de-identification application’, and that is exactly what Principal Researchers Junjae Lee, Dominic Yuen and Professor Thomas Lumley are working on.