Application of deep learning techniques in a de-identification system
Clinical data, collected by healthcare providers when treating patients, is incredibly valuable for medical research – but good data is hard to get.

Summer of Research
Project by Yicheng Shi, University of Auckland, supervised by Quentin Thurier (Orion Health).
This information can help researchers improve healthcare. For example, by studying a patient’s medical history, researchers can analyse the cause of a disease and evaluate whether the treatment worked.
Most clinical data comes from electronic health records containing a patient’s information such as clinical notes, discharge summaries, nursing progress notes and lab reports.
But these records also contain patients’ names, contact details and other personal information. Patients have a right to privacy, and it is essential that health records used for research are anonymous. Each record must be de-identified before anyone other than the patient’s own healthcare team sees it.
Health records are usually stored as blocks of free text that vary in format, so each one must be individually analysed and de-identified. Because of this, hospitals and clinics rarely release such information.
Yicheng Shi has built a deep learning algorithm using a natural language processing technique called named entity recognition (NER) to find personal details in health records and mark them for removal.
The de-identification model uses context to tell whether a number such as “11/12” is a date of treatment – which is personal information – or a test result that could be important to research. It can also identify different abbreviations and ways of recording personal information.
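As a rough illustration of what "marking for removal" can look like, the sketch below assumes the model outputs one tag per token in the common BIO scheme (B- starts an entity, I- continues it, O is ordinary text). The article does not say which tagging scheme the project used, and the tags here are written by hand rather than produced by a model; the function name `mask_tokens` is hypothetical.

```python
def mask_tokens(tokens, tags):
    """Replace each tagged entity with a single [LABEL] marker."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    masked = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Ordinary text is kept unchanged.
            masked.append(token)
        elif tag.startswith("B-"):
            # First token of an entity: emit one marker for the whole entity.
            masked.append(f"[{tag[2:]}]")
        # Tokens tagged I- continue the previous entity and are dropped.
    return masked


# Hand-written example: the same string "11/12" is a date once and a score once.
tokens = ["Dr", "Jane", "Doe", "on", "11/12", "scored", "11/12"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-DATE", "O", "O"]
print(" ".join(mask_tokens(tokens, tags)))
```

Only the first "11/12" is masked, because only that token carries a DATE tag. Deciding which tag each token gets is the job of the trained model.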
Yicheng used three datasets: MIMIC III, the 2014 i2b2/UTHealth NLP dataset, and a New Zealand clinical dataset collected by University of Auckland student Nick James. He originally trained the algorithm using the New Zealand dataset, which had personal information included, but it was too small for accurate training.
The MIMIC III dataset is much larger but is already de-identified. Yicheng inserted fake, but realistic, personal information into the MIMIC III health records and used this dataset to train the algorithm.
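The article does not describe how the fake details were inserted. One plausible approach, sketched below under stated assumptions, is to swap each de-identification placeholder for a random surrogate and record where it landed, which yields labelled training examples. The placeholder format `[**LABEL**]`, the surrogate lists and the function name `inject_surrogates` are all simplified, hypothetical choices made for illustration.

```python
import random
import re

# Hypothetical surrogate values; a real system would draw from far larger lists.
SURROGATES = {
    "NAME": ["Aroha Ngata", "James Wilson", "Mei Chen"],
    "DATE": ["11/12/2015", "3 March 2016", "2016-07-21"],
    "HOSPITAL": ["North Clinic", "Harbour Hospital"],
}

# Assumed, simplified placeholder format such as [**NAME**] or [**DATE**].
PLACEHOLDER = re.compile(r"\[\*\*(\w+)\*\*\]")


def inject_surrogates(text, rng):
    """Return (new_text, spans); each span is (start, end, label) in new_text."""
    parts = []
    spans = []
    cursor = 0  # how much of the original text has been copied
    length = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(text):
        label = match.group(1)
        if label not in SURROGATES:
            continue  # leave unknown placeholders untouched
        before = text[cursor:match.start()]
        value = rng.choice(SURROGATES[label])
        parts.extend([before, value])
        start = length + len(before)
        spans.append((start, start + len(value), label))
        length = start + len(value)
        cursor = match.end()
    parts.append(text[cursor:])
    return "".join(parts), spans


rng = random.Random(0)  # seeded so the example is repeatable
note, spans = inject_surrogates("Patient [**NAME**] seen on [**DATE**].", rng)
print(note)
print(spans)
```

The recorded spans play the role of the gold labels that a real annotated dataset would provide.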
The i2b2 dataset was used to test the model.
The model breaks the text into chunks and analyses the meanings of words, given their context. Using word embedding, each word is given a vector that places it on a map where words with similar meanings are close by.
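The "map" idea can be shown with a toy example. The three-dimensional vectors below are hand-picked for illustration and are not taken from the project; real embeddings are learned from large amounts of text and usually have hundreds of dimensions. Closeness is measured here with cosine similarity, one common choice.

```python
import math

# Hypothetical, hand-picked vectors; real embeddings are learned, not written by hand.
EMBEDDINGS = {
    "doctor": [0.9, 0.1, 0.0],
    "nurse": [0.8, 0.2, 0.1],
    "monday": [0.0, 0.9, 0.3],
    "tuesday": [0.1, 0.8, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    target = EMBEDDINGS[word]
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine_similarity(target, EMBEDDINGS[w]))


print(nearest("doctor"))  # nurse
print(nearest("monday"))  # tuesday
```

With these toy vectors, job titles sit near each other and weekdays sit near each other, which is the property the model relies on when it meets words used in similar contexts.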
The algorithm identifies different types of personal information with high accuracy in the test dataset, but Yicheng says that accuracy would likely decrease in the real world where the model would encounter a lot of unknown words and have to guess their meaning.
The model could be improved by adopting a multi-layered neural network that analyses the characters in each word. This may identify personal information more accurately by evaluating words’ prefixes and suffixes.
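To give a feel for the kind of character-level signal involved, the sketch below extracts prefixes, suffixes and a simple "word shape" by hand. This is a much simpler stand-in, not the neural network proposed above, which would learn such patterns itself. The function names are hypothetical.

```python
def affix_features(word, max_len=3):
    """Collect lower-cased prefixes and suffixes of up to `max_len` characters."""
    features = {}
    for n in range(1, min(max_len, len(word)) + 1):
        features[f"prefix{n}"] = word[:n].lower()
        features[f"suffix{n}"] = word[-n:].lower()
    return features


def word_shape(word):
    """Map upper case to X, lower case to x, digits to d; keep other characters."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


print(affix_features("Johnson"))
print(word_shape("Johnson"), word_shape("11/12"))
```

Features like the suffix "son" or the shape "dd/dd" remain available even for a word the model has never seen, which is why character-level analysis can help with unknown words.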
It could also be made more useful by automatically replacing personal information with realistic but fake information, so health records look normal for research purposes.
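A minimal sketch of such a replacement step follows. It assumes the detected entities arrive as (start, end, label) character spans, and it maps each distinct original value to the same surrogate throughout a record so the text stays internally consistent. The span format, the surrogate list and the function name `replace_entities` are assumptions, not details from the project.

```python
import itertools

# Hypothetical surrogate names for illustration only.
FAKE_VALUES = {"NAME": ["Alex Taylor", "Sam Brown", "Jordan Lee"]}


def replace_entities(text, spans, fake_values):
    """Swap each (start, end, label) span for a consistent fake value.

    Spans are assumed not to overlap.
    """
    pools = {label: itertools.cycle(vals) for label, vals in fake_values.items()}
    mapping = {}
    # Assign surrogates in reading order so repeated mentions share one value.
    for start, end, label in sorted(spans):
        key = (label, text[start:end].lower())
        if key not in mapping:
            mapping[key] = next(pools[label])
    # Substitute from right to left so earlier offsets remain valid.
    for start, end, label in sorted(spans, reverse=True):
        key = (label, text[start:end].lower())
        text = text[:start] + mapping[key] + text[end:]
    return text


note = "John Smith was admitted. John Smith improved."
spans = [(0, 10, "NAME"), (25, 35, "NAME")]
print(replace_entities(note, spans, FAKE_VALUES))
```

Both mentions of the same patient receive the same fake name, so a researcher reading the record can still tell they refer to one person.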
Yicheng Shi is one of 10 students who took part in the Summer of Research programme funded by Precision Driven Health. The research is at an early “proof of concept” stage. The projects offer fresh insights into what healthcare will look like when precision medicine is widely used.