Humans are inconsistent creatures, especially when it comes to our health. During periods of serious illness, we are constantly in need of medical attention, and then we get well and might not see a health professional for years. This creates an extremely uneven Electronic Health Record (EHR), and the EHR is already very complex to begin with, containing a whole range of different data types, from written medical summaries to X-ray images.

Data science has traditionally dealt with large volumes of data of the same type: clicks on ads, financial transactions, or ratings of movies we have watched. Health data, by contrast, is sporadic, messy and varied. How, for example, can a model link a lab test result with clinical notes and images, and meaningfully translate the combination into the risk of a medical condition?

In short, the EHR poses a huge challenge for data scientists!

It also presents an incredibly rich source of information that health providers globally are coming to appreciate. In what is known as its “secondary application”, anonymised data extracted from thousands of EHRs is immensely useful for a variety of clinical tasks, including disease diagnosis, readmission prediction, automatic coding and future disease modelling.

In their research, Deep Representation Learning from EHR, Dr Edmond Zhang and his colleagues¹ considered how to improve the application of deep neural networks to extracting information from anonymised EHR data. Also known as deep learning, this machine learning technique passes data through successive hidden layers of a network, each re-describing the input at a higher level of abstraction, and is applied when you want to find patterns that would be undetectable to a human.
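To make the idea of hidden layers concrete, here is a minimal Python sketch. It is illustrative only: the weights are random and untrained, and the sizes are invented rather than taken from the study, but it shows how each layer re-represents its input so that later layers can respond to patterns the raw features hide.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One fully connected layer followed by a ReLU non-linearity."""
    w = rng.normal(scale=0.1, size=(x.shape[-1], n_out))
    return np.maximum(x @ w, 0.0)

x = rng.normal(size=(4, 10))   # 4 records with 10 raw features each
h1 = layer(x, 32)              # first hidden representation
h2 = layer(h1, 16)             # second, more abstract representation
risk = h2 @ rng.normal(scale=0.1, size=(16, 1))  # one score per record
print(risk.ravel())
```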

In recent studies using deep neural networks, researchers have approached EHR data by concentrating on a single aspect, for example discharge summaries. In what is believed to be the first study of its kind, Dr Zhang proposed taking a more holistic approach: combining multiple neural network architectures so that all the information in every EHR in the dataset is taken into account. He points out that this is the way clinicians themselves work: they process all relevant observations in their natural form, such as discharge summaries (text), clinical images (images) and lab tests (numbers).
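The study's exact architectures are not reproduced here, but the general pattern can be sketched. In the hypothetical PyTorch example below (the class name, layer sizes and label counts are all invented for illustration), each data type gets an encoder suited to it, and the resulting representations are fused into one prediction:

```python
import torch
import torch.nn as nn

class MultiModalEHR(nn.Module):
    """Sketch: one encoder per data type, concatenated before the head."""
    def __init__(self, vocab=5000, n_labs=20, n_codes=100):
        super().__init__()
        # Text (e.g. discharge summaries): bag-of-words embedding.
        self.text_enc = nn.EmbeddingBag(vocab, 64)
        # Images (e.g. X-rays): a small convolutional encoder.
        self.img_enc = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 64), nn.ReLU(),
        )
        # Numbers (e.g. lab tests): a plain feed-forward encoder.
        self.lab_enc = nn.Sequential(nn.Linear(n_labs, 64), nn.ReLU())
        # The fused head predicts, say, diagnostic codes.
        self.head = nn.Linear(64 * 3, n_codes)

    def forward(self, text_ids, image, labs):
        z = torch.cat([self.text_enc(text_ids),
                       self.img_enc(image),
                       self.lab_enc(labs)], dim=-1)
        return self.head(z)  # one logit per candidate code

model = MultiModalEHR()
logits = model(torch.randint(0, 5000, (2, 50)),  # 2 patients, 50 word ids
               torch.randn(2, 1, 32, 32),        # 2 dummy grayscale images
               torch.randn(2, 20))               # 2 lab-test vectors
print(logits.shape)  # torch.Size([2, 100])
```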

Using a large dataset containing around 60,000 ICU admission records collected over 11 years at the Beth Israel Deaconess Medical Center in Boston (the MIMIC-III dataset), Dr Zhang and his team examined how to automatically predict either ICD-9 diagnostic codes or lab test orders. To do this, the team randomly divided the data so that 90% was used for training the model and 10% for testing, and repeated each experiment ten times, with a different random seed and therefore a different train/test split each time. What they found was that bringing the “model to data”, that is, making the model adapt to suit the data rather than the other way around, meant that all the data could be taken into account, improving the quality of the outcomes. In addition, combining the outcomes from multiple neural networks improved prediction accuracy in each area.
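As a rough illustration of that protocol, the Python sketch below repeats a 90/10 random split ten times with different seeds and averages the predicted class probabilities of two models. The data is synthetic, and the simple scikit-learn classifiers merely stand in for the study's neural networks; nothing here reproduces the actual experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))           # synthetic patient features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary outcome

scores = []
for seed in range(10):                    # ten different random seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed)  # 90% train, 10% test
    models = [LogisticRegression().fit(X_tr, y_tr),
              RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)]
    # Combine outcomes: average the models' predicted probabilities.
    proba = np.mean([m.predict_proba(X_te) for m in models], axis=0)
    scores.append(accuracy_score(y_te, proba.argmax(axis=1)))

print(f"mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```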

We are only beginning to tap into the enormous resource that large datasets of anonymised EHRs present, but one of the hardest challenges now appears to have been cracked. As Dr Zhang and his colleagues have shown, being selective about data can never tell the full story; instead, we need to adopt an inclusive approach when applying deep learning techniques to EHR datasets. After all, humans are complex creatures, so we should not expect the process of collecting and analysing data about them to be simple.

¹ Dr Zhang was assisted in this research by Reece Robinson, Principal Engineer at Orion Health, and Professor Bernhard Pfahringer, Professor of Computer Science at the University of Auckland.