Missing Health Data Imputation
March 2018 – February 2019
Medical professionals analysing electronic health data often have to work with incomplete datasets. If the missing data is not handled correctly, the analysis can suffer from bias, complications and, ultimately, invalid conclusions.
Finding ways to deal with missing data is at the heart of the project "Value added by Multiple Imputation on Real World Datasets", led by Jiunn Howe Lee and University of Auckland Professor Thomas Lumley. The team is investigating a technique known as Multiple Imputation (MI), which has been around for more than three decades but is being re-imagined in light of advances in machine learning.
Until now, adoption of MI has been relatively low, probably for two reasons: the method is perceived to be too complex, and many researchers are unaware of it or of the dangers of not handling missing data correctly.
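For readers who have not met MI before, the general idea is that each missing value is filled in several times, the analysis is run on every completed dataset, and the results are then combined using Rubin's rules. The sketch below shows only that final pooling step. The function name `pool_rubin` and the numbers are hypothetical and for illustration only; they are not results from this project.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates using Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical coefficient estimates from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, total variance = {total_var:.3f}")
```

The between-imputation term is what separates MI from filling each gap once: it carries the uncertainty about the missing values through to the final standard error.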
This research evaluates MI techniques and their impact on prediction models. It compares three methods in particular: MICE, the classical MI approach based on chained equations; missRanger, a modern machine learning-based MI method using chained tree ensembles; and MIDAS, another modern MI approach, this time using denoising autoencoders based on neural networks.
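The "chained" idea shared by the first two methods can be illustrated with a deliberately small sketch: each incomplete variable is imputed in turn from the others, and the cycle is repeated. The code below is not the MICE or missRanger implementation. It assumes only two numeric columns, uses simple linear regression with added noise, and skips the parameter-uncertainty draws a full implementation would make. The names `fit_line` and `chained_impute` and the data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd


def chained_impute(rows, n_iter=10, seed=0):
    """Return one completed copy of rows (lists of [x, y]; None marks a gap)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Step 1: start every gap at the mean of the observed values in its column.
    for j in range(2):
        col_mean = mean(r[j] for r in rows if r[j] is not None)
        for i, r in enumerate(data):
            if missing[i][j]:
                r[j] = col_mean
    # Step 2: cycle through the columns, re-imputing each one from the other.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # index of the predictor column
            obs = [i for i in range(len(data)) if not missing[i][j]]
            a, b, sd = fit_line([data[i][k] for i in obs],
                                [data[i][j] for i in obs])
            for i in range(len(data)):
                if missing[i][j]:
                    # Add noise so the imputed value reflects residual scatter.
                    data[i][j] = a + b * data[i][k] + rng.gauss(0, sd)
    return data


# Made-up records with one gap in each column.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, None], [4.0, 8.2], [None, 10.1], [6.0, 11.8]]
# Different seeds give different plausible completions: the "multiple" in MI.
completed = [chained_impute(rows, seed=s) for s in range(5)]
```

Tree-ensemble variants keep the same loop but swap the regression step for a more flexible model. Each completed copy would then be analysed separately and the results combined.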
For those not familiar with the various machine learning methods, it may be easier to think about the desired outcomes of the research. It will:
- evaluate the impact of MI on prediction models,
- evaluate more modern machine learning MI approaches against traditional methods, and
- explore what machine learning brings to the table, especially in terms of scalability and ease of configuration.
Projects such as this illustrate that healthcare is increasingly becoming a mathematical science, particularly in the field of research. If the project's supposition, that MI adds value on real-world datasets, is proven correct, the benefits can be measured in better health outcomes for all citizens, particularly those from minority populations, where datasets are smaller and as a result more prone to data quality issues.