Multiple imputation for better data

This article will discuss multiple imputation, a method used to handle missing data. We’ll also discuss how multiple imputation can be used in a healthcare setting to reduce bias in data to address inequity.

When it comes to analysing health data, medical professionals often have to contend with incomplete data sets.

They’re not alone: missing data is common in statistics, where even theoretically-complete administrative data can suffer from some fields not being filled in, or from impossible values being removed during data cleaning.

If this missing data isn’t handled correctly, it can result in negative outcomes such as bias, complications and ultimately invalid conclusions.

One way to handle missing data is through a process known as ‘multiple imputation’ – a technique that can be used for health data with missing values to help reduce bias when training predictive models, and for increasing applicability when deploying the models.

As machine learning techniques and technology continue to develop, multiple imputation is becoming a more powerful way to reduce the bias that missing data introduces. This is shown in a Precision Driven Health-supported research project that provides a practical guide to help data scientists apply multiple imputation.


How does imputation work?

Professor Thomas Lumley of Waipapa Taumata Rau, The University of Auckland – the author of How and Why to use Multiple Imputation – says the idea of imputation “is to fill in the gaps in the data.

“A simple version of [imputation] is trying to put in the most likely value for missing values, which is what Stats New Zealand does with the New Zealand census. They want to get a single value for each person, so they pick what they think is the most likely value. This is known as single imputation, and its strength is that it’s much more straightforward.”

All single-imputation methods, though, suffer from the problem of ‘making up data’. While single imputation completes the data set, there’s no simple way to know how much of the statistical information in the data is real and how much was created from nothing.

There’s also a risk of stereotyping, Professor Lumley continues. “Using smoking as an example, if you can only fill in a single ‘yes’ or ‘no’ value, the most likely value is going to be ‘no’. Whatever group you’re looking at, most people don’t smoke, but some people do – so your imputation will be biased towards the most common possibility.”

With multiple imputation, which is preferable when feasible, a range of values is used “that represents the uncertainty you have about the missing data,” Professor Lumley says.

“If somebody hasn’t told you whether they smoke or not – and there may be 20 percent of people likely to be a smoker and 80 percent non-smokers – you’d like a set of values where eight out of ten of them say that person is a non-smoker, and two out of ten say that person’s a smoker.

“You’re representing the uncertainty you have about that individual person smoking. And you can do that for any sort of [data].”
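The smoking example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observed answers (20 smokers, 80 non-smokers) are hypothetical, and the function names are invented for this sketch. In practice, imputed values would usually be drawn from a model that also uses the person's other recorded information.

```python
import random
from collections import Counter


def single_impute(observed):
    """Single imputation: fill the gap with the most common observed value."""
    return Counter(observed).most_common(1)[0][0]


def multiple_impute(observed, n_imputations, rng):
    """Multiple imputation (simplified): draw several plausible values,
    each in proportion to how often it appears in the observed data."""
    return [rng.choice(observed) for _ in range(n_imputations)]


rng = random.Random(0)

# Hypothetical observed answers: 20% smokers, 80% non-smokers.
observed = ["yes"] * 20 + ["no"] * 80

single = single_impute(observed)              # always the majority answer
draws = multiple_impute(observed, 1000, rng)  # a mix of "yes" and "no"
smoker_share = draws.count("yes") / len(draws)

print("single imputation:", single)
print("share of 'yes' across multiple imputations:", round(smoker_share, 2))
```

Single imputation always returns the majority answer, while the share of 'yes' among the repeated draws stays close to the 20 percent seen in the observed data, which is the uncertainty the quote describes.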


Reducing bias in data to address inequity

In Aotearoa New Zealand, Māori are underrepresented in health data because of a combination of dynamics related to access and digital inclusion. The quality of data also tends to be lacking, both in identifying ethnicity and in describing people within minority populations.

Applying multiple imputation means that analysis, and the policy and care decisions that result from it, can be more representative and can therefore counter the bias created by traditional approaches.

In addition, multiple imputation is a better approach than omitting records with missing observations from data analysis and model fitting, which has been the traditional practice.

Professor Lumley says there are a number of implications of the traditional approach of omitting records. “You end up discarding a lot of data, which is wasteful – especially when you had to ask patients or other study participants for it.

“Probably more important though, is that the people who have missing data are often different from the people who don’t have the missing data. You not only end up with a smaller set of data, but it’s also a biased set of data. You end up studying the sorts of people who have provided complete data – and they can be very misleading.”
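A small simulation can make this point concrete. The sketch below is not taken from the research project; every number in it (group sizes, outcome values, missingness rates) is a hypothetical assumption chosen so that one group is much more likely to have missing data.

```python
import random
from statistics import mean

rng = random.Random(1)

people = []
for _ in range(10_000):
    # Assumption: 30% of the population is in a higher-risk group.
    high_risk = rng.random() < 0.3
    # Assumption: the outcome is higher on average in that group.
    outcome = rng.gauss(70 if high_risk else 50, 5)
    # Assumption: the higher-risk group is far more likely to have a gap.
    missing = rng.random() < (0.6 if high_risk else 0.1)
    people.append((outcome, missing))

# Mean over everyone (only knowable here because the data is simulated).
true_mean = mean(outcome for outcome, _ in people)

# Mean over complete records only, as when incomplete records are omitted.
complete_case_mean = mean(outcome for outcome, missing in people if not missing)

# Fraction of records that survive the omission.
kept = sum(1 for _, missing in people if not missing) / len(people)

print(f"records kept: {kept:.0%}")
print(f"mean over everyone: {true_mean:.1f}")
print(f"mean over complete records only: {complete_case_mean:.1f}")
```

Under these assumptions roughly a quarter of the records are discarded, and the mean of the remaining records sits below the mean for the whole population, because the people who were dropped differ from the people who were kept.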


Imputation in healthcare

In healthcare settings, people who have missing data may be at systematically lower risk if measurements are only recorded based on clinical need, or at systematically higher risk if missing data reflects poorer access to care or higher clinician workload.

“We’ve seen [through programmes like the census] that data collection is often not as good for Māori, for many reasons. If you’re analysing administrative or otherwise automatically collected data, the results for Māori may well be less accurate.”

For example, in the 2018 New Zealand Census (External Data Quality Panel, 2019, table 4.8), 16% of respondents to the question on ‘Māori descent’ indicated they had Māori descent, but Stats New Zealand estimates that 48% of those who did not respond had Māori descent.
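The effect of these two figures on an overall estimate depends on how many people did not answer the question, which is not stated here. The sketch below therefore uses hypothetical non-response rates, purely to show the arithmetic; the function name is also an invention for this example.

```python
def overall_share(share_respondents, share_nonrespondents, nonresponse_rate):
    """Weighted average of the two groups' shares.

    nonresponse_rate is the fraction of people who did not answer.
    """
    return ((1 - nonresponse_rate) * share_respondents
            + nonresponse_rate * share_nonrespondents)


# Figures quoted above: 16% among respondents, an estimated 48% among
# non-respondents. The non-response rates below are hypothetical.
for rate in (0.05, 0.10, 0.20):
    share = overall_share(0.16, 0.48, rate)
    print(f"non-response {rate:.0%}: overall share {share:.1%}")
```

Whatever the non-response rate, an analysis that simply drops non-respondents reports 16%, while the overall share is higher by 0.32 times the non-response rate.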

Professor Lumley says: “The good implications of multiple imputation is that if you don’t have data on ethnicity, but you do have information that’s related to it, you’ve got some hope of filling in ethnicity and not overpredicting it. You can be honest about how much you know, and how much you don’t know.

“Multiple imputation lets you both take advantage of those correlations, but also not overstate how much you know, so that your imputed data will still reflect your uncertainty about what the person’s actual ethnicity is.”
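One way to picture this is to estimate, from the complete records, how often each value occurs alongside a related piece of information, and then draw imputed values in those proportions. The sketch below is a simplified illustration, not the method from the guide: the records, the 'area' variable and the group labels are all hypothetical, and a full multiple imputation would also account for uncertainty in the estimated proportions themselves.

```python
import random
from collections import Counter, defaultdict


def fit_conditional(records):
    """Count observed target values within each level of the related variable."""
    counts = defaultdict(Counter)
    for related, target in records:
        if target is not None:  # skip records where the target is missing
            counts[related][target] += 1
    return counts


def draw_imputations(counts, related, m, rng):
    """Draw m imputed values in proportion to the observed counts."""
    values = list(counts[related].keys())
    weights = list(counts[related].values())
    return rng.choices(values, weights=weights, k=m)


rng = random.Random(2)

# Hypothetical complete records: (area, group).
records = (
    [("north", "group_a")] * 60 + [("north", "group_b")] * 40
    + [("south", "group_a")] * 10 + [("south", "group_b")] * 90
    + [("north", None)]  # one person whose group was not recorded
)

counts = fit_conditional(records)
imputed = draw_imputations(counts, "north", 1000, rng)
share_a = imputed.count("group_a") / len(imputed)

print("share imputed as group_a for the 'north' record:", round(share_a, 2))
```

The related variable shifts the odds (about 60 percent rather than the 35 percent seen across all records), but the imputed values are still a mix, so the data does not claim more certainty about the individual than it has.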


An interim step

Under traditional approaches, incomplete data would have been omitted. Multiple imputation, by contrast, is what Professor Lumley describes as “an interim step” until better data can be collected to inform better decision making.

“Having access to better data is the goal, which requires greater involvement from groups you need data from in survey design. Getting input from the communities on good or bad ways to collect data, and on good projects that are worth doing, is important for better data.

“If you can impute the missing values [where high quality data isn’t available], that gives you more ability to include everyone. You get to analyse a less biased subset of people when you’re comparing by ethnicity, for example.

“You can’t reduce [bias in data] completely through multiple imputation, but the more data you have, the more ability you have both to fill in some of the missing information and get a good idea of how much information you’re losing. You’re not just collapsing your data to a convenient, well-behaved subset.”



(1). This means that the resulting model will generalise better to different domains or applications, as each application can introduce its own skews (e.g. some domains are skewed towards men, others towards women).
