Handling missing data

CMOR Lunch’n’Learn

6 June 2024

Ross Wilson

Missing data are ubiquitous in health research

What are the reasons for missing data?

For concreteness, let’s consider the example of a clinical trial
- Baseline/follow-up data
- Patient-reported outcomes
- Clinical/laboratory/performance tests
- Linked health data

Two questions:

Should we care about missing data?

If yes, what should we do about it?

Should we care?

Or, more specifically:

When should we care?

And when is it likely to be less important?

Amount of missing data

More missing data => more of a problem

But there is probably no general ‘rule of thumb’ as to how much is too much

Missing data ‘mechanism’

Describes the probability that each data point will be missing

Following Rubin (1976), we recognise three categories of missing data mechanism:
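- MCAR (missing completely at random): the probability of being missing is the same for all data points
- MAR (missing at random): the probability of being missing depends only on observed data
- MNAR (missing not at random): everything else, i.e. the probability of being missing also depends on unobserved data (including the missing values themselves)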
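
To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the variables `x` (always observed) and `y` (possibly missing), the logistic missingness models and their coefficients. The true mean of `y` is 0, so the mean of the observed `y` values shows how far each mechanism pulls a naive estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x is always observed, y may be missing.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # true mean of y is 0


def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every y value has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness in y depends only on the observed x.
miss_mar = rng.random(n) < expit(-1.0 + 1.5 * x)
# MNAR: missingness in y depends on the unobserved value of y itself.
miss_mnar = rng.random(n) < expit(-1.0 + 1.5 * y)

observed_means = {}
for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    observed_means[name] = y[~miss].mean()
    print(f"{name}: {miss.mean():.0%} missing, "
          f"mean of observed y = {observed_means[name]:.3f}")
```

In this setup only the MCAR mechanism leaves the observed mean close to the truth. Note that the MAR and MNAR cases cannot be told apart from the observed data alone, because the values that would distinguish them are exactly the ones that are missing.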

What should we do?

Practical

Try to minimise the amount of missing data

Strategies:
- Minimise participant burden (long questionnaires, invasive tests, etc.)
- Use incentives to encourage participant engagement and response
- Adapt the mode of data collection to the study population
- Follow up non-responders promptly

Statistical

- Complete case analysis
- Include a ‘missing data’ indicator variable
- Likelihood-based approaches
- Impute missing data
- Bayesian approaches

Complete case analysis

Simply drop any observations with missing data

Unbiased (but usually inefficient) if missing data are MCAR
Usually biased (possibly severely) otherwise

But very common – check the sample sizes for different analyses (if reported) in published studies
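
A small sketch of why complete case analysis is usually inefficient even in the best case (MCAR). The numbers are assumptions chosen for illustration: 1,000 participants, 5 variables, and each value independently missing with probability 10%.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1_000, 5  # hypothetical: 1,000 participants, 5 variables

data = rng.normal(size=(n, k))
# MCAR: each value is independently missing with probability 10%.
data[rng.random((n, k)) < 0.10] = np.nan

# A complete case is a row with no missing value in any variable.
complete = ~np.isnan(data).any(axis=1)
n_complete = int(complete.sum())
expected_share = 0.9 ** k  # probability that all k values are observed

print(f"complete cases: {n_complete} of {n} "
      f"(expected share about {expected_share:.0%})")
```

Although only 10% of the individual values are missing, roughly 40% of participants are dropped, and the loss grows with every variable added to the analysis.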

Indicator approach

Set missing values in a variable to zero (or some other relevant value)
Add a ‘missingness’ indicator variable to the (regression) analysis

Unbiased under some (restrictive) conditions, but biased in general
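
A sketch of the indicator approach and of one way it can go wrong. The setting is assumed for illustration: an exposure `t` that is correlated with a covariate `x`, both with a true coefficient of 1, and `x` missing completely at random for half of the sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical observational data: exposure t is correlated with covariate x.
x = rng.normal(size=n)
t = 0.7 * x + rng.normal(size=n)
y = 1.0 * t + 1.0 * x + rng.normal(size=n)  # true coefficient of t is 1

miss = rng.random(n) < 0.5  # x is MCAR for about half of the sample


def ols(columns, outcome):
    """Least-squares coefficients, with the intercept first."""
    X = np.column_stack([np.ones(len(outcome)), *columns])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


# Indicator approach: set missing x to zero and add a missingness indicator.
x_filled = np.where(miss, 0.0, x)
indicator = miss.astype(float)
coef_indicator = ols([t, x_filled, indicator], y)

# Complete case analysis for comparison (unbiased here because x is MCAR).
coef_complete = ols([t[~miss], x[~miss]], y[~miss])

print(f"coefficient of t, indicator approach: {coef_indicator[1]:.2f}")
print(f"coefficient of t, complete cases:     {coef_complete[1]:.2f}")
```

Among the rows where `x` is missing, the model cannot adjust for `x`, so the estimated coefficient of `t` absorbs part of the effect of `x`. If `t` were independent of `x`, as with randomised treatment and a baseline covariate, this bias would not arise, which is one of the restrictive conditions under which the approach works.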

Likelihood-based approaches

Define and estimate a statistical model for the observed data (including probability of being observed)

Unbiased and efficient if the model is correctly specified, but complicated, and relies on untestable assumptions about the underlying ‘true’ model
Related to multiple imputation and Bayesian methods

Impute missing data

Replace each missing data point with a substitute (imputed) value

There are lots of ways of doing this:
- Mean imputation
- Last observation carried forward (LOCF)
- Regression imputation
- Stochastic regression imputation
- Multiple imputation
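
Three of the single-imputation methods above can be compared in a few lines. This is a sketch under assumed conditions: a simulated outcome `y` with variance 2, one fully observed predictor `x`, and 40% of `y` missing completely at random. The comparison looks at how each method distorts the variance of `y`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical data: var(y) = 2, with 40% of y missing completely at random.
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)
miss = rng.random(n) < 0.4
obs = ~miss

# Mean imputation: every missing y gets the observed mean.
y_mean = y.copy()
y_mean[miss] = y[obs].mean()

# Regression imputation: every missing y gets its predicted value.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
pred = intercept + slope * x[miss]
y_reg = y.copy()
y_reg[miss] = pred

# Stochastic regression imputation: prediction plus random residual noise.
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)
y_stoch = y.copy()
y_stoch[miss] = pred + rng.normal(scale=resid_sd, size=miss.sum())

variances = {
    "full data (unknown in practice)": y.var(ddof=1),
    "mean imputation": y_mean.var(ddof=1),
    "regression imputation": y_reg.var(ddof=1),
    "stochastic regression imputation": y_stoch.var(ddof=1),
}
for name, value in variances.items():
    print(f"{name}: variance of y = {value:.2f}")
```

Mean and regression imputation both make the data look less variable than they are, which leads to standard errors that are too small. Adding random noise restores the variance, but a single imputed dataset still treats the imputed values as if they had been observed.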

Multiple imputation

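This is generally considered the preferred general-purpose approach for dealing with missing data (it gives valid inference when data are MAR and the imputation model is appropriate)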
In a nutshell:
1. Predict the expected values of missing data based on observed data
2. Randomly draw imputed values for the missing data from these predictions
3. Conduct the intended analysis with this ‘filled-in’ dataset
4. Repeat steps 2 and 3 multiple times
5. Pool the results from each repeated analysis to get combined estimates
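
The five steps can be sketched by hand as follows. This is an illustration under stated assumptions, not a substitute for a dedicated package: the data are simulated (true mean of `y` is 1), missingness is MAR given `x`, the imputation model is a simple linear regression, and uncertainty in that model is reflected by refitting it on a bootstrap resample of the observed cases in each round. Pooling uses Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_imputations = 10_000, 20

# Hypothetical data: true mean of y is 1; y is MAR given the observed x.
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * x)))
obs_idx = np.flatnonzero(~miss)
# In real data y[miss] would be unknown; it is never used below.

estimates, variances = [], []
for _ in range(n_imputations):
    # Step 1: fit the imputation model (on a bootstrap resample, so that
    # uncertainty about the model itself carries into the imputations).
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y[boot], 1)
    resid_sd = np.std(y[boot] - (intercept + slope * x[boot]), ddof=2)
    # Step 2: draw imputed values from the predictive distribution.
    y_imp = y.copy()
    y_imp[miss] = (intercept + slope * x[miss]
                   + rng.normal(scale=resid_sd, size=miss.sum()))
    # Step 3: run the intended analysis (here, estimating the mean of y).
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)
    # Step 4: the loop repeats steps 1 to 3 for each imputed dataset.

# Step 5: pool with Rubin's rules.
estimates, variances = np.array(estimates), np.array(variances)
pooled_mean = estimates.mean()
within = variances.mean()          # average within-imputation variance
between = estimates.var(ddof=1)    # between-imputation variance
pooled_se = np.sqrt(within + (1 + 1 / n_imputations) * between)

complete_case_mean = y[~miss].mean()
print(f"complete case mean:       {complete_case_mean:.3f}")
print(f"multiple imputation mean: {pooled_mean:.3f} (SE {pooled_se:.3f})")
```

The between-imputation variance is what makes the pooled standard error larger than that of any single filled-in dataset: it carries the uncertainty about the missing values into the final estimate.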

Bayesian approaches

From a Bayesian perspective, missing data can be viewed as unknown parameters of the underlying model
- Just as e.g. treatment effects are unknown parameters of the model

In principle, this is very similar to the multiple imputation approach, except that the ‘predict missing values’ and ‘estimate the model’ steps are combined instead of separated
- The imputation step of multiple imputation is essentially sampling from the posterior distribution of the missing data

Summary

Back to our two questions:

Should we care about missing data?
- Yes (usually)

If yes, what should we do about it?
- Multiple imputation (usually)
- Or Bayesian models

Resources

Flexible Imputation of Missing Data, 2nd edition. Stef van Buuren. Chapman and Hall/CRC (2018). Freely available online: https://stefvanbuuren.name/fimd
Applied Missing Data Analysis. Craig K. Enders. New York, NY: The Guilford Press (2010)
Statistical Analysis with Missing Data, 3rd edition. Roderick J. A. Little and Donald B. Rubin. Hoboken, NJ: John Wiley & Sons (2019). The 2nd edition (2002) is freely available from the publisher: https://doi.org/10.1002/9781119013563