Handling missing data

CMOR Lunch’n’Learn

6 June 2024

Ross Wilson

Missing data are ubiquitous in health research

What are the reasons for missing data?


  • For concreteness, let’s consider the example of a clinical trial

    • Baseline/follow-up data

    • Patient-reported outcomes

    • Clinical/laboratory/performance tests

    • Linked health data

Two questions:


  • Should we care about missing data?


  • If yes, what should we do about it?

Should we care?

Or, more specifically:


  • When should we care?


  • And when is it likely to be less important?

Amount of missing data


  • The more data are missing, the greater the potential for bias and loss of precision


  • But there is probably no general ‘rule of thumb’ as to how much is too much

Missing data ‘mechanism’


  • Describes how the probability that a data point is missing relates to the data (observed and unobserved)


  • Following Rubin (1976), we recognise three categories of missing data mechanism:

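    • MCAR (missing completely at random) – probability of being missing is the same for all data points

    • MAR (missing at random) – probability of being missing depends only on observed data

    • MNAR (missing not at random) – everything else, i.e. probability of being missing depends on unobserved data, including the missing values themselves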
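
To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the original slides: the variables `x` (a fully observed covariate) and `y` (an outcome subject to missingness), the logistic missingness functions, and the 30% MCAR rate. It compares the mean of the observed `y` values under each mechanism with the mean of the full data.

```python
import math
import random
from statistics import mean


def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def simulate(n=20000, seed=1):
    """Simulate (x, y) pairs: x is always observed, y may go missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)       # e.g. a baseline score
        y = x + rng.gauss(0, 1)   # e.g. a follow-up outcome correlated with x
        rows.append((x, y))
    return rows


def observed_mean(rows, p_missing, seed=2):
    """Mean of y among rows that remain observed under a missingness rule."""
    rng = random.Random(seed)
    kept = [y for x, y in rows if rng.random() >= p_missing(x, y)]
    return mean(kept)


rows = simulate()
true_mean = mean(y for _, y in rows)

# MCAR: every y has the same 30% chance of being missing.
mcar_mean = observed_mean(rows, lambda x, y: 0.3)
# MAR: missingness depends only on the observed covariate x.
mar_mean = observed_mean(rows, lambda x, y: logistic(x))
# MNAR: missingness depends on the (possibly unobserved) value of y itself.
mnar_mean = observed_mean(rows, lambda x, y: logistic(y))

print(f"full data: {true_mean:.3f}")
print(f"MCAR:      {mcar_mean:.3f}")
print(f"MAR:       {mar_mean:.3f}")
print(f"MNAR:      {mnar_mean:.3f}")
```

In this setup the observed mean should sit close to the full-data mean under MCAR and be shifted under MAR and MNAR. The practical difference is that under MAR the shift can in principle be corrected using `x`, whereas under MNAR the observed data alone are not enough.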

What should we do?

Practical


  • Try to minimise the amount of missing data


  • Strategies:

    • Minimise participant burden (long questionnaires, invasive tests, etc.)

    • Use incentives to encourage participant engagement and response

    • Adapt the mode of data collection to the study population

    • Follow up non-responders promptly

Statistical


  • Complete case analysis

  • Include a ‘missing data’ indicator variable

  • Likelihood-based approaches

  • Impute missing data

  • Bayesian approaches

Complete case analysis


  • Simply drop any observations with missing data


  • Unbiased (but usually inefficient) if missing data are MCAR

  • Usually biased (possibly severely) otherwise


  • But very common – check the sample sizes for different analyses (if reported) in published studies
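
As a rough illustration of why sample sizes differ between analyses, the sketch below applies complete case analysis to a handful of made-up records (the field names and values are invented for illustration, with `None` marking a missing value). The helper `complete_cases` is a hypothetical name, not a library function.

```python
# Made-up records for illustration only; None marks a missing value.
records = [
    {"id": 1, "age": 54, "baseline": 32.0, "followup": 41.0},
    {"id": 2, "age": 61, "baseline": None, "followup": 38.0},
    {"id": 3, "age": 47, "baseline": 28.5, "followup": None},
    {"id": 4, "age": 70, "baseline": 35.0, "followup": 44.5},
    {"id": 5, "age": 66, "baseline": None, "followup": None},
    {"id": 6, "age": 58, "baseline": 30.0, "followup": 39.0},
]


def complete_cases(rows, variables):
    """Keep only rows with no missing value in any of the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# An analysis of follow-up alone uses every row with follow-up observed.
n_unadjusted = len(complete_cases(records, ["followup"]))
# Adjusting for baseline silently drops rows with baseline missing as well.
n_adjusted = len(complete_cases(records, ["followup", "baseline"]))

print(len(records), n_unadjusted, n_adjusted)
```

Each added variable can only shrink the analysed sample, which is why the reported n often changes from one model to the next in a published paper.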

Indicator approach


  • Set missing values in a variable to zero (or some other relevant value)

  • Add a ‘missingness’ indicator variable to the (regression) analysis


  • Unbiased under some (restrictive) conditions, but biased in general
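
The two steps above can be sketched as follows. The function name `add_missing_indicator` and the default fill value of zero are assumptions for illustration; the filled variable and the indicator would then both be entered as covariates in the regression.

```python
def add_missing_indicator(values, fill_value=0.0):
    """Return (filled, indicator) for a variable with None as missing.

    filled    : the variable with missing entries set to fill_value
    indicator : 1 where the original value was missing, 0 otherwise
    """
    filled = [fill_value if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Made-up covariate values for illustration.
bmi = [24.1, None, 31.5, None, 27.8]
bmi_filled, bmi_missing = add_missing_indicator(bmi)

print(bmi_filled)
print(bmi_missing)
```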

Likelihood-based approaches


  • Define and estimate a statistical model for the observed data (including probability of being observed)


  • Unbiased and efficient if the model is correctly specified, but complicated, and reliant on untestable assumptions about the underlying ‘true’ model

  • Related to multiple imputation and Bayesian methods
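
For reference, the observed-data likelihood can be written out as below. The notation is an assumption introduced here, not taken from the slides: the data are split into observed and missing parts, r denotes the missingness indicators, θ the parameters of the data model, and ψ the parameters of the missingness model.

```latex
% Full likelihood of what we actually observe: the observed values and
% the pattern of missingness r, integrating over the missing values.
L(\theta, \psi \mid y_{\mathrm{obs}}, r)
  = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\,
         p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\,
    \mathrm{d}y_{\mathrm{mis}}

% If the data are MAR and \theta and \psi are distinct, the missingness
% term does not depend on y_mis and factors out of the integral:
L(\theta, \psi \mid y_{\mathrm{obs}}, r)
  = p(r \mid y_{\mathrm{obs}}, \psi)
    \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\,
    \mathrm{d}y_{\mathrm{mis}}
```

In the second case the missingness mechanism is said to be ignorable: inference about θ can be based on the integral alone, without modelling the probability of being observed.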

Impute missing data


  • Replace each missing data point with a plausible substitute value


  • There are lots of ways of doing this:

    • Mean imputation

    • Last observation carried forward (LOCF)

    • Regression imputation

    • Stochastic regression imputation

    • Multiple imputation
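
The single-imputation methods in this list can be sketched in a few lines each. This is a minimal illustration under stated assumptions: missing values are coded as `None`, the regression model is a simple straight line of `y` on one fully observed covariate `x`, and all function names and data are made up for the example.

```python
import random
from statistics import mean


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def locf(values):
    """Last observation carried forward along one participant's visits."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)  # stays None until a first value has been observed
    return out


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD (needs 3+ points)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


def regression_impute(xs, ys, stochastic=False, seed=0):
    """Impute missing y from x; optionally add random residual noise."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    out = []
    for x, y in zip(xs, ys):
        if y is None:
            noise = rng.gauss(0, sd) if stochastic else 0.0
            y = a + b * x + noise
        out.append(y)
    return out


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, None, 12.1]

print(mean_impute(ys))
print(locf(ys))
print(regression_impute(xs, ys))
print(regression_impute(xs, ys, stochastic=True))
```

Mean and plain regression imputation place every imputed value exactly on the mean or the fitted line, which understates variability. The stochastic version adds back residual noise, and multiple imputation builds on that idea by repeating the draw several times.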

Multiple imputation

  • This is generally considered the optimal approach for dealing with missing data

  • In a nutshell:

    1. Predict the expected values of missing data based on observed data

    2. Randomly draw imputed values for the missing data from these predictions

    3. Conduct the intended analysis with this ‘filled-in’ dataset

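    4. Repeat steps 2 and 3 multiple times, giving several ‘filled-in’ datasets and sets of results

    5. Pool the results from each repeated analysis (using Rubin’s rules) to get combined estimates and standard errors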
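
The five steps above can be sketched as follows for a deliberately simple analysis: estimating the mean of an outcome `y` that is MAR given a fully observed covariate `x`. This is a hedged sketch, not a production implementation. The simulated data, the straight-line imputation model, the function names, and the choice of 20 imputations are all assumptions. One addition beyond the steps as listed: the imputation model is refitted on a bootstrap resample for each imputation, so that uncertainty in the model's parameters is also reflected.

```python
import math
import random
from statistics import mean, variance


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


def pooled_mean(xs, ys, m=20, seed=0):
    """Multiple imputation estimate of mean(y) and its standard error."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, variances = [], []
    for _ in range(m):                                   # step 4: repeat
        boot = rng.choices(obs, k=len(obs))              # bootstrap resample
        a, b, sd = fit_line([x for x, _ in boot],        # step 1: predict
                            [y for _, y in boot])
        completed = [y if y is not None
                     else a + b * x + rng.gauss(0, sd)   # step 2: random draw
                     for x, y in zip(xs, ys)]
        estimates.append(mean(completed))                # step 3: analyse
        variances.append(variance(completed) / len(completed))
    # Step 5: pool with Rubin's rules.
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance
    total = within + (1 + 1 / m) * between
    return mean(estimates), math.sqrt(total)


# Simulated example: y is more likely to be missing when x is high (MAR).
rng = random.Random(42)
xs, y_full, ys = [], [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = 1 + x + rng.gauss(0, 1)
    missing = rng.random() < 1 / (1 + math.exp(-x))
    xs.append(x)
    y_full.append(y)
    ys.append(None if missing else y)

full_mean = mean(y_full)
cc_mean = mean(y for y in ys if y is not None)
mi_estimate, mi_se = pooled_mean(xs, ys)

print(f"full data:        {full_mean:.3f}")
print(f"complete cases:   {cc_mean:.3f}")
print(f"multiple imputed: {mi_estimate:.3f} (SE {mi_se:.3f})")
```

In this simulation the complete case mean should be noticeably too low, while the pooled estimate should land close to the full-data mean. The pooled standard error combines the ordinary sampling variance with the extra variance due to not knowing the missing values.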

Bayesian approaches


  • From a Bayesian perspective, missing data can be viewed as unknown parameters of the underlying model

    • Just as e.g. treatment effects are unknown parameters of the model


  • In principle, this is very similar to the multiple imputation approach, except that the ‘predict missing values’ and ‘estimate the model’ steps are combined instead of separated

    • The imputation step of multiple imputation is essentially sampling from the posterior distribution of the missing data
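
The connection can be written out explicitly. The notation below is an assumption introduced for illustration (θ for the model parameters, with the data split into observed and missing parts), and it assumes the missingness mechanism is ignorable.

```latex
% Bayesian approach: missing data and parameters are estimated jointly.
p(\theta, y_{\mathrm{mis}} \mid y_{\mathrm{obs}})
  \propto p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\, p(\theta)

% Multiple imputation: imputed values are draws from the posterior
% predictive distribution of the missing data, averaging over \theta.
p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}})
  = \int p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \theta)\,
         p(\theta \mid y_{\mathrm{obs}})\, \mathrm{d}\theta
```

The first line treats the missing values as one more block of unknowns in a single model, while the second is what the imputation step samples from before the analysis model is fitted separately.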

Summary

Back to our two questions:


  • Should we care about missing data?

    • Yes (usually)


  • If yes, what should we do about it?

    • Multiple imputation (usually)

    • Or Bayesian models

Resources

  • Flexible Imputation of Missing Data, 2nd edition. Stef van Buuren. Chapman and Hall/CRC (2018). Freely available online: https://stefvanbuuren.name/fimd

  • Applied Missing Data Analysis. Craig K. Enders. New York, NY: The Guilford Press (2010)

  • Statistical Analysis with Missing Data, 3rd edition. Roderick J. A. Little and Donald B. Rubin. Hoboken, NJ: John Wiley & Sons (2019). The 2nd edition (2002) is freely available from the publisher: https://doi.org/10.1002/9781119013563