CMOR Lunch’n’Learn
6 June 2024
Ross Wilson
For concreteness, let’s consider the example of a clinical trial
Baseline/follow-up data
Patient-reported outcomes
Clinical/laboratory/performance tests
Linked health data
Following Rubin (1976), we recognise three categories of missing data mechanism:
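MCAR (missing completely at random) – probability of being missing is the same for all data
MAR (missing at random) – probability of being missing depends only on observed data
MNAR (missing not at random) – everything else, i.e. probability of being missing also depends on unobserved data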
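The three mechanisms can be written more formally. The notation below is an assumption, not taken from the slides: R is the matrix of missingness indicators, Y_obs and Y_mis are the observed and missing parts of the data, and ψ holds the parameters of the missingness process.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R \mid \psi) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R \mid Y_{\text{obs}}, \psi) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) \text{ does not simplify; it depends on } Y_{\text{mis}}
\end{align*}
```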
Strategies for minimising missing data:
Minimise participant burden (long questionnaires, invasive tests, etc.)
Use incentives to encourage participant engagement and response
Adapt the mode of data collection to the study population
Follow up non-responders promptly
Options for handling missing data:
Complete case analysis
Include a ‘missing data’ indicator variable
Likelihood-based approaches
Impute missing data
Bayesian approaches
Complete case analysis:
Unbiased (but usually inefficient) if missing data are MCAR
Usually biased (possibly severely) otherwise
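A small simulation can illustrate the contrast between MCAR and MAR for a complete case analysis. This is only a sketch under made-up assumptions: the data-generating model, the missingness probabilities and the function name `complete_case_mean` are all hypothetical, and the 'analysis' is simply the mean of the outcome.

```python
import random
import statistics


def complete_case_mean(mechanism, n=20000, seed=1):
    """Simulate one dataset; return (full-data mean, complete-case mean) of y."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)              # covariate, always observed
        y = 2.0 * x + rng.gauss(0, 1)    # outcome, sometimes missing
        if mechanism == "MCAR":
            p_missing = 0.3              # same for every participant
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1   # depends on observed x only
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:    # y is observed for this participant
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for mechanism in ("MCAR", "MAR"):
    full, cc = complete_case_mean(mechanism)
    print(f"{mechanism}: full-data mean {full:.2f}, complete-case mean {cc:.2f}")
```

In this setup the MAR complete-case mean is pulled below the full-data mean, because participants with high x (and therefore high y) are more likely to be missing.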
‘Missing data’ indicator variable:
Set missing values in a variable to zero (or some other relevant value)
Add a ‘missingness’ indicator variable to the (regression) analysis
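The two steps above amount to a simple data-preparation step, sketched here. The function name `add_missing_indicator` and the `age` values are hypothetical, and missing values are assumed to be coded as `None`.

```python
def add_missing_indicator(values, fill_value=0.0):
    """Return (filled values, 0/1 missingness indicator) for one variable."""
    filled = [fill_value if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical covariate with two missing entries.
age = [34.0, None, 51.0, None, 47.0]
age_filled, age_missing = add_missing_indicator(age)
print(age_filled)    # [34.0, 0.0, 51.0, 0.0, 47.0]
print(age_missing)   # [0, 1, 0, 1, 0]
```

Both `age_filled` and `age_missing` would then enter the regression as covariates.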
Likelihood-based approaches:
Unbiased and efficient, but complicated and reliant on untestable assumptions about the underlying ‘true’ model
Related to multiple imputation and Bayesian methods
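As a rough sketch of what a likelihood-based approach maximises: assuming the missingness is ignorable (MAR, with separate parameters for the data model and the missingness process), inference about the model parameters θ can be based on the observed-data likelihood, which integrates over the missing values. The symbols θ, Y_obs and Y_mis are notation introduced here, not taken from the slides.

```latex
L(\theta \mid Y_{\text{obs}}) \;\propto\; \int p(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta)\, \mathrm{d}Y_{\text{mis}}
```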
There are lots of ways of imputing missing data:
Mean imputation
Last observation carried forward (LOCF)
Regression imputation
Stochastic regression imputation
Multiple imputation
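The first four methods in this list can be sketched in a few lines each. This is an illustration only: the function names and the toy data are hypothetical, missing values are coded as `None`, and the regression uses a single fully observed predictor.

```python
import math
import random
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    mean = statistics.fmean([v for v in values if v is not None])
    return [mean if v is None else v for v in values]


def locf(values):
    """Carry the last observed value forward over later gaps (time-ordered data)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (len(xs) - 2))


def regression_impute(xs, ys, rng=None):
    """Impute missing y from x; add random residual noise if rng is given."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            noise = rng.gauss(0, sd) if rng is not None else 0.0
            y = a + b * x + noise
        filled.append(y)
    return filled


# Hypothetical toy data: x fully observed, y missing at two visits.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, None, 5.8, None, 10.3, 11.9]

print(mean_impute(ys))                                   # mean imputation
print(locf(ys))                                          # LOCF
print(regression_impute(xs, ys))                         # regression imputation
print(regression_impute(xs, ys, rng=random.Random(0)))   # stochastic regression
```

Each of these produces a single filled-in dataset, so any later analysis treats the imputed values as if they had been observed. Multiple imputation, the last item in the list, repeats the stochastic version several times.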
Multiple imputation is generally considered the optimal approach for dealing with missing data
In a nutshell:
1. Predict the expected values of missing data based on observed data
2. Randomly draw imputed values for the missing data from these predictions
3. Conduct the intended analysis with this ‘filled-in’ dataset
4. Repeat steps 2 and 3 multiple times
5. Pool the results from each repeated analysis to get combined estimates
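The five steps can be sketched as below, with pooling done by Rubin's rules. Several things here are assumptions for illustration: the simulated data, the function names, a single-predictor linear imputation model, and the mean of y as the 'intended analysis'. For simplicity the fitted regression is held fixed across imputations; fuller implementations also draw the regression parameters each time, so this sketch will somewhat understate the uncertainty.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (len(xs) - 2))


def impute_and_pool(xs, ys, m=20, seed=0):
    """Multiply impute missing y (None) from x; pool the estimated mean of y."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Step 1: prediction model fitted to the observed data.
    a, b, sd = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    estimates, variances = [], []
    for _ in range(m):  # Step 4: repeat steps 2 and 3
        # Step 2: draw imputed values around the predictions.
        completed = [a + b * x + rng.gauss(0, sd) if y is None else y
                     for x, y in zip(xs, ys)]
        # Step 3: the intended analysis (mean of y and its sampling variance).
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    # Step 5: Rubin's rules: total = within + (1 + 1/m) * between.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return pooled, math.sqrt(within + (1 + 1 / m) * between)


# Hypothetical MAR data: y is more often missing when the observed x is high.
rng = random.Random(42)
xs, ys_full, ys = [], [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = 2.0 * x + rng.gauss(0, 1)
    missing = rng.random() < (0.6 if x > 0 else 0.1)
    xs.append(x)
    ys_full.append(y)
    ys.append(None if missing else y)

full_mean = statistics.fmean(ys_full)
cc_mean = statistics.fmean([y for y in ys if y is not None])
pooled, se = impute_and_pool(xs, ys)
print(f"full data {full_mean:.2f}, complete cases {cc_mean:.2f}, "
      f"multiple imputation {pooled:.2f} (SE {se:.2f})")
```

In this simulation the pooled estimate should sit close to the full-data mean, while the complete-case mean does not.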
From a Bayesian perspective, missing data can be viewed as unknown parameters of the underlying model
In principle, this is very similar to the multiple imputation approach, except that the ‘predict missing values’ and ‘estimate the model’ steps are combined instead of separated
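In symbols, and again assuming ignorable missingness, the Bayesian approach works with the joint posterior of the model parameters θ and the missing data, given a prior p(θ). The notation is introduced here for illustration and is not from the slides.

```latex
\begin{align*}
p(\theta, Y_{\text{mis}} \mid Y_{\text{obs}}) &\propto p(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta)\, p(\theta) \\
p(\theta \mid Y_{\text{obs}}) &= \int p(\theta, Y_{\text{mis}} \mid Y_{\text{obs}})\, \mathrm{d}Y_{\text{mis}}
\end{align*}
```

Sampling θ and Y_mis together from the first line and then discarding the Y_mis draws gives inference about θ that already averages over the missing values.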
Should we care about missing data?
If yes, what should we do about it?
Multiple imputation (usually)
Or Bayesian models
Flexible Imputation of Missing Data, 2nd edition. Stef van Buuren. Chapman and Hall/CRC (2018). Freely available online: https://stefvanbuuren.name/fimd
Applied Missing Data Analysis. Craig K. Enders. New York, NY: The Guilford Press (2010)
Statistical Analysis with Missing Data, 3rd edition. Roderick J. A. Little and Donald B. Rubin. Hoboken, NJ: John Wiley & Sons (2019). The 2nd edition (2002) is freely available from the publisher: https://doi.org/10.1002/9781119013563