Handling missing data

Hello all!

I have collected patient data retrospectively but for some variables (both numerical and categorical) data is missing.

I have 13 variables with missing data, ranging from 2 to 369 missing cases out of a total of 486 cases (so a variable is complete when all 486 cases are present). The variables with the most missing data are numerical (pain scale from 0-10). I have tried to figure out how I should deal with this but do not fully understand to what extent data can be imputed. For instance, if only 2 cases are missing I can imagine that won't be too big of a problem, but when 369 cases (76%!) are missing I am not sure imputation is feasible. Using SPSS I have calculated the EM mean (0,332), which can be interpreted as the data being randomly missing (is this correct?). So I think my question is: to what extent can data be imputed, and what would you advise doing with variables for which a large part of the data is missing? Would you describe them in your paper, but with the note that the data were derived from only a small portion of the study population?

Thank you for your help in advance!

Kind regards,



Less is more. Stay pure. Stay poor.
Which variables are needed to answer your question, and how much data is missing for those variables? Also, what is the EM mean? Something from expectation maximization?
My primary question was what the efficacy was of a certain injection treatment in two groups with different causes for the same disease. That way I wanted to compare whether the efficacy was different between the groups. In order to answer this question we used two variables: a pain score (Visual Analogue Scale, 0-10; before treatment and after to find a change in pain score) and a yes/no question for the patient whether they considered the injection to be successful.

For the latter variable: 7/149 were missing in one group and 10/337 in the other group
For the pain score variable before treatment: 89/142 and 172/337 were missing
For the pain score after treatment: 119/149 and 245/337 were missing
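To put those counts on a common scale, here is a minimal sketch that turns them into missing fractions per group. The counts and denominators are copied exactly as posted above (including the 142 for baseline pain in the first group); the dictionary keys and the function name are just illustrative labels.

```python
# Missing counts as posted: (number missing, group total) for group 1 and group 2.
reported = {
    "patient-rated success (yes/no)": [(7, 149), (10, 337)],
    "pain score before treatment": [(89, 142), (172, 337)],
    "pain score after treatment": [(119, 149), (245, 337)],
}


def missing_fraction(missing, total):
    """Return the fraction of cases with a missing value."""
    return missing / total


# Fraction missing per variable, in the same group order as above.
fractions = {
    name: [missing_fraction(missing, total) for missing, total in groups]
    for name, groups in reported.items()
}

for name, (group_1, group_2) in fractions.items():
    print(f"{name}: group 1 {group_1:.1%}, group 2 {group_2:.1%}")
```

The yes/no outcome is missing for only a few percent of patients in either group, while the post-treatment pain score is missing for well over two thirds of both groups.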

So, there's actually a very large proportion of the population that lacked data on a validated pain score scale. If available such a scale would give you more information on the effect of the injection, but since it is not readily available we considered assessing this question using the yes/no answers from the patients.

EM is expectation maximization indeed!


How many patients do you have complete data on across variables?

For the data to be MAR, you need to be able to explain the missingness. So do you have data on why the values are missing?

Side note: why do you think the data are missing? Were they simply not documented? If so, and the missingness is random, no imputation is needed. If the data are missing for a reason, imputation needs to occur; otherwise you risk selection bias. With missing data and no placebo or randomization, you have a strong risk of bias.
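One rough first look at whether missingness is related to something observed is to compare the proportion of missing baseline pain scores between the two groups. A minimal sketch, using the counts posted earlier in the thread (89/142 and 172/337, denominators copied as posted); the function name is made up for illustration, and a normal-approximation two-proportion test is only one possible choice.

```python
import math


def two_proportion_z(missing_1, n_1, missing_2, n_2):
    """Two-sided z-test for a difference in missingness proportions."""
    p_1 = missing_1 / n_1
    p_2 = missing_2 / n_2
    # Pooled proportion under the null hypothesis of equal missingness.
    pooled = (missing_1 + missing_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Baseline pain score: missing counts per group as posted in the thread.
z, p_value = two_proportion_z(89, 142, 172, 337)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

A clear difference would suggest that missingness depends on group, so it is not completely random. It cannot, by itself, tell you whether the missingness also depends on the unobserved pain scores.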

I am glad to see that you realize the need to control for baseline pain!
Big thank you for your help!

So I have a total of 486 patients for which I have (complete or incomplete) data.
For most baseline characteristics I have 461 cases for which I have complete data. If I include baseline pain score I only have complete data for 223 cases.
There are a few secondary baseline characteristics for which I do not have complete data for all cases. For instance, one of the baseline variables is the use of pain medication (yes/no). For this variable we have complete data for 461 patients.
A second variable considers 4 categories of pain drugs (paracetamol, NSAIDs, opioids and anti-epileptic drugs). However, some patients told the physician that they used two types of pain drugs but could only remember the name of one. Therefore, we know that they use pain medication but do not know exactly which type, and so we have missing data for this secondary variable. But I am not sure if I should drop cases listwise taking this variable into account as well, since I consider this secondary variable to be less important.

We have several outcome variables: pain scores before and after treatment, whether the patient considered the treatment to be effective in relieving pain 8 weeks after treatment (yes/no), whether the patient received additional injections (yes/no), whether the patient considered the treatment to be effective in relieving pain 16 weeks after treatment (yes/no), whether the patient received surgery at any moment after our treatment and what the eventual patient outcome was after all treatments (3 point Likert scale, unsatisfactory, satisfactory or good).
I only have complete data for 85 patients on all these variables. However, if I do not consider the pain scores (which were usually not documented unfortunately) I have complete patient data for 378 cases.

Thus, I have complete data (most relevant baseline characteristics and patient outcome variables) for 354 cases. This is however without pain scores which is a pity as the use of a standardized pain scale would be a huge strength for my paper.

I am not sure whether to call it missing data or not, but we retrospectively reviewed patients through their electronic medical record from the hospital and if data was 'missing' it means that it was not documented in the record.


I would think pain score is very important. Was the treatment randomized, or did everyone get it? Let me rephrase your issue: say you have a weight loss intervention, you don't know the patients' initial weight, and you don't randomize the intervention. One group may have all of the people with a BMI of 25-30 and the other has the obese, etc. Comparing outcomes between the groups would be very misleading, since the one group will only lose a couple of pounds because they are marginally overweight.

Do you think you can impute the baseline pain score using baseline covariates? Also, it may be trickier than you think, since VAS isn't categorical or continuous, but ordinal, right? Perhaps you run the model with and without the VAS variable (including its imputed values) and see how sensitive the results are to its inclusion.
Yes, I fully agree that pain score is important. The treatment was not randomized. It's a cohort study, so every patient received the treatment. We have formed two groups that we compare based on the cause of their symptoms. Our hypothesis was that the treatment efficacy is not different between those groups. But you need comparable groups at baseline of course to be able to make a solid comparison. However, I am doubting whether I should only use patients with a VAS score at baseline which will halve my population or to use all cases for which I have data on the effect of the treatment (yes/no question and Likert scale). This latter option would give a larger population to be used for analysis.

I might possibly be able to impute the baseline pain score using baseline covariates. For all patients we have age, gender, duration of symptoms, level of injection (not expected to influence imputation), cause of symptoms on MRI and history of back surgery. For 94.9% of these cases I also have data whether they used pain medication which I think can really help to impute the baseline pain score. However, that would mean that I would be imputing the baseline pain score for more than half of my patients.

And you are completely right, it's an ordinal scale. Is this imputation possible using SPSS software?
I applied multiple imputation using the imputation feature in SPSS. I am not completely sure how to analyze the new data set though. When performing imputation one needs to choose the number of imputations for a missing value. It is set at 5 by default. This means my data set (i.e. population size) becomes 6 times as large as originally. Is one supposed to use all the data for further analysis? Or could I calculate the mean of those 5 values and use that mean to replace the missing value in my original data set and thereby keeping my original population size?

If I only create 1 'extra' data set and use those imputed values to replace the missing values (and then delete the imputed data thus keeping my original population size but now without missing values) and consequently compare the mean baseline VAS, the mean VAS is lower with the imputed values than without the imputed values (mean±SD: 8.04±1.297 for group 1 and 7.85±1.459 for group 2 vs. 7.6±1.779 for group 1 and 7.13±1.833 for group 2).


You are supposed to run the analyses on each individual imputed set, then pool the results for the 5 models. Doing this accounts for the uncertainty in the imputation process. You could probably bump the number of imputations up to 10. It may also be of interest to randomly remove some observed initial pain scores and impute them, as a quality check on how good a job the imputation does.
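The pooling step is commonly done with Rubin's rules: average the per-dataset estimates, then combine the average within-imputation variance with the between-imputation variance. A minimal sketch under that assumption; the function name and the example numbers are hypothetical and are not taken from the poster's data.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total, math.sqrt(total)


# Hypothetical estimates and squared standard errors from 5 imputed datasets.
example_estimates = [0.42, 0.51, 0.47, 0.39, 0.55]
example_variances = [0.010, 0.012, 0.011, 0.009, 0.013]

estimate, total_variance, standard_error = pool_rubin(example_estimates, example_variances)
print(f"pooled estimate = {estimate:.3f}, pooled SE = {standard_error:.3f}")
```

The between-imputation term is what is lost if the 5 imputed values are averaged into a single filled-in dataset: the analysis then treats the imputed values as if they had been observed, and the standard errors come out too small.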
Thank you for your instructions, you are amazing!! This has been incredibly helpful!

I used the multiple imputation feature that's built into SPSS. However, I have one variable that gives me some concerns. We noted whether patients used pain medication, but for some patients it was only documented that they were on pain medication, not specifically what type of medication (acetaminophen, NSAIDs, opioids, etc.). If the type of pain medication was specified, we reported this. However, if SPSS imputes a missing value for whether the patient was on pain meds and it gives a zero (no medication), it MUST also give zeroes for all three types of pain meds. Unfortunately, I cannot find a way to give the system this restriction. Thus, for some patients for whom the system predicted that they used no pain medication, the system also predicted that they used an opioid, which is obviously contradictory. Do you know how to deal with this problem?


Well, which do you respect more? I would imagine that if it said the patient was not on any medication, and you had sufficient data for this imputation, then you should respect that and make the latter not applicable, or go back and remove the latter. If someone entered "I don't have a job" and also reported a 50K salary, I would say the salary part should be removed, but you know your data better than I do.
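If you decide to trust the imputed "any pain medication" value over the imputed type values, one way to apply that rule is a post-processing step run on every imputed dataset before analysis. A minimal sketch, assuming 0/1 coding; the function and field names (`pain_med`, `paracetamol`, `nsaid`, `opioid`) are hypothetical stand-ins for whatever the variables are called in your file.

```python
def enforce_skip_logic(record, parent="pain_med",
                       children=("paracetamol", "nsaid", "opioid")):
    """Return a copy of the record in which no medication type is set
    when the parent variable says the patient used no pain medication."""
    fixed = dict(record)
    if fixed.get(parent) == 0:
        for child in children:
            fixed[child] = 0
    return fixed


# Hypothetical imputed rows: the first one is contradictory (no meds, but opioid = 1).
imputed_rows = [
    {"pain_med": 0, "paracetamol": 0, "nsaid": 0, "opioid": 1},
    {"pain_med": 1, "paracetamol": 1, "nsaid": 0, "opioid": 0},
]

cleaned_rows = [enforce_skip_logic(row) for row in imputed_rows]
for row in cleaned_rows:
    print(row)
```

This only removes the contradiction after the fact. It does not make the imputation model itself aware of the restriction, so it is worth checking how many rows it changes in each imputed dataset.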