missing data/multiple imputations


Fortran must die
They often talk about whether the mechanism that generates missing data is "ignorable." I don't really know what this means in practice. Also, how do you reasonably know whether the process generating missing data is MNAR or MAR? I have no idea what the underlying process would be in our data [for part of it, counselors fail to fill in data that is required; in other cases, respondents chose not to answer some questions].
I'm not sure what they mean by "ignorable," other than whether it's a big deal or not. The key difference is whether it can be assumed that the missing data arise from a systematic bias that can affect the data we do have.

For example, let's say you have a survey where you ask for people's income. You can expect some (about 1 in 5 in my experience) to refuse to answer. Why would they refuse? Could it be because they have a low income and are worried about being negatively perceived? Could it be because they have a high income and don't want to attract attention? Or could it be because they don't think it's any of our business?

If we can rationalize that the answer is yes for either of the first two questions, then we say that the data is not missing at random, because there is an inherent bias. If instead you think only the third question's answer is yes, then you can rationalize that this is a belief that could be independent of their actual income levels, and the data is therefore truly missing at random.

How we determine that is where it gets tricky, and as far as I know, involves a lot of hand-waving. It's like trying to understand dark matter - you don't know what you don't know, so at best you present a hypothesis and your rationale for that hypothesis, and it's for others to judge whether that is reasonable.
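To make the income example concrete, here is a small simulation (plain Python, all numbers made up for illustration): under an MCAR-style refusal, the complete cases still estimate the true mean, while under the "embarrassed low earners" mechanism the complete-case mean is biased.

```python
import random
import statistics

random.seed(42)

# Hypothetical incomes (in $1000s); the distribution is invented.
incomes = [random.gauss(50, 15) for _ in range(10_000)]
true_mean = statistics.mean(incomes)

# MCAR: every respondent refuses with the same probability (0.2),
# regardless of income -- like the "none of your business" refusal.
mcar_observed = [x for x in incomes if random.random() > 0.2]

# MNAR: low earners are more likely to refuse -- the "embarrassed" refusal.
# The probability of answering rises with income.
mnar_observed = [x for x in incomes if random.random() < min(1.0, x / 80)]

print(f"true mean:          {true_mean:.1f}")
print(f"complete-case MCAR: {statistics.mean(mcar_observed):.1f}")  # close to true mean
print(f"complete-case MNAR: {statistics.mean(mnar_observed):.1f}")  # biased upward
```

The catch, of course, is that with real data you only see the observed column, so you cannot tell from the data alone which of the two deletion rules produced it.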


Fortran must die
When you have data that is missing and not monotone [which I am told is the norm], you have two options. One is to make it monotone [SAS uses Markov chain Monte Carlo to do this] and then use monotone approaches. The other is to use one of the missing-data methods that work with arbitrary missing data, and not make the data monotone.

Which is the best option?
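For what it's worth, "monotone" has a checkable definition: the variables can be ordered so that once a value is missing for a subject, everything later in the ordering is missing too (like dropout over waves of a survey). A rough sketch in Python rather than SAS, with toy data and a made-up function name:

```python
def is_monotone(rows, variables):
    """Check whether a missingness pattern is monotone: the variables can be
    ordered so that whenever a value is missing, every later variable in the
    ordering is also missing (None marks a missing value)."""
    # If a monotone ordering exists, sorting by number of missing values finds
    # one (a variable that is missing more often must come later).
    order = sorted(variables, key=lambda v: sum(row[v] is None for row in rows))
    for row in rows:
        seen_missing = False
        for v in order:
            if row[v] is None:
                seen_missing = True
            elif seen_missing:  # an observed value after a missing one
                return False
    return True

# Toy data as dicts; None = missing.
monotone_rows = [
    {"x1": 1, "x2": 2, "x3": 3},
    {"x1": 1, "x2": 2, "x3": None},
    {"x1": 1, "x2": None, "x3": None},
]
arbitrary_rows = [
    {"x1": 1, "x2": None, "x3": 3},   # x2 missing but x3 observed
    {"x1": None, "x2": 2, "x3": 3},
]
print(is_monotone(monotone_rows, ["x1", "x2", "x3"]))   # True
print(is_monotone(arbitrary_rows, ["x1", "x2", "x3"]))  # False
```

When the check fails, that is the "arbitrary" pattern, and you face exactly the choice above: impute just enough values (e.g. via MCMC) to make it monotone, or use a method that handles arbitrary patterns directly.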

One common question about imputation is whether the dependent variable should be included in the imputation model. The answer is yes. If the dependent variable is not included in the imputation model, the imputed values will not have the same relationship to the dependent variable that the observed values do. In more practical terms, if the dependent variable is not included in the imputation model, you may be artificially reducing the strength of the relationship between the independent and dependent variables. After the imputations have been created, the issue of how to treat imputed values of the dependent variable becomes more nuanced. If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information and actually introduces additional error (von Hippel 2007).
I don't understand the last two sentences. Any suggestions?


Less is more. Stay pure. Stay poor.
This is always an interesting topic, in which I am not well versed. My field (epidemiology) publishes quite a bit on these approaches and techniques (see American Journal of Epidemiology, Epidemiology, and International Journal of Epidemiology), but I have not had the opportunity to really learn the core or finer details. At a conference I attended last year there was a panel on missing data (I will assume for this discussion, MAR). A person in the audience asked what to do if a lot of the data were missing, above and beyond say 10%. The presenters (mostly a bunch of Ivy League leaders in analytics) stated that in the imputation process you use anything and everything. So even variables not in your model - all generally relevant data, even if only a small portion of data for a variable is available - use it.

I guess what I am trying to allude to is that if you only use the variables in your final model, which already have incomplete data, then you may be really restricting the process.


Fortran must die
That actually is another debate related to what I posted [in the same section, but I did not raise it to keep things simple].

I will commonly use this for our own surveys. So the number of variables will be small, and all logically included in the analysis. I don't know enough yet to form an opinion on what you noted, hlsmith.

I already have a long document on this (14 pages, single-spaced) and am just really getting started. If you want the document when it is done (it will be in SAS primarily, of course) I will be happy to send it. It might be a while.

A basic question I have: since most datasets have missing records, and most who study this agree that results from data with missing values are biased unless you are very lucky and the data is MCAR (which it probably won't be), are any analyses we do valid?

That is a painful thought for me. :(


Less is more. Stay pure. Stay poor.
Thanks for the offer - yes, I would be interested in seeing your notes when they are better figured out. Unfortunately, researchers in my area typically run the data all ways (without imputation, full imputation, etc.) and then call it a sensitivity analysis: how sensitive are the results when different approaches are used? Though this may not be an approach for you.


Fortran must die
Well, that is sensitivity analysis :p but it seems like a lot of work. And you really don't know which result is "right" :)

The real problem, I am afraid, is that if I did that, my (non-quantitative) superiors would 1) get bored and 2) assume I had no idea what I was doing and lose interest in the analysis.


Less is more. Stay pure. Stay poor.
Agreed. I just looked down at the journals on my desk and saw:

Am J Epidemiol. 2014;180(9):920-932. It may have some good references to help guide you.


Fortran must die
So are all the analyses run without MI wrong enough to matter? Because I am guessing that is most research in the social sciences (and everything I ever did).

In no stats class I ever had (and that is four graduate programs including one master's in Measurement and Statistics) was missing data ever raised.


Fortran must die
This is the best site I have found so far on this topic and I strongly recommend it.


I have a question on this point.

Third, many statistical programs assume the multivariate normal distribution when constructing l(θ|Y). Violation of this multivariate normality assumption may cause convergence problems for EM, and also for other ML-based methods, such as FIML.
Commonly you add auxiliary variables to FIML. From the same link:

Because FIML assumes MAR, adding auxiliary variables to a fitted model is beneficial to data analysis in terms of bias and efficiency ( Graham 2003; Section titled The Imputation Model). Collins et al. (2001) showed that auxiliary variables are especially helpful when (1) missing rate is high (i.e., > 50%), and/or (2) the auxiliary variable is at least moderately correlated (i.e., Pearson’s r > .4) with either the variable containing missing data or the variable causing missingness.
My questions are: 1) does the multivariate normality apply just to your analysis model [which commonly won't include auxiliary variables] or to the analysis and auxiliary variables together? Also 2) when you run FIML, do you really leave the auxiliary variables in the model? Generally you don't want extra variables in a regression model, for theoretical and practical reasons. In MI, if I understand this correctly, you include the auxiliary variables in the first stage when you generate the multiple imputations, but remove them in the second stage when you estimate parameters with your chosen method (say regression or ANOVA).

Although I am not sure of that last point either. :p


Less is more. Stay pure. Stay poor.
l(θ|Y) - not my area; what does the first character, right before "(", represent?

I agree with your last statement. Don't include them in the final model unless they are relevant. Side note again: not an expert in this area.


Fortran must die
I probably should make that disclaimer in every post of mine :p

It is the likelihood - l(θ|Y) is the likelihood of the parameters θ given the data Y. In practice it does not matter; the only important thing in that sentence is the requirement of multivariate normality.

Which brings me to this comment:

We replicate the multiple imputation example from the book, section 6.5. In that example, we used the mcmc statement for imputation: at the time, this was the only method available in SAS when a non-monotonic missingness pattern was present. We noted at the time that this was not "strictly appropriate" since mcmc method assumes multivariate normality, and two of our missing variables were dichotomous.
I don't understand why it matters if two variables are dichotomous, because for example dummy variables don't eliminate multivariate normality automatically [unless they are the DV].

Hi- I am not a statistician, but this is my understanding of multiple imputation.

Firstly, I would regard missing data and how you deal with it as crucial!

MCAR (missing completely at random) = the data really are missing at random. A relatively straightforward way of thinking about this is as a random sample of the complete data. For example, imagine that for every value you rolled a die, and if it came up 6 you deleted that value.

Missing at random - this is confusing because it does not really mean missing due to a random process. It really means that the missingness may depend on variables that are observed. In other words, based on the data you have, you can make predictions about what the missing data would have been. MAR allows an "ignorable" likelihood-based analysis. (AKA ignorable: “Given MAR, a valid analysis can be obtained through a likelihood-based analysis that ignores the missing value mechanism, provided the parameters describing the measurement process are functionally independent of the parameters describing the missing process.”) From the data alone you will not be able to distinguish between missing at random and missing not at random mechanisms.

If I were you, I would do as you have done and assume that the data are missing at random, and then do a sensitivity analysis to check whether the results differ.

Some of the questions that have been completed may well predict the incomplete answers. Sensitivity analysis is not actually a lot more work, because something like SPSS will do a complete-case analysis on the original data as well as analyse the imputed datasets and pool the results.

“If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information and actually introduces additional error (von Hippel 2007).”
Let's say you are trying to predict what type of car someone is going to buy, and you know that the predictors are height, age, gender, and socio-economic class. If you have missing data on socio-economic class, you should not just use height, age, and gender when imputing the missing data. You should also use other data that you have collected even if it is not in your "model", because it may be important in imputing the missing data, even if it is not related to the outcome. As stated by hlsmith, pretty much use all variables to impute missing data.

If you do MI in SPSS it will run linear and logistic regressions on the variables (dichotomous variables are fine). In the final model you should only use those variables that are relevant to your outcome. This may include variables that were complete and were used in the MI model. All imputed datasets should be analysed, and the results pooled as per Rubin's method:
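Rubin's pooling rules themselves are simple arithmetic. A minimal sketch (Python, with made-up coefficient estimates and squared standard errors from m = 5 imputed datasets):

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules:
    point estimate = mean of the m estimates; total variance = within-imputation
    variance W plus (1 + 1/m) times the between-imputation variance B."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)   # pooled point estimate
    w = statistics.mean(variances)       # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance
    t = w + (1 + 1 / m) * b              # total variance
    return q_bar, t ** 0.5               # estimate and pooled SE

# Hypothetical regression coefficients and squared SEs from 5 imputations.
est = [0.52, 0.48, 0.55, 0.50, 0.45]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
q, se = pool_rubin(est, var)
print(f"pooled estimate = {q:.3f}, pooled SE = {se:.3f}")
```

Note that the pooled SE is larger than any single imputation's SE would suggest: the between-imputation spread is the extra uncertainty due to the missing data.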


Hope that is of some help!


Fortran must die
MAR is required by most methods designed to deal with missing data. But, with rare exceptions, there are no tests to determine whether the data is MAR or MNAR. Most likely you would have to know why the responder did not respond to tell :p

So in the end you hope that the data is MAR.


Less is more. Stay pure. Stay poor.
Updates, per my post #7, I saw this link:


However, I am not sure how this applies to arbitrary missingness of categorical data, which I currently have. Also, after reading this thread I noticed it lacks any reference to Little's test for MCAR (which SAS has a macro for), or to following that procedure up with dummy-coded missingness logistic regressions to see whether dataset variables predict the probability of missingness.


Fortran must die
Sensitivity analysis is recommended for MNAR data, but it really does not correct for it. It just shows what problems might result if you have it. Comments I have read in various sources suggest that tests for MNAR exist, but in most cases they won't really tell you whether the data is MNAR.

As is always true, I have collected a tome on this topic from online sources. When I have some more time I will post some of the links.


Fortran must die
This is for multiple imputations....

"The MCMC algorithm makes the assumption the underlying variables in the imputation model are distributed as a multivariate normal random variable....In the case of continuous variables that are highly skewed or otherwise non-normal in distribution, PROC MI currently enables the user to specify transformations..."
So why would univariate non-normality automatically lead to multivariate non-normality and why would a univariate transformation deal with this problem (multivariate non-normality) if it in fact existed?


Less is more. Stay pure. Stay poor.
I believe combining normal with normal results in a normal, but this might not be the case with other combinations.


Ambassador to the humans
If you have multivariate normal then any marginal distribution will be normal as well. So if you don't have a marginal (univariate) normal then you CANNOT have multivariate normal.

The transformations don't guarantee that you'll end up with normality but sometimes give you something close to normal.
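Dason's point is easy to demonstrate numerically (a Python sketch with made-up numbers): a bivariate normal pair has normal marginals with skewness near zero, and once you replace one coordinate with a skewed transform, the pair can no longer be multivariate normal.

```python
import math
import random
import statistics

random.seed(3)

def skewness(xs):
    """Sample skewness: the third standardized moment."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return statistics.mean(((x - mu) / sd) ** 3 for x in xs)

# Build a correlated pair (z1, z2) that IS bivariate normal.
z1 = [random.gauss(0, 1) for _ in range(20_000)]
z2 = [0.6 * a + 0.8 * random.gauss(0, 1) for a in z1]

# Both marginals are normal, so their skewness is near 0.
print(f"skew(z1) = {skewness(z1):.2f}, skew(z2) = {skewness(z2):.2f}")

# Replace z2 with a skewed transform: its marginal is lognormal, so the
# pair (z1, exp(z2)) cannot be multivariate normal.
w = [math.exp(b) for b in z2]
print(f"skew(exp(z2)) = {skewness(w):.2f}")
```

This is also why a log transform can help before MCMC imputation: it maps the skewed marginal back to something normal, removing the obvious violation, even though normal marginals alone still do not guarantee joint normality.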