
For example, let's say you have a survey where you ask for people's income. You can expect some (about 1 in 5 in my experience) to refuse to answer. Why would they refuse? Could it be because they have a low income and are worried about being negatively perceived? Could it be because they have a high income and don't want to attract attention? Or could it be because they don't think it's any of our business?

If we can rationalize that the answer is yes for either of the first two questions, then we say that the data is not missing at random, because there is an inherent bias. If instead you think only the third question's answer is yes, then you can rationalize that this is a belief that could be independent of their actual income levels, and the data is therefore truly missing at random.

How we determine that is where it gets tricky, and as far as I know, involves a lot of hand-waving. It's like trying to understand dark matter - you don't know what you don't know, so at best you present a hypothesis and your rationale for that hypothesis, and it's for others to judge whether that is reasonable.

Which is the best option?

One common question about imputation is whether the dependent variable should be included in the imputation model. The answer is yes. If the dependent variable is not included in the imputation model, the imputed values will not have the same relationship to the dependent variable that the observed values do. In more practical terms, if the dependent variable is not included in the imputation model, you may be artificially reducing the strength of the relationship between the independent and dependent variables. After the imputations have been created, the issue of how to treat imputed values of the dependent variable becomes more nuanced. If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information and actually introduces additional error (von Hippel 2007).
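
To make this concrete, here is a minimal PROC MI sketch (the data set and variable names are hypothetical) in which the dependent variable y is listed in the VAR statement alongside the incomplete predictors:

proc mi data=mydata nimpute=20 out=mi_out seed=20170101;
  /* y is the analysis outcome; listing it here lets the imputed
     predictor values preserve their relationships with y */
  var y x1 x2 x3;
run;

The analysis model is then fit to mi_out as usual; whether to keep or discard the imputed values of y at that stage is the von Hippel point above.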

I guess what I am trying to allude to is that if you only use the variables in your final model, which already have incomplete data, then you may really be restricting the process.

I will commonly use this for our own surveys. So the number of variables will be small and all logically included in the analysis. I don't know enough yet to form an opinion on what you noted, hlsmith.

I already have a long document on this (14 pages, single spaced) and am just really getting started. If you want the document when it is done (it will be in SAS primarily, of course) I will be happy to send it. It might be a while.

A basic question I have: since most data sets have missing records, and most who study this agree that results from missing data are biased unless you are very lucky and the data is MCAR (which it probably won't be), is any analysis we do valid?

That is a painful thought for me.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3701793/

I have a question on this point.

Third, many statistical programs assume the multivariate normal distribution when constructing l(θ|Y). Violation of this multivariate normality assumption may cause convergence problems for EM, and also for other ML-based methods, such as FIML.

Because FIML assumes MAR, adding auxiliary variables to a fitted model is beneficial to data analysis in terms of bias and efficiency (Graham 2003; see the section titled The Imputation Model). Collins et al. (2001) showed that auxiliary variables are especially helpful when (1) the missing rate is high (i.e., > 50%), and/or (2) the auxiliary variable is at least moderately correlated (i.e., Pearson's r > .4) with either the variable containing missing data or the variable causing missingness.
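
In SAS terms, adding an auxiliary variable simply means listing it in the VAR statement of PROC MI even though it will not appear in the analysis model. A sketch with made-up names (aux1 stands in for the auxiliary variable):

proc mi data=mydata nimpute=20 out=mi_out seed=1234;
  /* aux1 is correlated with income (which has missing values) but is
     dropped from the analysis model fitted afterwards */
  var income age education aux1;
run;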

Although I am not sure of that last point either.

It is a parameter; I forget which one exactly. In practice it does not matter: the only important thing in that sentence is the requirement of multivariate normality.

Which brings me to this comment:

We replicate the multiple imputation example from the book, section 6.5. In that example, we used the mcmc statement for imputation: at the time, this was the only method available in SAS when a non-monotonic missingness pattern was present. We noted at the time that this was not "strictly appropriate" since the mcmc method assumes multivariate normality, and two of our missing variables were dichotomous.

http://www.r-bloggers.com/example-9-4-new-stuff-in-sas-9-3-mi-fcs/
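
For reference, the FCS approach that post describes looks roughly like this (variable names are hypothetical; LOGISTIC handles the dichotomous variables, so multivariate normality is not assumed for them):

proc mi data=mydata nimpute=20 out=mi_fcs seed=42;
  class female smoker;                     /* dichotomous variables */
  fcs logistic(female smoker) reg(income); /* method per variable   */
  var female smoker income age;
run;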

Firstly, I would regard missing data and how you deal with it as crucial!

MCAR (missing completely at random) = the data is truly missing at random. A relatively straightforward way of thinking about this is as a random sample of the complete data. For example, imagine that for every value you rolled a die, and if it came up 6 you deleted that value.
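
The die-roll idea as a toy SAS data step (the data set full is hypothetical; each income value is deleted with probability 1/6, independently of everything in the data, which is MCAR):

data mcar;
  set full;
  /* 'roll a die' for each record: a six deletes the value */
  if ranuni(123) < 1/6 then income = .;
run;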

Missing at random (MAR) - this is confusing because it does not really mean missing due to a random process. It really means that the missingness may depend on variables that are observed. In other words, based on the data you have, you can make predictions about what the missing data would have been. MAR is the condition under which the missingness mechanism is "ignorable" for likelihood-based methods. (AKA ignorable: "Given MAR, a valid analysis can be obtained through a likelihood-based analysis that ignores the missing value mechanism, provided the parameters describing the measurement process are functionally independent of the parameters describing the missing process.") From the data alone you will not be able to distinguish between missing at random and missing not at random mechanisms.
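
By contrast, a MAR version of the same toy example would let the deletion probability depend on an observed variable, say a fully observed age:

data mar;
  set full;
  /* the chance that income is missing rises with observed age,
     so missingness depends only on observed data: MAR, not MCAR */
  if ranuni(456) < 0.05 + 0.004*age then income = .;
run;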

If I were you, I would do as you have done and make the assumption that the data is missing at random, and then do a sensitivity analysis to check whether it makes a difference.

Some of the questions that have been completed may well predict the incomplete answers. Sensitivity analysis is not actually a lot more work, because something like SPSS will do a complete case analysis on the original data as well as analyse the imputed data sets and pool the results too.

“If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information and actually introduces additional error (von Hippel 2007).”

Let's say you are trying to predict what type of car someone is going to buy, and you know that the predictors are height, age, gender, and socio-economic class. If you have missing data on socio-economic class, you should not just use height, age and gender when imputing the missing data. You should also use other data you have collected, even if it is not in your "model", because it may be important in imputing the missing data even if it is not related to the outcome. As stated by hlsmith, pretty much use all variables to impute missing data.

If you do MI in SPSS it will run linear and logistic regression on the variables (dichotomous variables are fine). In the final model you should only use those variables that are relevant to your outcome. This may include variables that were complete and were used in the MI model. All imputed datasets should be analysed, and the results pooled as per Rubin's method:

http://sites.stat.psu.edu/~jls/mifaq.html#howto
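
In SAS the pooling step looks roughly like this (a sketch with hypothetical names): fit the analysis model to each imputed data set, then let PROC MIANALYZE combine the estimates using Rubin's rules.

proc reg data=mi_out outest=est covout noprint;
  by _imputation_;           /* one fit per imputed data set */
  model y = x1 x2 x3;
run;

proc mianalyze data=est;
  modeleffects intercept x1 x2 x3;
run;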

Hope that is of some help!

http://statisticalhorizons.com/sensitivity-analysis

However, I am not sure how this applies to arbitrary missingness of categorical data, which I currently have. Also, after reading this thread I noticed it lacks any reference to Little's test for MCAR (which SAS has a macro for), or to following that procedure up with dummy-coded missingness logistic regressions to see whether the dataset's variables predict the probability of missingness.
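
The dummy-coded missingness regression is straightforward to sketch (hypothetical names again): flag whether income is missing and check whether observed variables predict that flag. Strong predictors are evidence against MCAR.

data miss_check;
  set mydata;
  miss_income = missing(income);   /* 1 if income is missing, else 0 */
run;

proc logistic data=miss_check descending;
  /* do observed variables predict the probability of missingness? */
  model miss_income = age education gender_d;
run;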

As is always true, I have collected a tome on this topic from online sources. When I have some more time I will post some of the links.

"The MCMC algorithm makes the assumption the underlying variables in the imputation model are distributed as a multivariate normal random variable....In the case of continuous variables that are highly skewed or otherwise non-normal in distribution, PROC MI currently enables the user to specify transformations..."

The transformations don't guarantee that you'll end up with normality but sometimes give you something close to normal.
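
In PROC MI that looks like the following sketch (hypothetical names): income is imputed on the log scale, and PROC MI back-transforms the imputed values to the original scale.

proc mi data=mydata nimpute=20 out=mi_out seed=99;
  transform log(income);   /* impute the skewed variable on the log scale */
  mcmc;
  var income age x1;
run;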