ANOVA with multiply imputed data

#1
Hello :)

I used multiple imputation on my data to get a complete data set. I want to do an ANOVA now. Does anybody know how to do that correctly?

SPSS calculates ANOVAs for every single imputation but does not pool the results. Some of my imputed data sets give significant results (e.g. p = 0.04) and some don't (e.g. p = 0.07).

There is some literature on pooling multiply imputed data, but I don't understand it... (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029775/)

Thanks in advance!

Froop
 

hlsmith

Omega Contributor
#2
The process, based on Rubin's approach, is actually much easier than you probably think.


You average the estimates from the per-imputation analyses, and that gives you the pooled estimate across imputations.


Then the SE is built from the within- and between-imputation variances of those analyses. The section in your link about pooling covers this. So the estimate is super easy to get; then you create the SE based on within- and between-imputation variability. This makes the SE a little larger, since it takes into account the slight variability between imputations, which is probability based.
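
In symbols, this is a sketch in the standard Rubin's-rules notation (the same quantities the equations in the linked paper describe): with Q_i the estimate from imputation i and U_i its squared standard error, for i = 1, ..., m,

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} Q_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(Q_i - \bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B,
$$

and the pooled standard error is \(\sqrt{T}\); the \((1 + 1/m)B\) term is what makes it a little larger.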
 

spunky

Smelly poop man with doo doo pants.
#4
SE = standard error?

So I just calculate the average of "everything"?^^

Thanks!
Not everything. The only things you can "average" (as in just taking the mean) are the parameter estimates you obtain (regression coefficients, correlation coefficients, etc.). That would be 'Q' in Eq. (1) of your link. Then you need to calculate the within- and between-imputation variance, get the test statistic manually, etc.

Overall it is quite a drag to do. But I just wanted to remind you that this is not as easy as just "taking the average of everything" or "averaging the data sets and running the analysis on them". It's a little more complicated than that.
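
To make that concrete, here is a minimal by-hand sketch in R, assuming you have already run the analysis on each imputed data set (all the numbers below are invented for illustration):

Code:
# Per-imputation estimates (the Q's) and their standard errors (invented numbers)
q  <- c(0.42, 0.38, 0.45)
se <- c(0.110, 0.115, 0.108)
m  <- length(q)

q_bar <- mean(q)                  # pooled estimate: the only thing you "just average"
u_bar <- mean(se^2)               # within-imputation variance
b     <- var(q)                   # between-imputation variance
t_var <- u_bar + (1 + 1/m) * b    # total variance: note it exceeds u_bar
se_pooled <- sqrt(t_var)          # pooled SE; test statistic is q_bar / se_pooled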
 
#6
Just to be sure, I'll try to make an example:



Original Data:

Participant 1: 5 4 3 -
Participant 2: - 2 1 1
Participant 3: 1 2 4 -


Imputation 1:

Participant 1: 5 4 3 2
Participant 2: 2 2 1 1
Participant 3: 1 2 4 3


Imputation 2:

Participant 1: 5 4 3 1
Participant 2: 2 2 1 1
Participant 3: 1 2 4 1


So I average the imputations:


Participant 1: 5 4 3 1.5
Participant 2: 2 2 1 1
Participant 3: 1 2 4 2


And based on this averaged imputation sheet I calculate the within- and between-imputation variance (by hand)? If this is correct, how can I tell SPSS to average the imputations? (As far as I know, SPSS keeps all imputations separate and only gives pooled results for some calculations, like frequencies, but doesn't pool the data itself.)

Is it better to export my data to Excel to be able to calculate properly, or can SPSS do all the calculations?

Thanks for the help, and sorry for the question, but I am still confused about the topic :confused:
 

hlsmith

Omega Contributor
#7
You average your estimates. I am guessing you don't have that many imputation sets, so just put them in a new data set and ask SPSS to average.


As Spunky reiterated, the variance part requires the formula in the paper. Yes, SE = standard error.
 

spunky

Smelly poop man with doo doo pants.
#8
Just to be sure, I'll try to make an example:



Original Data:

Participant 1: 5 4 3 -
Participant 2: - 2 1 1
Participant 3: 1 2 4 -


Imputation 1:

Participant 1: 5 4 3 2
Participant 2: 2 2 1 1
Participant 3: 1 2 4 3


Imputation 2:

Participant 1: 5 4 3 1
Participant 2: 2 2 1 1
Participant 3: 1 2 4 1


So I average the imputations:


Participant 1: 5 4 3 1.5
Participant 2: 2 2 1 1
Participant 3: 1 2 4 2


And based on this averaged imputation sheet I calculate the within- and between-imputation variance (by hand)? If this is correct, how can I tell SPSS to average the imputations?
NO! This is *exactly* what we warned you not to do! What you should be doing looks more like:


Original Data:

Participant 1: 5 4 3 -
Participant 2: - 2 1 1
Participant 3: 1 2 4 -


Imputation 1: <---- RUN ANOVA HERE, GET PARAMETER ESTIMATES (WE'LL CALL THEM Q1)

Participant 1: 5 4 3 2
Participant 2: 2 2 1 1
Participant 3: 1 2 4 3


Imputation 2: <---- RUN ANOVA HERE, GET PARAMETER ESTIMATES (WE'LL CALL THEM Q2)

Participant 1: 5 4 3 1
Participant 2: 2 2 1 1
Participant 3: 1 2 4 1


Now you have two vectors of parameter estimates, Q1 and Q2. You average Q1 and Q2 to get the parameter estimates on which you will do your hypothesis tests; you pool the variances and standard errors of Q1 and Q2 to get the correct within- and between-imputation variance; and finally you get the F-statistic you want. Everything is done by hand, following Eq. (1)-(6) of the document you attached.

Notice that, as shown in the example of the article you attached, you'll need to reframe the ANOVA as a multiple regression, so you'll need to ask for the regression equation to get the regression coefficients and R-squared (whose F-test is statistically equivalent to the F-test you get by taking ratios of mean squares).
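
As an R sketch of that Q1/Q2 step, using the toy imputations above (treating the four columns as a 'time' factor is my assumption here, since the post never states the actual design):

Code:
# The two toy imputed data sets from above, stacked long per participant
imp1 <- data.frame(y    = c(5, 4, 3, 2,  2, 2, 1, 1,  1, 2, 4, 3),
                   time = factor(rep(1:4, times = 3)))
imp2 <- data.frame(y    = c(5, 4, 3, 1,  2, 2, 1, 1,  1, 2, 4, 1),
                   time = factor(rep(1:4, times = 3)))

# The ANOVA reframed as a regression, fitted once per imputed data set
fit1 <- lm(y ~ time, data = imp1)     # Q1 is coef(fit1)
fit2 <- lm(y ~ time, data = imp2)     # Q2 is coef(fit2)

# Average the estimates; the variances then get Rubin's rules, not a plain average
Q_bar <- (coef(fit1) + coef(fit2)) / 2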

Here's my honest opinion: if you're dealing with missing data, switch software. SPSS makes things so unnecessarily complicated that it almost makes you wonder why they bothered giving you only half of the missing-data routine.
 
#10
Here's my honest opinion: if you're dealing with missing data, switch software. SPSS makes things so unnecessarily complicated that it almost makes you wonder why they bothered giving you only half of the missing-data routine.
Which one do you prefer, then? Stata? R?

You average your estimates. I am guessing you don't have that many imputation sets, so just put them in a new data set and ask SPSS to average.
Unfortunately, I have 20 imputations :/


Actually, I'm trying to test 2 effects on 3 outcomes for 3 brands (MANOVA), but I think I'll just do several ANOVAs.

If I just run an ANOVA on one of the parts, SPSS gives me the following:
https://www.imageupload.co.uk/image/DFuV


Just post the output for the m analyses and make Spunky do it for you!
That would be the easiest, but it's my exam project, so I have to do it myself ^^
Maybe if I switch to a program that gets it done for me, it will get easier...

Usually I'm not that bad at math, but these Eq. (1)-(6) look kind of difficult... I don't know why... Maybe if someone could give me a small calculation example... I can do the rest then :)


Thanks a lot, guys :)
 

hlsmith

Omega Contributor
#11
I haven't done it with R yet, but I have used SAS (i.e., PROC MIANALYZE) and it is as easy as inputting the values.
 

spunky

Smelly poop man with doo doo pants.
#12
I use the mice package in R. But I know Stata also has good missing-data handling capabilities, so whichever one you think is easier for you, I guess.
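
For what that looks like in practice, a minimal mice sketch (df, the outcome y, and the predictors x1 and x2 are placeholders; m = 20 matches the thread; D1() is the pooled multi-parameter Wald test available in recent mice versions):

Code:
library(mice)

imp  <- mice(df, m = 20, seed = 1)    # 20 imputed data sets, as in the thread
fits <- with(imp, lm(y ~ x1 * x2))    # the regression-framed ANOVA, run per imputation
summary(pool(fits))                   # Rubin's rules: pooled estimates and SEs, done for you
D1(fits)                              # pooled omnibus (Wald) test of the model terms

Here pool() does the Q-averaging and within/between variance pooling discussed above automatically.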

That would be the easiest, but it's my exam project, so I have to do it myself ^^
If this is an exam project, didn't they teach you in school how to do it before they let you do it yourself? I'm just wondering whether you have something in your notes on how to do this stuff, so that you won't need to switch software or anything.
 
#13
If this is an exam project, didn't they teach you in school how to do it before they let you do it yourself? I'm just wondering whether you have something in your notes on how to do this stuff, so that you won't need to switch software or anything.
The answer is always yes, despite what many students tell you, barring any crappy for-profit schools and some community colleges where I have seen this happen. That's the minority, though. On occasion, I've seen professors assign a project with the intention of students completing parts as the material is covered in class.
 

spunky

Smelly poop man with doo doo pants.
#14
The answer is always yes, despite what many students tell you, barring any crappy for-profit schools and some community colleges where I have seen this happen. That's the minority, though. On occasion, I've seen professors assign a project with the intention of students completing parts as the material is covered in class.
Well, when I've taught or TA'd, I've seen one of two things happen, depending on the type of project.

One is you give the students a data set with the issues/kinks covered in class, so you can see if they're able to recognize and address them. The other is you let students do their own project with their own data sets, and then the kinks and peculiarities of the data reveal themselves as the project goes along.

The latter situation is where students may struggle a little more, because you can't possibly cover every single data issue in an introductory class (like how to handle missing data, or what to do if you have a truncated variable, etc.), and they get lost trying to figure things out themselves. So whereas in scenario #1 you just tell the person "go look it up in your notes", in scenario #2, as an instructor, it's more like "wow, good job recognizing this as a problem and trying to fix it yourself".

I tend to work in the latter scenario (people are more interested in analyzing their own data than whatever you can give them), and a lot of the material covered in my classes has changed because of it. But it obviously demands more of you as an instructor, because you need to look after as many data sets as there are people in your class.
 
#15
Well, when I've taught or TA'd, I've seen one of two things happen, depending on the type of project.

One is you give the students a data set with the issues/kinks covered in class, so you can see if they're able to recognize and address them. The other is you let students do their own project with their own data sets, and then the kinks and peculiarities of the data reveal themselves as the project goes along.

The latter situation is where students may struggle a little more, because you can't possibly cover every single data issue in an introductory class (like how to handle missing data, or what to do if you have a truncated variable, etc.), and they get lost trying to figure things out themselves. So whereas in scenario #1 you just tell the person "go look it up in your notes", in scenario #2, as an instructor, it's more like "wow, good job recognizing this as a problem and trying to fix it yourself".

I tend to work in the latter scenario (people are more interested in analyzing their own data than whatever you can give them), and a lot of the material covered in my classes has changed because of it. But it obviously demands more of you as an instructor, because you need to look after as many data sets as there are people in your class.
We gave the illusion of choice in our class. Students could pick any data set they desired, so long as it was from our pool of 4-6 pre-approved sets :D... it helped us focus the scope to what we had taught. Somehow students always came into the TA lab hours saying "We didn't do this in class!" Then, I would show them in their notebook or the course notes where we did it. You're a bit more bold since you let them pick any data set they want. ;)
 
#16
If this is an exam project, didn't they teach you in school how to do it before they let you do it yourself? I'm just wondering whether you have something in your notes on how to do this stuff, so that you won't need to switch software or anything.

It's for my master's thesis, but it's not a "usual" thesis; it's a kind of free research project. We were taught the basics, but I ran into some missing values and used MI, which was not covered in the courses...
 

spunky

Smelly poop man with doo doo pants.
#17
It's for my master's thesis, but it's not a "usual" thesis; it's a kind of free research project. We were taught the basics, but I ran into some missing values and used MI, which was not covered in the courses...
Well... to be honest, if you're not very familiar with a somewhat complex method like multiple imputation, as a professor I'd see more value in you (a) acknowledging that this is a limitation of your research and (b) mentioning that you researched potential solutions. Running multiple imputation is not just a point-and-click type of analysis. You need to check whether your MCMC chains or imputation algorithm converged properly and run some imputation diagnostics. You'll probably want to calculate and report the fraction-of-missing-information statistic and comment on whether or not it could influence your results (i.e., if it is high, then you're mostly modelling the imputation as opposed to the actual data). And that's just off the top of my head; I know there are more, but it's a Saturday here and my brain doesn't want to work.
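
With mice objects, those checks look roughly like this (a sketch; imp and fits are the objects from the mice example earlier in the thread, and the fmi column name is from mice 3.x's pooled output, to the best of my knowledge):

Code:
plot(imp)                 # trace plots of imputed-value means/SDs per iteration: look for mixing, no trend
densityplot(imp)          # compare distributions of imputed vs. observed values
pool(fits)$pooled$fmi     # fraction of missing information per parameter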

So...yeah. I guess the overall recommendation here is to proceed cautiously if you have never learned how to do this.
 

hlsmith

Omega Contributor
#19
Agreed, katxt, if they weren't so darn close to just doing it. They have the formula and everything!


I would wonder what the trade-off would be for "power". Dropping the incomplete cases, their sample size will be smaller, but if they use MI their SEs will be larger. Though it always seems statisticians like to say: do the MI. But knowing the source of the missingness always seems dubious, much like having to specify a model close enough to the data-generating process and its probability distribution.
 
#20
Yes, you don't get anything for nothing. It's hard to see how you can get more accurate results just by making up more data that summarizes the data you already have. With imputation you need to reduce the error df anyway, so your sample size isn't really any smaller with the GLM (I think).