# How do I deal with my data after multiple imputation cycles?

#### Paige

##### New Member
I have a data set with 40 continuous variables, each with around 30% missing observations.

I imputed the missing data using predictive mean matching through the mice package in R, running 5 cycles for each imputation.
My question is: how do I choose which cycle to use for the whole dataset?

I know that for each variable the imputed mean should be as close as possible to the original mean. But say I choose the 5th iteration for one variable because its new mean is closest to the original mean; that same iteration may not be the best one to choose for the other 39 variables.

So do I have to run separate imputations for each of the 40 variables on 40 separate datasets, export those datasets, and then combine them all at the end?

Or is it safe to just choose any of the iterations and apply it to all 40 variables in one go?

Disclaimer: I have only very basic stats training and come from the field of Linguistics, not stats, so I'm in way over my head.

Note: I can't pool the data into a model because this specific dataset only contains my dependent variables. (They are in a different format to my independent variables, so I need to first impute the data, then average each variable, input those averages into another dataset with the predictor variables, and finally run a model with those averages and my predictor variables.)

#### hlsmith

##### Less is more. Stay pure. Stay poor.
What statistic are you hoping to eventually calculate?

The traditional approach is to calculate that statistic in each imputed dataset and pool those estimates. That way the results include both the within-dataset and between-dataset variability and aren't overconfident. The pooling approach is usually called Rubin's rules (think of Little and Rubin's missing-data work).
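For intuition, here is a minimal sketch of that pooling step (Rubin's rules) in Python, using made-up per-imputation estimates and squared standard errors. The thread's actual analysis is in R (where `mice::pool()` does this for you); this just shows the arithmetic being described.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m per-imputation results via Rubin's rules.

    estimates: point estimate (e.g. a coefficient) from each imputed dataset
    variances: squared standard error of that estimate in each dataset
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()        # pooled point estimate
    u_bar = variances.mean()        # average within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    t = u_bar + (1 + 1 / m) * b     # total variance (both sources of uncertainty)
    return q_bar, t

# Hypothetical results from m = 5 imputed datasets
est, var = pool_rubin([0.52, 0.48, 0.50, 0.55, 0.45],
                      [0.04, 0.05, 0.04, 0.06, 0.05])
```

Note that the total variance is always at least the average within-imputation variance: the `(1 + 1/m) * b` term is the penalty for the uncertainty introduced by the missing data itself.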

#### Paige

##### New Member
It's a bit complicated because of the way data is set up.

My dependent variables are measures of participants' response times and accuracy when reading different words (20 words in total, so 20 accuracy scores and 20 response time scores), for which I have around 150 observations. This is where the missing data is.

However, my independent variables are measures of word factors, e.g. word length and word frequency. So I have 20 words, each with those variables.

So my end goal is to run a multiple regression to see the predictive value of word factors for response time and accuracy. (E.g. can word length predict variance in response time of that word)

However, this is my issue: I have to first average the 150 observations of response time for each word, and then use this average in my regression model.

So I can't directly pool the imputed data into the model; I need to first average those variables. I'm wondering if this is even possible now, or if I should revert to listwise deletion.
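One way to reconcile this with the standard approach (a sketch only, with invented numbers and variable names, not this thread's actual data): do the per-word averaging separately inside each completed dataset, run the regression on each, and pool only the resulting coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_subj, m = 20, 150, 5

# Predictor measured per word, with no missing data (e.g. word length)
word_length = rng.integers(3, 10, size=n_words).astype(float)

slopes, slope_vars = [], []
for _ in range(m):  # one pass per imputed (completed) dataset
    # Stand-in for one completed dataset: 150 response times per word
    rt = 400 + 25 * word_length + rng.normal(0, 50, size=(n_subj, n_words))
    mean_rt = rt.mean(axis=0)  # average the 150 observations per word

    # Simple regression of mean RT on word length, within this dataset
    X = np.column_stack([np.ones(n_words), word_length])
    beta, res, *_ = np.linalg.lstsq(X, mean_rt, rcond=None)
    sigma2 = res[0] / (n_words - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    slopes.append(beta[1])
    slope_vars.append(cov[1, 1])

# Rubin's rules: pool the per-dataset slopes, never the data
q_bar = np.mean(slopes)
b = np.var(slopes, ddof=1)
t = np.mean(slope_vars) + (1 + 1 / m) * b
```

The averaging step is just part of the analysis you repeat in each imputed dataset, so it doesn't block pooling; it only means the quantity you pool is the slope from the averaged data, not the raw observations.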

#### spunky

##### Can't make spagetti
> So I can't directly pool the imputed data into the model; I need to first average those variables. I'm wondering if this is even possible now, or if I should revert to listwise deletion.
What do you mean by "directly pool the imputed data"? As @hlsmith said, you never, ever, ever, as in NEVER EVER, "pool the data". What you pool are the results from running your analyses on each imputed dataset, according to Rubin's rules for multiple imputation.