I am working as part of a team on a relatively large dataset which has been subject to imputation analysis.
One of my colleagues has pointed out that, when carrying out regressions that produce odds ratios/event ratios etc., we first need to have the event rates for each variable in integer form. However, because of the imputation analysis we have been left with non-integer data for many variables. For example, our imputed data shows that 72.1/401 of our participants were in a relationship.
My questions are as follows:
1. Is it indeed the case that event rate/frequencies need to be in integer form before regression analysis can be run?
2. If this is the case, can we (either manually or through the use of an SPSS function) adjust the data to correct this issue?
I am aware that there is a function for doing this as part of the imputation analysis; however, re-running the imputation analysis is something we would like to avoid. Also, in case it's relevant, it's primarily logistic regressions that we will be running, though we may use linear regression as well.
Finally, I’d just like to point out that I’m not particularly comfortable imputing categorical variables such as the one mentioned above, but this was a decision made by others in my team.
Thanks
Michael
What will this variable be in your regression model, an independent variable? Also, I am not seeing how the above example data is an imputed categorical variable?
Stop cowardice, ban guns!
123Michael456 (07-13-2015)
Thanks for your response hlsmith
I intend to run a lot of different regressions but I'll give you an example of one.
Overall we are interested in the impact of particular variables on the life outcomes of individuals with severe learning difficulties. So in one case we will look at the impact that age, gender, geographic location, and relationship status (IVs) have on employment status (DV). This will be a logistic regression where relationship status breaks down into not in a relationship (0) and in a relationship (1), and employment status similarly breaks down into unemployed (0) and employed (1).
In relation to your second point - the number of individuals in a relationship was initially 59/401, and as a result of multiple imputation, this figure rose to 72.1. If it is useful, here is the type of imputation that was used - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
Thanks again
So in your example you imputed a binary variable (i.e., 0 or 1), so you should not have a value of 72.1. Perhaps if you averaged all of the imputed sets for that binary variable you could get that value. However, you will run the logistic regression on each of the imputed sets (e.g., m = 20); then, after running the 20 logistic regression models, you will pool all of the results together and report those. For descriptive stats you could use a percentage, but that is not what you will actually be running inferential stats with.
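To make that concrete, here is a minimal sketch (the per-dataset counts are invented for illustration) of where a figure like 72.1 can come from: it is just the average of the integer counts across the imputed datasets, used descriptively only.

```python
# Hypothetical sketch: each imputed dataset contains only 0/1 values,
# so the count of "in a relationship" is an integer in every dataset.
# The counts below are invented for illustration (10 imputations).
counts = [70, 73, 71, 74, 72, 74, 71, 73, 72, 71]  # count out of 401 per dataset

# Averaging the counts across the imputed datasets gives a non-integer
# figure like 72.1 -- a descriptive summary, never fed into the regression.
pooled_count = sum(counts) / len(counts)
print(pooled_count)  # 72.1
```

Each individual dataset that the regression actually sees still holds only whole 0/1 values.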
Quote from your referenced paper:
Once the data have been imputed, each imputed dataset is “complete” in the sense that it has no missing values (except those missing by design). Analyzing multiply imputed data involves two steps: 1) running a standard analysis (e.g., regression) on each of the imputed datasets, and 2) combining the estimates from each dataset to obtain the final result. The variance estimates calculated in Step 2 involve both the “within” variance calculated for each dataset individually, as well as the “between” variance that reflects the uncertainty in the imputations—how variable the results are across the imputed datasets. The formulas for combining coefficients and estimates in Step 2 are provided in Schafer and Graham (2002). While it is possible to write a short computer program to do the combining, many standard statistical software packages include procedures to combine results across datasets automatically. Thus, from the user’s perspective doing these two steps and obtaining the final estimates is often no more complicated than running a single regression in a single dataset.
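The combining in Step 2 is usually done with Rubin's rules. As a rough illustration (not the paper's code; the per-dataset estimates below are invented), pooling a single coefficient across m imputed-data analyses looks like this:

```python
import math
from statistics import mean, variance

# Invented results from m = 5 logistic regressions, one fitted
# on each imputed dataset (Step 1 of the quoted procedure).
betas = [0.52, 0.48, 0.55, 0.50, 0.45]     # log-odds coefficient per dataset
ses = [0.200, 0.205, 0.198, 0.202, 0.207]  # standard error per dataset
m = len(betas)

# Step 2: combine the estimates (Rubin's rules).
pooled_beta = mean(betas)                  # pooled point estimate
W = mean(se ** 2 for se in ses)            # "within" variance (avg squared SE)
B = variance(betas)                        # "between" variance across datasets
T = W + (1 + 1 / m) * B                    # total variance

pooled_se = math.sqrt(T)
pooled_or = math.exp(pooled_beta)          # report on the odds-ratio scale
print(pooled_beta, pooled_se, pooled_or)
```

This mirrors the quote: the total variance combines the "within" piece from each analysis with the "between" piece that measures how much the estimates disagree across imputations.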
Thanks again hlsmith.
In response to your last two responses: yes, this is the process that was followed. So the figure I quoted (72.1) was actually a pooled frequency figure; I should have explained that part earlier, sorry. The problem my colleague had with this is that he argued that since these frequencies (essentially event occurrences) inform the regression analysis, it would be problematic if they were not whole numbers.
Why would it be problematic to use the imputed dataset for the inferential statistics?
Again, your help is much appreciated
I think you may be missing a basic concept here (or perhaps I am not understanding your description), so I will go slowly, on the assumption that non-integers are not actually being used!
You imputed multiple datasets correct?
So if you open each of the 20 datasets and examine them individually, you have no percentage counts for employment status; all data should look like 0/1. Correct?
Now you run a unique logistic regression model on each dataset, so no percentages ever go into the logistic model. Your question is moot.
You subsequently pool the 20 logistic model results. So if you get an odds ratio, it will have a straightforward interpretation (e.g., unemployed subjects have 2.5 times greater odds of outcome X than employed subjects).
Does this clear things up? Your program may kick out a pooled percentage of employment across the 20 datasets (71.6%); however, that number is never used in the calculations.
Yes, the decimal is irrelevant in the logistic procedure.
No, you don't pool the data; you pool the results from the 20 logistic regressions. If you think about it, you would be inflating power if you pooled the datasets: instead of having, say, a sample size of 100, you would have a sample size of 2,000 if you pooled the samples (and higher degrees of freedom as well).
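The power-inflation point is simple arithmetic (a sketch using the n = 100, m = 20 figures from the example):

```python
n = 100  # subjects per imputed dataset (example figure)
m = 20   # number of imputations

# Wrong: stacking the 20 datasets into one file multiplies the apparent
# sample size, overstating precision and degrees of freedom.
stacked_n = n * m
print(stacked_n)  # 2000

# Right: each of the 20 analyses uses n subjects; only the 20 sets of
# *results* are pooled, so the effective sample size stays at n.
per_analysis_n = n
```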
Keep asking questions, I don't mind answering them, since I did this same thing for the first time about a month ago.
That's excellent, thanks. I thought that would be the case, but I think my colleague had confused me a bit on the matter, so thanks for taking the time to talk it through.
Also, I did mean the pooled results from the regression, just used the terminology incorrectly, apologies.
Well, thank you very much for your time and your advice. If I can repay you in any way then please let me know.
Thanks again
Much appreciated!
Michael
I hope the project goes well!
Now go tell your colleague to quit confusing everybody.