ANOVA/Regression

#1
Hi all,

First timer here. I am hoping this is the right forum and that someone could point me in the right direction.

I am analyzing a dataset with over 4,500 observations. Almost all the independent variables are indicator variables. It so happens that for some combinations of levels there is no recorded response. For example, a cross tabulation of Organic against Preservative gives me:

Code:
Rows: Organic   Columns: Preservative

          1     2   All

1       158     0   158
2      1042  2333  3375
3         0  1065  1065
4        60    14    74
All    1260  3412  4672
As you can see, I have 0 responses for 2 of the combinations. Because of this, I cannot run an ANOVA or a regression model that contains the interaction term between organic and preservative. It is important for our research to look at the interaction terms!
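To make the problem concrete, here is a minimal sketch in plain Python that counts how many interaction parameters the observed cells can support. The counts are copied from the cross tabulation; the function name `interaction_summary` is made up for illustration, and the arithmetic assumes the non-empty cells form a connected design, so that all main effects remain estimable.

```python
# Cell counts copied from the Organic x Preservative cross tabulation.
# Keys are (organic level, preservative level).
counts = {
    (1, 1): 158,  (1, 2): 0,
    (2, 1): 1042, (2, 2): 2333,
    (3, 1): 0,    (3, 2): 1065,
    (4, 1): 60,   (4, 2): 14,
}


def interaction_summary(counts):
    """Compare the interaction parameters a full model wants with what the data support.

    Assumption: the non-empty cells are connected, so the intercept and all
    main effects are estimable and only interaction terms are lost.
    """
    organic_levels = {organic for organic, _ in counts}
    preservative_levels = {preservative for _, preservative in counts}

    empty_cells = sorted(cell for cell, n in counts.items() if n == 0)
    filled_cells = len(counts) - len(empty_cells)

    # Intercept plus one dummy per non-reference level of each factor.
    main_effect_params = 1 + (len(organic_levels) - 1) + (len(preservative_levels) - 1)
    # A full interaction needs (a - 1) * (b - 1) extra parameters.
    wanted = (len(organic_levels) - 1) * (len(preservative_levels) - 1)
    # Each filled cell supplies at most one estimable cell mean.
    estimable = filled_cells - main_effect_params

    return {
        "empty_cells": empty_cells,
        "interaction_params_wanted": wanted,
        "interaction_params_estimable": estimable,
    }


summary = interaction_summary(counts)
print(summary)
```

Under those assumptions, the full interaction asks for 3 parameters but the 6 non-empty cells leave room for only 1, which is why software either drops terms or refuses to fit the model.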

So any advice on how to go about it? My stats prof said there may be ways to add additional constraints to make the model run, but I wasn't able to find any. Alternatively, I am also open to running a different statistical analysis that includes the interaction term.

Any help would be appreciated :) Thanks, in advance.
 

ted00

New Member
#2
By 4500 points, do you mean that's the sample size? How many independent variables? By the way, is this a food safety problem? Options include re-categorizing the variables to fewer levels. I know of models that use empirical Bayesian methods for data similar to this, where there are many 0-cells; the idea is that the problems are mitigated by a "pooling" of information across all cells ... some buy it more than others, though, is all I can say about that. Is the dependent variable also a count, or continuous?
 
#3
Thanks for getting back to me! Yes, 4500 is the sample size. And it is related to the food industry, though it isn't a food safety issue. I have about 9 independent variables, all qualitative, like organic, preservative, sweetener etc.
The dependent variable is time, e.g. cook time!

I have very limited knowledge of Bayesian analysis, but I would like to mention that I have pooled as much as possible. For example, from having 4 different types of organic stuff, I pooled it down to 2 levels. Similarly, I pooled the size of fruits from 9 levels to 4. Is that what you meant by Bayesian analysis? Pooling beyond this may lead to a loss of information that is crucial for our study.

Thoughts?
 

ted00

New Member
#4
nah, I was talking about this

Did pooling to fewer levels eliminate the zeros? If predicting cooking time is the primary interest, rather than, say, testing factors, I think I'd try leaving the factor levels as they were originally (with the zeros); the model will still fit.
 
#5
Ah yes! Pooling did eliminate the zeros, and you are right, the model still fits if I don't pool. What I cannot estimate without pooling, however, are certain interaction terms like organic*preservative.

So I'm guessing the answer is as simple as pooling them? Well, thank you :)
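For anyone checking a pooling scheme before refitting, a rough sketch in plain Python follows. The thread does not say which organic levels were merged, so the mapping `pooling` (levels 1 and 2 into "A", levels 3 and 4 into "B") is purely hypothetical, as is the helper name `pool_counts`; the counts are the ones from the cross tabulation in post #1.

```python
# Cell counts from the Organic x Preservative cross tabulation in post #1.
counts = {
    (1, 1): 158,  (1, 2): 0,
    (2, 1): 1042, (2, 2): 2333,
    (3, 1): 0,    (3, 2): 1065,
    (4, 1): 60,   (4, 2): 14,
}

# HYPOTHETICAL mapping: the thread does not state which levels were merged.
pooling = {1: "A", 2: "A", 3: "B", 4: "B"}


def pool_counts(counts, mapping):
    """Re-tabulate the cell counts after merging organic levels via `mapping`."""
    pooled = {}
    for (organic, preservative), n in counts.items():
        key = (mapping[organic], preservative)
        pooled[key] = pooled.get(key, 0) + n
    return pooled


pooled = pool_counts(counts, pooling)
still_empty = sorted(cell for cell, n in pooled.items() if n == 0)

for cell in sorted(pooled):
    print(cell, pooled[cell])
print("empty cells after pooling:", still_empty)
```

Not every pooling removes the zeros. Leaving organic level 1 on its own, for instance, keeps its empty cell with preservative 2, so it is worth re-tabulating after each candidate merge.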
 

maartenbuis

TS Contributor
#6
The zero observations on such a large dataset might mean that they are structural zeros. An example is pregnant males. I can see how the combination of organic and preservative could be considered as mutually exclusive. If that is the logic used when collecting the data, then it is logically impossible for such an interaction effect to exist, and it is hard (this is an understatement) to estimate something that does not exist. The solution is not to look for ever more complicated techniques but to take a step back and question your data and the research question again.
 
#8
That makes sense! Thank you :)