- Thread starter knedlica
- Tags anova interaction effects non-normal distribution

How many observations do you have in total and in each of the cells? I suggest you give a few values of “n” for some of the 48 cells.

Even if you had perfectly normally distributed random error terms (which are almost the same thing as “residuals”) but an imbalanced design (an unequal “n” in each cell), it is questionable whether it is meaningful to estimate interactions.
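To make the request concrete, here is a minimal sketch of how the per-cell counts could be tabulated and checked for balance. The factor levels and the handful of rows are hypothetical, invented only for illustration; the real design has 48 cells.

```python
from collections import Counter
from itertools import product

# Hypothetical observations: (level of A, level of B, level of C).
# These rows are invented for illustration, not the thread starter's data.
observations = [
    ("a1", "b1", "c1"), ("a1", "b1", "c1"), ("a1", "b1", "c2"),
    ("a1", "b2", "c1"), ("a2", "b1", "c1"), ("a2", "b1", "c2"),
    ("a2", "b2", "c2"), ("a2", "b2", "c2"), ("a2", "b2", "c2"),
]

def cell_counts(rows):
    """Return n for every cell of the full factorial, including empty cells."""
    counts = Counter(rows)
    levels = [sorted({row[i] for row in rows}) for i in range(len(rows[0]))]
    return {cell: counts.get(cell, 0) for cell in product(*levels)}

def is_balanced(counts):
    """A design is balanced when every cell holds the same number of observations."""
    return len(set(counts.values())) == 1

counts = cell_counts(observations)
for cell, n in counts.items():
    print(cell, n)
print("balanced:", is_balanced(counts))
```

Empty cells show up with n = 0, which is worth knowing before trying to estimate any interaction that involves them.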

Maybe someone else has some comments on this?

Is it the same data set you are using in this thread:

http://www.talkstats.com/showthread...rdinal-variables-on-binary-dependant-variable

Is that another project, another variable, or did you reformulate the problem of what would be the dependent variable?

I have the impression that different users have different opinions about this. I don't want to suggest arbitrary recommendations since I have not seen the data.

So it would be nice if someone else has some suggestions about this.

If it is exactly balanced it will be “orthogonal”, so including or excluding an interaction term will not influence the estimates of the remaining terms.
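A small sketch of what “orthogonal” means here, under simplifying assumptions that are mine and not from the thread: a 2x2 design instead of the real one, and +1/-1 effect coding. When the cell sizes are equal, the columns for A, B and AxB have zero dot products with each other; with unequal cell sizes they generally do not, which is why the estimates then depend on which terms are in the model.

```python
from itertools import product

def effect_columns(rows):
    """Effect-code (+1/-1) a 2x2 design: columns for A, B and the AxB product."""
    a = [1 if row[0] == "a1" else -1 for row in rows]
    b = [1 if row[1] == "b1" else -1 for row in rows]
    ab = [x * y for x, y in zip(a, b)]
    return a, b, ab

def dot(u, v):
    """Plain dot product of two equally long lists."""
    return sum(x * y for x, y in zip(u, v))

def design_rows(n_per_cell):
    """Expand a {cell: n} mapping into one row per observation."""
    return [cell for cell, n in n_per_cell.items() for _ in range(n)]

cells = list(product(["a1", "a2"], ["b1", "b2"]))
balanced = design_rows({cell: 5 for cell in cells})
# Hypothetical unequal cell sizes, chosen only to show the effect.
unbalanced = design_rows(dict(zip(cells, [8, 2, 3, 7])))

for name, rows in [("balanced", balanced), ("unbalanced", unbalanced)]:
    a, b, ab = effect_columns(rows)
    print(name, "A.B =", dot(a, b), " A.AB =", dot(a, ab), " B.AB =", dot(b, ab))
```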

Another point is about the plot you provided. What does your residual plot for your full model look like?

But, more to the heart of the matter... we don't know what your data look like, so, as Greta says, I think many of us will be cautious about giving specific recommendations. That said, I think most of us agree that models with three-way interactions are a bit tricky. They have to be based on huge sample sizes, but even then interpretation can be difficult. Though I don't entirely recommend this, one option is to drop the highest-order interaction from the model (the three-way term) and then conduct a model selection approach on all lower-order models, with the following as your most parameterized (or global) model:

A + B + C + AxB + AxC + BxC

And then these models:

A + B + C + AxC + BxC

A + B + C + AxB + BxC

A + B + C + AxB + AxC

A + B + C + BxC

A + B + C + AxC

A + B + C + AxB

A + B + C

B + C + BxC

B + C

A + C + AxC

A + C

A + B + AxB

A + B

C

B

A

(null)

Inference, I think, would be a bit easier and would still be strong. Of course, there are trade-offs with this approach. The only major downside I see is that model selection approaches are generally not philosophically compatible with a designed experiment. But I'm not sure if your factorial design was an experiment or observational study, etc. People, of course, have different opinions on this, so take what I say with a few handfuls of salt.
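For anyone who wants to generate such a candidate set rather than type it out, here is a minimal sketch. It assumes three factors named A, B and C, no three-way term, and the rule that an interaction is only allowed when both of its main effects are in the model; the "AxB" spelling just mirrors the notation used above.

```python
from itertools import combinations

FACTORS = ("A", "B", "C")

def powerset(items):
    """All subsets of items, from the empty set to the full set."""
    items = list(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)

def hierarchical_models(factors=FACTORS):
    """Yield every model with main effects and two-way interactions in which
    an interaction appears only if both of its main effects are present."""
    for mains in powerset(factors):
        eligible_pairs = list(combinations(mains, 2))
        for pairs in powerset(eligible_pairs):
            terms = list(mains) + [f"{x}x{y}" for x, y in pairs]
            yield " + ".join(terms) if terms else "(null)"

models = list(hierarchical_models())
for formula in models:
    print(formula)
print(len(models), "candidate models")
```

With three factors this gives 18 candidates, from the null model up to the global model with all three two-way interactions.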

I started asking about sample size because an imbalanced model can be viewed as “not-acceptable” if the imbalance is severe, even if the data is normally distributed.

If you estimate a full model with 48 parameters there is plenty of room to make the residuals look normal, since least squares is the maximum likelihood estimator when the errors are normally distributed.

“The interactions are, btw, all non-significant.”

“I wonder what would happen if i combined some groups together thus decreasing the number of cells and enlargening the sample sizes in each cell.”

Some people want to start with the full model, from the top of jpkelly's (great) list. Others want to start with just the main effects, from the bottom of jpkelly's list.

One possibility is to include significant terms (if you start from the bottom) or to drop non-significant effects (if you start from the top). Then you can draw the normal QQ-plot for the 400 residuals and check whether the points fall on a straight line. If they do, the residuals are approximately normally distributed.
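A rough sketch of the QQ idea without any plotting library: pair each sorted residual with the normal quantile expected at its rank, then see how close the pairs are to a straight line. The plotting position (i - 0.5) / n and the use of a correlation as a summary are my choices for illustration, and the residuals are made up.

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Pair each sorted residual with the standard-normal quantile expected
    at its rank, using the plotting position (i - 0.5) / n."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def qq_correlation(residuals):
    """Pearson correlation of the QQ points; values near 1 mean the points
    lie close to a straight line."""
    points = normal_qq_points(residuals)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up residuals for illustration only.
example = [-1.8, -1.1, -0.7, -0.4, -0.1, 0.0, 0.2, 0.5, 0.9, 1.3, 2.0]
print(round(qq_correlation(example), 3))
```

The correlation is only a numerical summary; looking at the points themselves is still the better way to see where the departure from the line occurs, for example in one tail.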

Having said that, I don't want to come with an arbitrary suggestion.

But I believe that knedlica has said that the data are still non-normal. Then what remains is a link function to some other distribution, or a normalising transformation. And what would be wrong with that?

Stepwise model selection is one route (either from the simplest or from the most complex--global--model). This isn't my favorite approach (either addition or elimination), since I've found that it tends to encourage people to fudge a bit with where to stop their process ("oh, that's p=0.055...that's good enough, I think"). The list I provided was meant to be submitted to a model selection approach using an information criterion like AIC. This allows you to throw all the candidate models into one pot and let the [AIC] math do the work of choosing the model. Then, given the output, you can commence with model averaging of the top models.

Greta is right that the issue remains that the data are still non-normal. For the unbalanced design and the non-normal data, I would encourage you to delve into the realm of generalized linear (or non-linear) mixed-effect models. These can manage unbalanced designs fairly well (to a certain limit) and allow specification of whatever link function is appropriate for the distribution of your response data. You still might need a data transformation, but I wouldn't do this before you try one of the link functions. Unless your data are really funky (zero-inflated, or whatnot), you should be golden.

I like AIC = Akaike Information Criterion, which essentially makes a trade-off between the fit (log-likelihood) and the number of parameters p {AIC = -2*(log(L) - p), where L is the likelihood and p is the number of estimated parameters}.
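A short sketch of that formula in use. The log-likelihoods and parameter counts below are hypothetical numbers invented for illustration, and the Akaike weights are one common way (my assumption about what was meant) to turn AIC differences into the relative weights used for model averaging.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = -2 * (log L - p), equivalently -2 * log L + 2 * p."""
    return -2.0 * (log_likelihood - n_params)

def akaike_weights(aic_values):
    """Convert AIC values into relative weights that sum to 1."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical (log-likelihood, number of parameters) pairs, not real results.
candidates = {
    "A + B + C": (-612.4, 4),
    "A + B + C + AxB": (-611.9, 5),
    "A + B + C + AxB + AxC + BxC": (-610.8, 7),
}
scores = {name: aic(ll, p) for name, (ll, p) in candidates.items()}
weights = akaike_weights(list(scores.values()))
for (name, score), weight in zip(scores.items(), weights):
    print(f"{name}: AIC = {score:.1f}, weight = {weight:.2f}")
```

A lower AIC is better: an extra parameter has to raise the log-likelihood by more than 1 to pay for itself.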

I would still like to have a hierarchy of models. At least the main effects (A, B or C) should be tested to see if they should be included. And if any higher-order interaction effect is included, then its corresponding main effects should be included, even if they are non-significant.

It would be a very strange model if it only included A+ B*C. If the interaction B*C is included then the main effects B+ C should also be included.
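The rule can be written as a tiny check. This is only a sketch; the term notation ("BxC" for an interaction, single letters for factors) is hypothetical and chosen to match the lists earlier in the thread.

```python
def missing_main_effects(terms):
    """Return the main effects that some interaction needs but the model lacks.
    Terms are strings such as "A" or "BxC"; factor names must not contain 'x'."""
    present = set(terms)
    missing = set()
    for term in terms:
        if "x" in term:
            missing.update(part for part in term.split("x") if part not in present)
    return sorted(missing)

print(missing_main_effects(["A", "BxC"]))            # ['B', 'C']
print(missing_main_effects(["A", "B", "C", "BxC"]))  # []
```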

This is in my humble opinion. It would be interesting to hear jpkelly's and others' views.

Still, we haven't seen the data. This is a little bit like sitting on the beach practising swimming. I would like to jump into the water!

Who knows, maybe these are state secrets of the CIA or KGB (sic!). But it would be easier if the participants showed their data, in an easy-to-read format. Then we could discuss the actual data and not practice dry swimming.

1) I am new to residuals and Q-Q plots so I'm not sure if I did this correctly, but I calculated the residuals for the whole sample using the formula residuals = X - M(X) (the Q-Q plot is shown below). (There are only 7 values for the dependent variable; that is why there are so few dots there, I guess.) And from what I can see the distribution is almost normal except for the right-end part, which suggests a slightly negatively asymmetrical distribution (the dependent variable is the number of symptoms presented after the treatment (0-6), so this makes sense to me). What I want to know is: do I have enough arguments for using parametric tests? I compared one-way ANOVAs with the Kruskal-Wallis and Mann-Whitney U tests and they give the same results.

2) You both made some interesting points regarding the use of stepwise procedures. I know that they are used in regression models; is that what you were talking about, or can they be used in ANOVAs as well? And if so, can that be done in SPSS automatically like it can for regression models, or do I have to do it myself? And while we are on the subject, a more general question: can I detect mediation effects using ANOVA like I can using hierarchical regression? Because when I use a one-way ANOVA for my independent variable (age) with 6 groups it's significant, but when I put it in a factorial model with the other independent variable (length of treatment), the treatment variable is significant but the age variable is no longer significant. Is this normal, and can I conclude that age doesn't have an effect on the number of symptoms but that it nearly affects the length of the treatment (mediator variable), which in turn affects the number of symptoms?

3) It's not a CIA project ^^ I just don't know how to present the data to you in a simple and not time-consuming way.