# Thread: Dealing with violations of the assumptions.

1. ## Dealing with violations of the assumptions.

I thought I had this covered in detail, but if so I apparently lost it.

When you discover you have non-normal residuals the first suggestion is commonly logging the Y and if that does not work squaring the Y. What do you do if that does not work (in the context of linear regression)? I know there are non-parametric approaches and robust regression. I am trying to find out what you can do if you want to stick to linear regression.

Similarly with unequal error variance. Is there any type of test other than inspecting the residuals to detect this? I have seen two recommendations to deal with this. First, is a transformation which is simplest. But say that does not work. Several robust standard errors have been suggested (e.g., White's). Does anyone have any preference for any of these?

2. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
When you discover you have non-normal residuals the first suggestion is commonly logging the Y
Is it? It seems like this would be helpful if the errors have a lognormal distribution, but I can't see why this would be a general suggestion. Which sources are you getting this idea from?

If the problem is just with non-normal errors, bootstrapping seems like a good choice if you want to stick with the basic linear regression framework. You can also think about whether you need to do anything at all about the non-normal errors. I know we've said this tons of times, but the coefficients remain unbiased, consistent, and efficient (BLUE at least) if errors are non-normal but the other assumptions are met. The sampling distribution of the coefficients won't be exactly normal if the errors aren't normal, which could theoretically muck up significance tests and confidence intervals, but it will converge to a normal distribution with larger sample sizes.

First, is a transformation which is simplest.
Simple to do, but makes interpretation hard. E.g., How do you interpret a regression coefficient after a square root transformation of the Y?

4. ## Re: Dealing with violations of the assumptions.

It depends on why the residuals are non-normal. Is it due to trying to fit a nonlinear relationship with linear regression? Or due to variances changing with respect to the magnitude of the independent variable? Or with variances changing over time due to a latent variable? The approach taken for each would be different.

In the first scenario, a transformation chosen from Tukey's ladder of powers may work. In the second scenario, a logarithmic transformation may stabilize the variances, and in the third, adding the latent variable to the model may be necessary.

6. ## Re: Dealing with violations of the assumptions.

Don't forget about transforming the independent variables.

CB, are you referring to bootstrapping the sample, rerunning the model on each resample, and then using the 95% percentile CI from the bootstrap distribution?

7. ## Re: Dealing with violations of the assumptions.

Originally Posted by hlsmith
CB, are you referring to bootstrapping the sample, rerunning the model on each resample, and then using the 95% percentile CI from the bootstrap distribution?
Something like that. I didn't really have a specific bootstrap method in mind. The percentile method keeps things simple but there are other ways to do it: http://en.wikipedia.org/wiki/Bootstr...8statistics%29
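
For anyone who wants to see the percentile method spelled out, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the data are simulated, the helper names `fit_line` and `bootstrap_slope_ci` are made up for this example, and it resamples (x, y) pairs, which is one of several possible bootstrap schemes.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def bootstrap_slope_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):  # degenerate resample: slope undefined, redraw
            continue
        ys = [y[i] for i in idx]
        slopes.append(fit_line(xs, ys)[1])
    slopes.sort()
    alpha = 1 - level
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Simulated (hypothetical) data: true slope 2, skewed exponential errors
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1 + 2 * xi + rng.expovariate(1.0) - 1 for xi in x]

slope = fit_line(x, y)[1]
ci_low, ci_high = bootstrap_slope_ci(x, y)
print(f"slope = {slope:.3f}, 95% percentile CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The interval is just the 2.5th and 97.5th percentiles of the resampled slopes, so no normality assumption about the errors is used when forming it.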

8. ## Re: Dealing with violations of the assumptions.

CBW when I say common I meant that this is what I have always seen recommended in the links, classes, and other sources I have encountered. I have not seen bootstrapping recommended probably because (like me) most don't know how to do that. I agree that interpretation is difficult with transformations. When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those who I run the data for want to know, if the results are significant or not. The literature I have seen in the social sciences stresses the same, this variable was significant or was not. So the test, not the effect size, is the driving force. Which variable is relatively more important is critical to me, but slopes don't tell you that (as dason has reminded me on more than one occasion).

Miner I almost never work with data where there is theory (or theory I have found to date). So knowing if there is a latent variable would be difficult. I was curious how you would know what was the cause of the non-normality.

I was wondering, since this got no comments, if anyone has worked with White's SE.

9. ## Re: Dealing with violations of the assumptions.

noetsi,

Yes, I believe I have seen White's sandwich estimators mentioned. I have also seen them used when you may not be able to easily address dependence between observations.

10. ## Re: Dealing with violations of the assumptions.

I had thought that SAS only did White in PROC MODEL, which is not ideal for me, not only because it uses very different code than I am used to, but because it assumes a non-linear model. However, it turns out that you can do these in PROC REG and PROC LOGISTIC by specifying ACOV. If you specify SPEC it performs a test that the errors are homoscedastic, which I think is the White test [the SAS documentation is a bit unclear here].

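    ODS GRAPHICS ON;
    PROC REG DATA=WORK.test2 PLOTS(ONLY)=ALL;
        /* ACOV: heteroscedasticity-consistent covariance matrix of the estimates */
        /* SPEC: test of first and second moment specification                   */
        /* VIF:  variance inflation factors                                      */
        Linear_Regression_Model: MODEL WEEKLYEARNINGS_CLO = WEEKLYEARNINGS_ACC
            / SELECTION=NONE ACOV VIF SPEC;
    RUN;
    ODS GRAPHICS OFF;
    QUIT;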

In the output, SAS labels the White test as the "Test of First and Second Moment Specification".
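
To make the idea behind White's standard errors concrete, here is a rough sketch in plain Python for the one-predictor case. It is not a reproduction of what SAS does internally. It computes the simplest sandwich form (often called HC0) for the slope, the sum of (x - mean(x))^2 * e^2 divided by Sxx^2, next to the classical standard error. The function name `ols_with_white_se` and the simulated data are assumptions made for illustration.

```python
import math
import random

def ols_with_white_se(x, y):
    """Fit y = a + b*x by OLS; return (b, classical SE of b, White HC0 SE of b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance, estimated by SSE / (n - 2)
    se_classic = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0 sandwich: each squared residual stands in for that point's own variance
    meat = sum(((xi - mx) ** 2) * e * e for xi, e in zip(x, resid))
    se_white = math.sqrt(meat) / sxx
    return b, se_classic, se_white

# Simulated (hypothetical) data where the error SD grows with x
rng = random.Random(2)
x = [rng.uniform(1, 10) for _ in range(500)]
y = [3 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

b, se_classic, se_white = ols_with_white_se(x, y)
print(f"slope = {b:.3f}, classical SE = {se_classic:.4f}, White SE = {se_white:.4f}")
```

The slope estimate itself is unchanged. Only the standard error, and therefore the test and confidence interval, is computed differently.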

11. ## Re: Dealing with violations of the assumptions.

Yeah, I think you can use versions in mixed models as well. I think there is a straight-up "/ white" option in a procedure or at least versions of Robust estimators are available.

Why would robust estimators be used in logistic regression: because of dependence, or just preference?

12. ## Re: Dealing with violations of the assumptions.

I have read in a variety of sources that logistic regression does not require equal error variance. But then I ran into this comment in a statistics class:

Warning: Heteroskedasticity can be very problematic with methods besides OLS. For example, in logistic regression heteroskedasticity can produce biased and misleading parameter estimates.
I was wondering if maybe this was really dealing with unobserved heterogeneity, although that seems different.

13. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
Miner I almost never work with data where there is theory (or theory I have found to date). So knowing if there is a latent variable would be difficult. I was curious how you would know what was the cause of the non-normality.
I plot the residuals both in time sequence, if known, and by fits. Heteroskedasticity will show up in the latter plot, and if the time sequence shows a shift or a trend, it is a sure sign of a latent variable. However, I do have an advantage working in industrial statistics because the time sequence is usually known and there is usually a physical basis for theory. It is also usually relatively quick and easy to run a confirmation experiment on any predictions.
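
As a numerical companion to the residuals-by-fits plot, one option is a Breusch-Pagan style check in Koenker's studentized form: regress the squared residuals on the predictor and compare n times the R-squared of that auxiliary regression with a chi-square distribution (1 degree of freedom with a single predictor, 5% critical value about 3.84). The sketch below is plain Python on simulated data. The helper names are made up for this example. With one predictor the fitted values are a linear function of x, so regressing on x is equivalent to regressing on the fits.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def squared_correlation(u, v):
    """Squared Pearson correlation, equal to R^2 of a one-predictor regression."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv * suv / (suu * svv)

def koenker_bp_statistic(x, y):
    """n * R^2 from regressing squared OLS residuals on x."""
    a, b = fit_line(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * squared_correlation(x, sq_resid)

# Simulated (hypothetical) data: constant error SD versus SD growing with x
rng = random.Random(3)
x = [rng.uniform(1, 10) for _ in range(500)]
y_homo = [3 + 0.5 * xi + rng.gauss(0, 1.0) for xi in x]
y_hetero = [3 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

stat_homo = koenker_bp_statistic(x, y_homo)
stat_hetero = koenker_bp_statistic(x, y_hetero)
print(f"constant variance: {stat_homo:.2f}, growing variance: {stat_hetero:.2f}")
```

A large statistic suggests the residual spread changes with x. It says nothing about shifts or trends over time, which still need the time-sequence plot.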

15. ## Re: Dealing with violations of the assumptions.

I am not sure what the equivalent test would be for the data I run (which is not an industrial process).

I am not sure what you mean by plotting the residuals in both time sequences or by fits?

16. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
CBW when I say common I meant that this is what I have always seen recommended in the links, classes, and other sources I have encountered. I have not seen bootstrapping recommended probably because (like me) most don't know how to do that
Bootstrapping is pretty well known in the social sciences, e.g., look how many hits come up from a Google Scholar search for bootstrapping psychology.

Implementing bootstrapping would've been hard once upon a time, but for something simple like a regression we're talking clicking a button in SPSS, or a few lines of code in R or SAS.

I am not saying that a log transformation can never be useful, but it is only useful in a very restricted circumstance: i.e., when the errors are approximately lognormally distributed, so that they become approximately normal on the log scale.

When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those who I run the data for want to know, if the results are significant or not. The literature I have seen in the social sciences stresses the same, this variable was significant or was not.
I feel like you have a tendency to see something stated in a few sources and then make a premature conclusion about the consensus in the field. There is a huge literature in the social sciences concerned with the limitations of significance testing, and the difference between practical and statistical significance.

Have a look at the results for:

or

On a purely common sense level, do you really not care at all about the size of the relationships between variables? If so, why would you possibly care about whether or not those relationships are nonzero? That is all that statistical significance testing (attempts to) show you.

Which variable is relatively more important is critical to me, but slopes don't tell you that (as dason has reminded me on more than one occasion).
I guess it depends on your definition of "important", but in general the slope tells you a lot more than the p value about the practical importance of the relationship.

EDIT: Sorry if the above seems grumbly, not my intent. I haven't had enough coffee yet.

17. ## Re: Dealing with violations of the assumptions.

Not a problem. I have not done research for many years. I did earn a master's in measurement and statistics (albeit in education) recently, and I do spend significant amounts of time looking at the literature on methods (I have hundreds of pages of typed notes on these topics, and one of these is driving this question). What counts as "a few" is subject to debate, but I have looked at what seems to me to be quite a lot of sources. Psychology, like economics, is a branch of the social sciences that is far more concerned with methods than the fields I have worked in (notably administration, but others as well).

I understand the issue of substantive versus statistical significance. Explaining it to a non-statistical audience is not simple (and in a PhD in public management and several master's I have never seen this topic raised in any of the main journals in my fields).

I will work on learning bootstrapping. I worked in simulation recently in SAS, can't be that much tougher right?

18. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
I understand the issue of substantive versus statistical significance. Explaining it to a non-statistical audience is not simple
Sure, communicating what relationship size might be practically significant is difficult, but I can practically guarantee you that the people you're talking to don't understand what statistical significance means; it's a complicated concept. If the argument is to keep reporting simple and easy to understand, then there's really no place for significance testing.

I will work on learning bootstrapping. I worked in simulation recently in SAS, can't be that much tougher right?
Yep, you'll be fine.