# Dealing with violations of the assumptions

#### noetsi

##### Fortran must die
I thought I had this covered in detail, but if so I apparently lost it.

When you discover you have non-normal residuals, the first suggestion is commonly to log the Y and, if that does not work, to square the Y. What do you do if that does not work (in the context of linear regression)? I know there are non-parametric approaches and robust regression. I am trying to find out what you can do if you want to stick with linear regression.

Similarly with unequal error variance. Is there any type of test, other than inspecting the residuals, to detect this? I have seen two recommendations to deal with it. First is a transformation, which is simplest. But say that does not work. Several robust standard errors have been suggested (e.g., White's). Does anyone have a preference among these?

#### CowboyBear

##### Super Moderator
When you discover you have non-normal residuals, the first suggestion is commonly to log the Y
Is it? It seems like this would be helpful if the errors have a lognormal distribution, but I can't see why this would be a general suggestion. Which sources are you getting this idea from?

If the problem is just with non-normal errors, bootstrapping seems like a good choice if you want to stick with the basic linear regression framework. You can also think about whether you need to do anything at all about the non-normal errors - I know we've said this tons of times, but the coefficients remain unbiased, consistent, and efficient (BLUE at least) if errors are non-normal but the other assumptions are met. The sampling distribution of the coefficients won't be exactly normal if the errors aren't normal, which could theoretically muck up significance tests and confidence intervals, but it will converge to a normal distribution with larger sample sizes.
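As a minimal sketch of what the bootstrap looks like here (simulated data and plain NumPy, purely for illustration, not anyone's real dataset): resample the (x, y) pairs with replacement, refit the regression each time, and read a percentile interval off the resulting distribution of slopes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with skewed (non-normal) errors: y = 2 + 3*x + e
n = 200
x = rng.uniform(0, 10, n)
errors = rng.exponential(scale=2.0, size=n) - 2.0  # mean-zero but right-skewed
y = 2.0 + 3.0 * x + errors

def ols_slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Nonparametric bootstrap: resample (x, y) pairs, refit, collect slopes
n_boot = 2000
slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, n)
    slopes[b] = ols_slope(x[idx], y[idx])

# Percentile 95% CI from the bootstrap distribution of the slope
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"slope = {ols_slope(x, y):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The percentile interval makes no normality assumption about the sampling distribution, which is the whole point of using it when the errors are non-normal.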

First is a transformation, which is simplest.
Simple to do, but it makes interpretation hard. E.g., how do you interpret a regression coefficient after a square-root transformation of the Y?

#### Miner

##### TS Contributor
It depends on why the residuals are non-normal. Is it due to trying to fit a nonlinear relationship with linear regression? Or due to variances changing with respect to the magnitude of the independent variable? Or with variances changing over time due to a latent variable? The approach taken for each would be different.

In the first scenario, a transformation using Tukey's Ladder of Powers may work. In the second scenario, a logarithmic transformation may stabilize the variances, and in the third, adding the latent variable to the model may be necessary.
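As a small illustration of the power-transformation idea, scipy's Box-Cox routine (a maximum-likelihood cousin of Tukey's ladder: lambda = 1 means no transform, 0.5 a square root, 0 a log, -1 a reciprocal) picks the exponent automatically. The data here are simulated right-skewed values, not from the thread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Right-skewed response: a log transform (lambda near 0 on the ladder)
# should roughly normalize it
y = rng.lognormal(mean=1.0, sigma=0.5, size=500)

# Box-Cox estimates the power lambda by maximum likelihood
y_transformed, lam = stats.boxcox(y)

print(f"estimated lambda = {lam:.2f}")
# Skewness should drop substantially after transforming
print(f"skew before = {stats.skew(y):.2f}, after = {stats.skew(y_transformed):.2f}")
```

For lognormal data the estimated lambda lands near 0, i.e., the routine rediscovers the log transform on its own.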

#### hlsmith

##### Omega Contributor
Don't forget about transforming the independent variables.

CB, are you referring to bootstrapping the sample, rerunning the model, and then using the 95% percentile CI from the bootstrap distribution?

#### noetsi

##### Fortran must die
CB, when I say common, I mean that this is what I have always seen recommended in the links, classes, and other sources I have encountered. I have not seen bootstrapping recommended, probably because (like me) most don't know how to do it. I agree that interpretation is difficult with transformations. When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those I run the data for want to know, whether the results are significant or not. The literature I have seen in the social sciences stresses the same: this variable was significant or was not. So the test, not the effect size, is the driving force. Which variable is relatively more important is critical to me, but slopes don't tell you that (as Dason has reminded me on more than one occasion).

Miner, I almost never work with data where there is theory (or at least theory I have found to date), so knowing whether there is a latent variable would be difficult. I am curious how you would know what caused the non-normality.

I was wondering, since this got no comments, whether anyone has worked with White's SEs.

#### hlsmith

##### Omega Contributor
noetsi,

Yes, I believe I have seen White's sandwich estimators mentioned. I have also seen them used when you cannot easily address dependence between observations.

#### noetsi

##### Fortran must die
I had thought that SAS only did White's estimator in PROC MODEL, which is not ideal for me, not only because it uses very different code than I am used to, but because it assumes a nonlinear model. However, it turns out that you can do this in PROC REG and PROC LOGISTIC by specifying ACOV. If you specify SPEC, it performs a test that the errors are homoscedastic - which I think is the White test (the SAS documentation is a bit unclear here).

```sas
ODS GRAPHICS ON;
PROC REG DATA=WORK.test2 PLOTS(ONLY)=ALL;
    Linear_Regression_Model: MODEL WEEKLYEARNINGS_CLO = WEEKLYEARNINGS_ACC
        / SELECTION=NONE ACOV VIF SPEC;
RUN;
ODS GRAPHICS OFF;
QUIT;
```

The "test of the first and second moment specification" is the way SAS refers to the White test.


#### hlsmith

##### Omega Contributor
Yeah, I think you can use versions of these in mixed models as well. I think there is a straight-up "/ WHITE" option in some procedure, or at least versions of robust estimators are available.

Why would robust estimators be used in logistic regression: dependence, or preference?

#### noetsi

##### Fortran must die
I have read in a variety of sources that logistic regression does not require equal error variance. But then I ran into this comment in a statistics class:

Warning: Heteroskedasticity can be very problematic with methods besides OLS. For example, in logistic regression heteroskedasticity can produce biased and misleading parameter estimates.
I was wondering if maybe this was really about unobserved heterogeneity, although that seems like a different issue.

#### Miner

##### TS Contributor
Miner, I almost never work with data where there is theory (or at least theory I have found to date), so knowing whether there is a latent variable would be difficult. I am curious how you would know what caused the non-normality.
I plot the residuals both in time sequence, if known, and against the fits. Heteroskedasticity will show up in the latter plot, and if the time sequence shows a shift or a trend, it is a sure sign of a latent variable. However, I do have an advantage working in industrial statistics, because the time sequence is usually known and there is usually a physical basis for theory. It is also usually relatively quick and easy to run a confirmation experiment on any predictions.

#### noetsi

##### Fortran must die
I am not sure what the equivalent test would be for the data I run (which is not an industrial process).

I am not sure what you mean by plotting the residuals in time sequence or by fits.

#### CowboyBear

##### Super Moderator
CB, when I say common, I mean that this is what I have always seen recommended in the links, classes, and other sources I have encountered. I have not seen bootstrapping recommended, probably because (like me) most don't know how to do it.
Bootstrapping is pretty well known in the social sciences; e.g., look how many hits come up from a Google Scholar search for "bootstrapping psychology".

Implementing bootstrapping would've been hard once upon a time, but for something simple like a regression we're talking about clicking a button in SPSS, or a few lines of code in R or SAS.

I am not saying that a log transformation can never be useful, but it is only useful in a restricted circumstance: i.e., when the errors are approximately lognormal, so that taking logs leaves them approximately normal.

When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those I run the data for want to know, whether the results are significant or not. The literature I have seen in the social sciences stresses the same: this variable was significant or was not.
I feel like you have a tendency to see something stated in a few sources and then make a premature conclusion about the consensus in the field. There is a huge literature in the social sciences concerned with the limitations of significance testing, and the difference between practical and statistical significance.

On a purely common sense level, do you really not care at all about the size of the relationships between variables? If so, why would you possibly care about whether or not those relationships are nonzero? That is all that statistical significance testing (attempts to) show you.

Which variable is relatively more important is critical to me, but slopes don't tell you that (as dason has reminded me on more than one occasion).
I guess it depends on your definition of "important", but in general the slope tells you a lot more than the p value about the practical importance of the relationship.

EDIT: Sorry if the above seems grumbly - not my intent. I haven't had enough coffee yet

#### noetsi

##### Fortran must die
Not a problem. I have not done research for many years, though I did earn a master's in measurement and statistics (if in education) recently, and I do spend significant amounts of time looking at the literature on methods (I have hundreds of pages of typed notes on these topics; one of them is driving this question). What counts as "a few" is subject to debate, but I have looked at what seems to me to be quite a lot of sources. Psychology, like economics, is a branch of the social sciences that is far more concerned with methods than the fields I have worked in (notably administration, but others as well).

I understand the issue of substantive versus statistical significance. Explaining it to a non-statistical audience is not simple (and in a PhD in public management and several master's programs I have never seen this topic raised in any of the main journals in my fields).

I will work on learning bootstrapping. I worked on simulation recently in SAS; it can't be that much tougher, right?

#### CowboyBear

##### Super Moderator
I understand the issue of substantive versus statistical significance. Explaining it to a non-statistical audience is not simple
Sure, communicating what relationship size might be practically significant is difficult, but I can practically guarantee you that the people you're talking to don't understand what statistical significance means either; it's a complicated concept. If the argument is to keep reporting simple and easy to understand, then there's really no place for significance testing.

I will work on learning bootstrapping. I worked on simulation recently in SAS; it can't be that much tougher, right?
Yep you'll be fine :tup:

#### Injektilo

##### New Member
When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those I run the data for want to know, whether the results are significant or not. The literature I have seen in the social sciences stresses the same: this variable was significant or was not. So the test, not the effect size, is the driving force. Which variable is relatively more important is critical to me, but slopes don't tell you that (as Dason has reminded me on more than one occasion).
The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
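A quick sketch of why the standardization matters (simulated data; the scaling by sd(x)/sd(y) is the usual textbook formula): a raw slope can look tiny just because its predictor is measured on a big scale, while the standardized betas put both predictors on a common, unit-free footing.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two predictors on very different scales
n = 500
x1 = rng.normal(0, 1, n)      # small scale
x2 = rng.normal(0, 100, n)    # large scale
y = 2.0 * x1 + 0.01 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standardized beta = raw slope * (sd of predictor / sd of response)
std_beta1 = beta[1] * x1.std() / y.std()
std_beta2 = beta[2] * x2.std() / y.std()

print(f"raw slopes:         x1 = {beta[1]:.3f}, x2 = {beta[2]:.3f}")
print(f"standardized betas: x1 = {std_beta1:.3f}, x2 = {std_beta2:.3f}")
```

The raw slope on x2 (about 0.01) looks negligible next to x1's (about 2), yet the standardized betas show x2 carries roughly half the impact of x1, because its spread is a hundred times larger.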

#### Miner

##### TS Contributor
I am not sure what the equivalent test would be for the data I run (which is not an industrial process).

I am not sure what you mean by plotting the residuals in time sequence or by fits.
These are residual diagnostic plots from Minitab. In addition to the normality test, residuals are plotted against fits to check for heteroskedasticity or curvature. And, if the data were taken in time sequence, the time-order plot is examined for shifts and trends that might indicate a latent variable. If the data are not in time sequence, this last graph is ignored.

This site has a good discussion on the residuals vs. fits analysis.

This site discusses both residuals vs. fits and residuals vs. time order.
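For what it's worth, the two plots described above can be sketched in a few lines of Python/matplotlib on simulated heteroskedastic data; the fan shape in the residuals-vs-fits panel is the telltale sign, and the order plot is where a shift or trend from a latent variable would show up.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# Simulated data whose error spread grows with the fitted value
n = 200
x = np.linspace(1, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3 * x, n)

# Fit a straight line, compute fitted values and residuals
b1, b0 = np.polyfit(x, y, 1)
fits = b0 + b1 * x
resid = y - fits

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fits, resid)    # fan shape here = heteroskedasticity
ax1.axhline(0, color="red")
ax1.set(xlabel="fitted value", ylabel="residual", title="Residuals vs fits")
ax2.plot(resid)             # shifts/trends here suggest a latent variable
ax2.axhline(0, color="red")
ax2.set(xlabel="observation order", ylabel="residual", title="Residuals vs order")
fig.savefig("residual_diagnostics.png")

# Numeric footprint of the fan: |residual| grows with the fitted value
spread_corr = np.corrcoef(fits, np.abs(resid))[0, 1]
print(f"corr(fits, |resid|) = {spread_corr:.2f}")
```

The positive correlation between the fitted values and the absolute residuals is the numeric counterpart of the fan shape you would see in the first panel.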

#### noetsi

##### Fortran must die
The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
There is some debate about whether standardized betas are adequate measures of relative impact even in linear regression. But the real problem occurs in logistic regression, where there are no agreed-on standardized betas and where the ones that have been proposed generate significantly different results in some cases. Unfortunately for me, much of my analysis is with logistic regression, even though this thread is not about that.

#### noetsi

##### Fortran must die
Thanks, Miner. I knew of plotting residuals against fitted values. I had not heard of the time-sequence variant (probably because I mainly work with ESM, and assumptions are largely ignored in that form of time series).

#### noetsi

##### Fortran must die
Here is a point that has long puzzled me. Does this mean that if we have samples of, say, 300 (none of mine would ever be less than that), it does not matter for regression if the data are non-normal? I would say I had read this in many sources, which I have, but then we would have to define "many".

In the field of statistics, there are lots of methods that are practically guaranteed to work well if the data are approximately normally distributed and if all we are interested in are linear combinations of these normally distributed variables. In fact, if our sample sizes are large enough we can use the central limit theorem which tells us that we would expect means to converge on normality so we do not even need to have samples from a normal distribution as N increases. So if we have two groups of say 100 subjects each and we are interested in mean change from baseline of a variable then we have no need to worry and can apply standard statistical methods with only the most basic of checks for statistical validity.
http://www.lexjansen.com/phuse/2005/pk/pk02.pdf

I note this quote only mentions means. I assume it also applies to slopes, but I am not sure whether it applies to, say, standard errors.
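One way to see the CLT claim for regression specifically is a small simulation (Python, simulated data with strongly skewed errors, purely illustrative): the sampling distribution of the slope looks much more normal at n = 300 than at n = 10, even though the errors themselves are far from normal. Excess kurtosis, which is 0 for a normal distribution, is used as the yardstick.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def slope_estimates(n, reps=4000):
    """Sampling distribution of the OLS slope under skewed errors."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, n)
        # centered exponential errors: mean 0 but strongly non-normal
        y = 1.0 + 2.0 * x + (rng.exponential(2.0, n) - 2.0)
        slopes[r] = np.polyfit(x, y, 1)[0]  # fitted slope
    return slopes

small = slope_estimates(n=10)
large = slope_estimates(n=300)

# Excess kurtosis shrinks toward 0 (the normal value) as n grows
print(f"excess kurtosis of slopes, n=10:  {stats.kurtosis(small):.3f}")
print(f"excess kurtosis of slopes, n=300: {stats.kurtosis(large):.3f}")
```

Note this convergence is about the coefficients' sampling distribution (and hence tests and intervals); it says nothing about prediction intervals for individual observations, which do depend on the error distribution itself.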
