PLS HELP! Regression with non-normally distributed errors

#1
Hello Everyone,

I'm desperately looking for help with a regression analysis I'd like to conduct. Apparently the residuals/error terms are not normally distributed; a Shapiro-Wilk test has confirmed this. However, different visualizations suggest that the distribution is not that far from normal. I have already tried to transform the data of the Y variable, but about half the data has negative values, which rules out logarithms and square roots. The transformation 1/y made the problem even worse. The data sample contains about 100 observations.

What should I do?
Thank you so much!
 

rogojel

TS Contributor
#2
hi,
Could you give a few more details, maybe some data? As a quick idea: adding a constant to all values of your dependent variable will not change your model (except for the intercept), so you can easily get rid of negative values if that is the problem.
regards
rogojel
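
A minimal sketch of the point about adding a constant, assuming NumPy and simulated data (the variable names, the helper `ols`, and the numbers are made up for illustration, not taken from the uploaded file):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one predictor and a response that takes negative values.
x = rng.uniform(0, 10, size=100)
y = 2.0 - 0.5 * x + rng.normal(0, 3, size=100)

def ols(x, y):
    """Return (intercept, slope, residuals) from an ordinary least-squares fit."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1], y - X @ coef

shift = 1.0 - y.min()                    # smallest shifted value becomes 1
b0, b1, resid = ols(x, y)
b0_s, b1_s, resid_s = ols(x, y + shift)

print(f"slope before / after shift: {b1:.4f} / {b1_s:.4f}")
print(f"intercept change: {b0_s - b0:.4f} (shift was {shift:.4f})")
print("residuals identical:", np.allclose(resid, resid_s))
```

The slope and the residuals stay the same and only the intercept moves by the constant, so the shift by itself does nothing for normality. It only makes a later log or square-root transformation possible.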
 
#3
Hi rogojel,

thanks for your answer. I have uploaded the data. IRR is the dependent variable. I know there are some values that are considered outliers, but I have tried to exclude some without success in terms of normality.

What else do you need?
 

rogojel

TS Contributor
#4
hi,
having had a quick look at the data, I think the problem is that you essentially have a flat line (no dependence between exit rate and IRR). The variability of the data increases with the exit rate, though, and this explains why your residuals are not normal. Taking the logarithm of IRR helps a bit, but I do not see much sense in trying to regress IRR on exit rate with such a weak connection (R-sq for the log of IRR against exit rate is 2.5%).

Trying to model the variability of IRR as a function of exit rate looks more promising, if this is an interesting question for you.

regards
rogojel
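
One way to check whether the spread really grows with the predictor is a heteroscedasticity test. The sketch below is an assumption about how that could be done, not what was run on the uploaded file: it simulates a flat relationship whose spread grows with `exit_rate`, then computes Koenker's studentized variant of the Breusch-Pagan test by hand (n times the R-squared of a regression of the squared residuals on the predictor). It assumes NumPy and SciPy; `irr` and `exit_rate` here are simulated stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100

# Simulated stand-ins: flat mean, but the spread grows with exit_rate.
exit_rate = rng.uniform(0, 1, size=n)
irr = 0.1 + rng.normal(0, 1, size=n) * (0.05 + 0.5 * exit_rate)

# Step 1: ordinary least squares of irr on exit_rate.
X = np.column_stack([np.ones(n), exit_rate])
beta, *_ = np.linalg.lstsq(X, irr, rcond=None)
resid = irr - X @ beta

# Step 2: auxiliary regression of the squared residuals on the same predictor.
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ gamma
r2_aux = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# Step 3: LM statistic, compared with a chi-squared distribution (1 predictor).
lm = n * r2_aux
p_value = stats.chi2.sf(lm, df=1)

print(f"slope of irr on exit_rate: {beta[1]:.3f}")
print(f"LM statistic: {lm:.2f}, p-value: {p_value:.4g}")
```

A small p-value points to a variance that depends on the predictor, in which case heteroscedasticity-robust standard errors or a model for the variance may be more useful than chasing normality.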
 

rogojel

TS Contributor
#5
hi,
just looking at the data further, the variability is also not simple to model. Is it possible that you have a mixture of different types of data here?

regards
rogojel
 
#6
Hi, thanks for your input!
Both variables are in percent, so there is no difference in the type of the data.
I'm aware that the explanatory power is rather low. I'm replicating a model that has been used many times in the literature. The replication includes some more explanatory variables (see the attached file). Depending on the variables I include, I get an R-squared between 15% and 28%, which is quite alright. Even if R-squared were lower, it would be okay.

Is there some way that I can still proceed with a regression in some other form? Do you think a robust regression would solve the problem with the independent variable?

Thank you so much!
 

noetsi

Fortran must die
#7
I have already tried to transform the data of the Y variable, but about half the data has negative values, which eliminates logarithms and square roots.
Actually, what you do is add a constant to all the data so that the lowest value is positive. Then you take the log. So if the lowest point is -42, you add 43 to all points and then log the results.
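
The shift-then-log step, sketched with NumPy on a handful of made-up values (only the minimum of -42 and the constant 43 come from the example above):

```python
import numpy as np

# Made-up values; the minimum of -42 matches the example in the post.
y = np.array([-42.0, -10.0, 0.0, 5.0, 30.0])

shift = 1.0 - y.min()          # 43, so the smallest value maps to 1
y_log = np.log(y + shift)      # defined for every point, log(1) = 0 at the minimum

print("shift:", shift)
print("all finite:", np.all(np.isfinite(y_log)))
```

The constant is arbitrary, and the shape of the transformed variable depends on it, so it is worth trying more than one value before judging whether the transformation helps.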
 

rogojel

TS Contributor
#8
Hi, thanks for your input!
The data is both in percent, so there is no difference in the type of the data.

hi,
I mean something like some of the data coming from one type of input and some from another. E.g. if these were related to returns on stocks, you might have some stocks from the auto industry and some from chemicals, and they might behave differently.

regards
rogojel
 
#9
I mean something like some of the data coming from one type of input and some from another. E.g. if these were related to returns on stocks, you might have some stocks from the auto industry and some from chemicals, and they might behave differently
This should not be the issue. These are private equity returns; that's why there might be such large discrepancies and extreme values in the data set.

Actually, what you do is add a constant to all the data so that the lowest value is positive. Then you take the log. So if the lowest point is -42, you add 43 to all points and then log the results.
I have tried this in the meantime. It does not get any better.

Is there some other regression that is a bit more lax regarding the normality assumption?
 

noetsi

Fortran must die
#10
Logistic regression does not assume normality, although you would normally not use it with interval data. Robust regression is designed to deal with outliers (this includes methods based on M- and S-estimators). These methods may reduce the impact of normality violations.

One issue that has not been raised is why exactly you are concerned with non-normality. It only influences the standard errors and the tests of significance, not the parameter estimates. More importantly, regression is robust to violations of the normality assumption, at least if you have a fair number of cases.
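
To make the robust-regression idea concrete, here is a minimal sketch of a Huber M-estimator fitted by iteratively reweighted least squares. It assumes NumPy and simulated data with a few planted outliers. The function name `huber_irls` is hypothetical, and this is one common M-estimator, not necessarily what any particular statistics package runs by default.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate of regression coefficients via reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: median absolute deviation, rescaled for normal errors.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, c/u for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=100)   # true slope is 2
y[-5:] += 50.0                                   # five gross outliers at high x

X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_huber = huber_irls(X, y)

print(f"OLS slope:   {beta_ols[1]:.3f}")
print(f"Huber slope: {beta_huber[1]:.3f}")
```

The outliers pull the OLS slope away from the true value, while the Huber fit downweights them. Note that this changes the coefficient estimates, whereas non-normality by itself mainly affects the standard errors and tests.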
 
#11
Thank you very much noetsi!
They may reduce the impact of normality violations.
Does this also apply to the independent variable (sorry for my lack of knowledge)? If yes, then I should be fine with a robust regression, shouldn't I?

One issue that has not been raised is why exactly you are concerned with non-normality. It only influences the standard errors and the tests of significance, not the parameter estimates.
Well, I have just checked the respective literature what I should look out for. Furthermore, as you wrote, non-normality might influence the test of significance, which is important to me.

More importantly, regression is robust to violations of the normality assumption, at least if you have a fair number of cases.
What is a fair number that I can also refer to in academic terms?
 

noetsi

Fortran must die
#12
Robust regression deals with the estimation of the regression line. You don't care about normality in the DV or IV. You care about normality of the residuals in the regression. The IV and DV normality does not matter at all for regression. All that matters is if the residuals are normal. That point gets missed a lot in treatments of normality which tend to focus on univariate analysis of normality (that is in the raw data).
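
A small simulation can illustrate the distinction between normality of the raw variables and normality of the residuals. The sketch assumes NumPy and SciPy, and the data are simulated so that the predictor is strongly skewed while the errors are normal by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100

x = rng.exponential(scale=2.0, size=n)           # strongly skewed predictor
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)     # errors are normal by construction

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Shapiro-Wilk on the raw variables and on the residuals.
_, p_x = stats.shapiro(x)
_, p_y = stats.shapiro(y)
_, p_resid = stats.shapiro(resid)

print(f"Shapiro-Wilk p-value, IV:        {p_x:.3g}")
print(f"Shapiro-Wilk p-value, DV:        {p_y:.3g}")
print(f"Shapiro-Wilk p-value, residuals: {p_resid:.3g}")
```

Both raw variables fail the test because the skew of the predictor carries over to the response, yet that says nothing about the regression assumption. Only the last p-value speaks to it.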

In my experience, and I found this out painfully coming here, normality is badly distorted in the literature. First, they focus on normality specifically in the DV or IV, which does not matter. Second, they ignore that regression is robust to violations of normality (although what that means in practice is never very clear). But you can likely have mild non-normality in the residuals, and if you have enough cases it won't matter that much. What a fair number is, is never really defined in concrete terms, in part because it depends on how many predictors you have. If you have several hundred cases and a few predictors, you likely have a large sample.

One possibility is to use one of the non-parametric tests and see if the results you get are generally similar. If they are, use the regression (or at least you can have more confidence in the regression results).
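
No specific non-parametric method is named here. As one possible reading, the sketch below compares the OLS slope with the Theil-Sen estimator (the median of all pairwise slopes), using `scipy.stats.theilslopes`. The choice of Theil-Sen is an assumption for illustration, the data are simulated with heavy-tailed errors, and the method as used here handles a single predictor only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100

x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=n)   # heavy-tailed, non-normal errors

# Parametric fit: ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Non-parametric fit: Theil-Sen slope with a 95% confidence interval.
ts_slope, ts_intercept, ts_low, ts_high = stats.theilslopes(y, x, 0.95)

print(f"OLS slope:       {beta_ols[1]:.3f}")
print(f"Theil-Sen slope: {ts_slope:.3f} (95% CI {ts_low:.3f} to {ts_high:.3f})")
```

If the two slopes and their conclusions about the sign and rough size of the effect agree, that gives some extra confidence in the OLS results.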
 
#13
Well, thank you! That pretty much clarifies the issue for the purpose of my research.

One last question: Do you think I am fine, going with the Spearman correlation as a back-up?

Greetings,
CheersToStata
 

noetsi

Fortran must die
#14
I am not sure what you mean by a backup, but if you mean as a substitute for a non-parametric test, then I have not heard of that being done. Note that I am not particularly experienced with non-parametric tests, which I don't use (they are very important; I just never had a chance to learn them).
 
#15
I meant to use Spearman as a backup for the OLS regression, as you suggested.
Since the regression tells the story I want to tell, I will just double-check the significance of the variables with a Spearman correlation test. I'm pretty sure that is the right choice!

Thanks again for your help!
 

noetsi

Fortran must die
#16
I am not sure where I suggested using Spearman's. Using it is not the same thing as using a non-parametric equivalent to regression (or in any case it is not what I meant by that).