checking assumptions

#1
Hi,
I am not really able to understand whether the basic assumptions of linear regression are met (I'm kind of a novice).
I have a dataset that is big enough (as far as I understand), a series of significant coefficients (both individually and jointly), and an adjusted R-squared over 0.9.
Still, I'm afraid that the model does not fit the data (I tried log and quadratic forms to improve the situation, but I did not get much of an improvement).
I attach the residual plots and would welcome any suggestions. Thanks.
Rplot.JPG
 

noetsi

Fortran must die
#2
Your data look skewed to me based on the QQ plot. Your residual plot is hard to read, but it might indicate heteroscedasticity. Certainly you expect it to spread out. You should consider a linearity test, something like Box-Tidwell.
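
One way to check the heteroscedasticity suspicion numerically rather than by eye is a Breusch-Pagan style test. The following is only a minimal sketch in base R on simulated data (the variables `x` and `y` are hypothetical, not the poster's dataset). It computes the studentized (Koenker) version of the statistic by hand: regress the squared residuals on the predictors and compare n * R-squared with a chi-squared distribution.

```r
# Simulated data (hypothetical): the error spread grows with x.
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)  # sd proportional to x -> heteroscedastic

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors.
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

A small p-value suggests the residual variance depends on the predictors. With the real model you would replace `fit` with your own `lm` object and put all of its predictors in the auxiliary regression; `bptest()` in the lmtest package is a packaged alternative.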

There are better influence measures than Cook's distance, but in any case that is not formally a regression assumption. With a lot of data, individual points are unlikely to move the regression line. Trying to find a set of points that does this is... painful. :p

You should test for multicollinearity, with something like a VIF test.
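
As a rough illustration of what a VIF check does, here is a minimal base R sketch on simulated predictors (`x1`, `x2`, `x3` are hypothetical names, with `x2` built to be correlated with `x1`). For each predictor j it computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others.

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)                  # unrelated predictor
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
```

Values near 1 mean a predictor is nearly unrelated to the others; common rules of thumb treat values above roughly 5 or 10 as a warning sign, though those cutoffs are conventions rather than formal tests. The `vif()` function in the car package does the same job directly on a fitted model.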

I suggest the Sage monograph Understanding Regression Assumptions by William Berry. It's a good starting point for understanding these. I could send you my tome on regression assumptions, but I am not a statistician and it could be wrong :)
 
#3
Thank you for your answer,
I suspect heteroscedasticity as well. I'm going to run some further tests and think about introducing some other variables...
I also considered trimming the distribution to exclude the most extreme outliers, but I am not sure whether I can justify it properly...
I found Berry's Sage monograph as a PDF; I will have a look.
Thank you again :)
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
What is your dependent variable? It kind of seems to be lower bounded. Yes, trimming outliers is a very serious decision that can't be made without considerable thought. Can you show the plots zoomed in on the right-hand side? It is hard to tell what is going on in that mass of congested dots.
 
#5
Hi, thank you for your reply,
my independent variable is a measure of public investment. It has a large variance:
summary(dataset$INVESTIMENTI)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
        0    162093    401503   1125314    916409 449526635
I attach the zoomed residual plots:
1589872065131.png
1589872103504.png
1589872124000.png
1589872138486.png
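
In the summary above the mean is nearly three times the median and the minimum is 0, which points to a strongly right-skewed variable for which a plain `log()` would fail at zero. The following is only a sketch on simulated data (`invest` is a hypothetical stand-in, not the real INVESTIMENTI column) showing how `log1p()`, i.e. log(1 + x), handles the zero and pulls the mean and median back together.

```r
set.seed(3)
# Hypothetical stand-in for a heavily right-skewed, non-negative variable.
invest <- c(0, rlnorm(999, meanlog = 13, sdlog = 1.5))

print(summary(invest))
ratio_raw <- mean(invest) / median(invest)

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
invest_log <- log1p(invest)
ratio_log  <- mean(invest_log) / median(invest_log)

cat("mean/median raw:", ratio_raw, " after log1p:", ratio_log, "\n")
```

A mean/median ratio close to 1 after the transform indicates a much more symmetric distribution. Whether the transformed variable actually improves the regression still has to be judged from the residual plots.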
 

noetsi

#6
These days the recommendation is not to remove outliers unless they are caused by data errors (for example, you coded something wrong). Instead you should ask why they exist. One possibility is that your true distribution is not normal, so they only seem like outliers (outlier analyses commonly assume normal distributions, I believe).