increasing the r squared

#1
I have run a regression with 4 variables, 1 of them being the dependent. My R-squared is 0.25. Is there any way I can increase this, or is there a better statistic that shows the power of the model? Sorry, I'm new.
 


Ninja say what!?!
#2
Re: increasing the r squared

It's ok. We'll help wherever we can.

An easy way of increasing your r-square value is to just add more predictors (i.e. independent variables) into your regression.
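Here's a quick demonstration in Python with numpy/statsmodels (simulated data, so purely an illustration): even columns of pure noise push R-squared up.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)  # y truly depends on x alone

# Model 1: the one real predictor
r2_one = sm.OLS(y, sm.add_constant(x)).fit().rsquared

# Model 2: same, plus 10 columns of pure noise as extra "predictors"
X_big = sm.add_constant(np.column_stack([x, rng.normal(size=(n, 10))]))
r2_many = sm.OLS(y, X_big).fit().rsquared

print(r2_one, r2_many)  # the second value is never smaller

Of course, a higher R-squared from noise predictors doesn't make the model any better, which is worth keeping in mind.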
 
#3
Re: increasing the r squared

I have also lagged the dependent variable y and put it on the right-hand side of the equation as y-1. My R-squared is now 1. Is this OK?
 

jpkelley

TS Contributor
#4
Re: increasing the r squared

Lagging is fine, as long as it fits your question. Could you tell us more about your question (i.e. what your goals are for your modeling)? If your goal is prediction, then you might want to add more variables (some of which may account for very small amounts of the variation in y). If your goal is to find a single model that best balances explanatory power with simplicity, then you will need to take a model simplification/selection approach.
 

noetsi

Fortran must die
#5
Re: increasing the r squared

Adding variables will always increase your r squared, but you should only do that if it makes substantive sense. You should use adjusted R square in any case.

If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model. You may also have the wrong functional form; e.g., it should be curvilinear rather than linear.
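If your software doesn't report it, adjusted R square is easy to compute by hand. A quick Python sketch (the numbers here are made up):

def adjusted_r2(r2, n, p):
    # Adjusted R-squared penalizes R2 for the number of predictors p,
    # given n observations.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. R2 = 0.25 with 70 observations and 3 predictors
print(adjusted_r2(0.25, 70, 3))  # about 0.216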
 
#6
Re: increasing the r squared

My model reads y = a1 + a2 + y-1, where y is the salary, a1 = the years employed, and a2 = years of education. The problem I'm having is that some salaries are higher than others according to male and female, and I'm trying to see if there is any statistical significance. At the moment the t-value for a1 is a lot higher than for a2; does this mean that variable has more influence on salary? Also, do I need a dummy variable for gender to show the difference? Thanks.
 

jpkelley

TS Contributor
#9
Re: increasing the r squared

The scope of this discussion appears to be expanding very rapidly. Maybe it would be good to slow down and go back to your original question.

Your question as I understand it: What factors influence salaries?
As independent variables, you have years employed, years of education, and sex.

Your goal, as it appears to me, is NOT prediction, per se (but I could be wrong), but rather to determine how these factors (and their interactions) influence salary (y). At this basic stage AND given only this information, the purpose of including a lag in the analysis is not apparent. If the goal is to determine what factors influence salary, I don't see why you want to increase R-squared. Again, I could be wrong about your approach. Regardless, I think it might be good to step back and think more about your data.

- Do you have many years' data from each individual (that is, do you have data from John in 2002, 2003, 2004, and did he obtain more education in 2004)?
- Is your data unbalanced?
- What does that variation between males and females look like?
- Sample size?
- Salaries adjusted for inflation, etc.?

And, as noetsi suggested, might your data be nonlinear rather than linear?

On the practical side:
- What program are you using for analysis?

Until you ask some of these questions about your data, I would not move towards data-transformation. You might read some of the discussion on this thread:
http://www.talkstats.com/showthread.php/18573-Frequent-Statistical-Misunderstandings

As a side note:
If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model.
I would be careful of log-transforming (or using arcsin transformation) on such bounded data, and this probably would not help improve the normality of your model residuals at all. It has been shown that even the proper arcsin transformation for proportional data does not hold up well at the boundaries (<20% and >80%). Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!
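To make that concrete, here is roughly what fitting bounded proportions directly looks like (a Python/statsmodels sketch on simulated data; the variable names are invented, and this isn't necessarily the software anyone here is using):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated proportional response bounded in [0, 1]
rng = np.random.default_rng(1)
x = rng.normal(size=200)
prop = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))  # true logit-scale relationship
prop = np.clip(prop + rng.normal(scale=0.05, size=200), 0.001, 0.999)
df = pd.DataFrame({"prop": prop, "x": x})

# Fit the proportion directly with a logit link -- no transformation needed
fit = smf.glm("prop ~ x", data=df, family=sm.families.Binomial()).fit()
print(fit.summary())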

Anyway, my suggestion is to back up and identify your goals for modeling. And, should your goal be what I mentioned above, I would suggest simplifying your model based on some criterion that penalizes extra terms in the model (adjusted R-squared, AIC, etc.). You'll end up with a final model that will help you understand the influence of sex, years of education, and whatnot on current salaries.
 

noetsi

Fortran must die
#10
Re: increasing the r squared

In terms of using dummy predictor variables

Yes, this is a common thing to do. Wage regressions are a good example. Economists usually take the log of the dependent variable (wages). Then some regressors (such as education) typically enter the equation linearly -- including some dummy variables like race or sex dummies -- while other regressors (such as parents' income) typically enter in log form.
http://www.stata.com/statalist/archive/2005-09/msg00155.html

This is useful.

ln(Y) = a + bX + cD + ε . (1)


Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, 100b is the percentage change in Y for a small change in X (up or down), other things held equal.


Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course, that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough, those same people who go about this the correct way when computing marginal effects in the case of Logit and Probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D, gives you the correct interpretation immediately:


If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)


If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
http://davegiles.blogspot.com/2011/03/dummies-for-dummies.html
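In code, the correction is a one-liner. A sketch (the coefficient value here is made up):

import numpy as np

c = 0.25  # made-up fitted coefficient on the 0/1 dummy in the ln(Y) regression

naive = 100 * c                # the common misreading: "25%"
up = 100 * (np.exp(c) - 1)     # D switching 0 -> 1, eq. (2): about +28.4%
down = 100 * (np.exp(-c) - 1)  # D switching 1 -> 0, eq. (3): about -22.1%
print(naive, up, down)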

I always thought that if you transformed the Y value to a log you had to do the same to all predictor variables. But neither of the comments above suggests this.
 

noetsi

Fortran must die
#11
Re: increasing the r squared

Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!
But such transformations are common (particularly in economics and business), possibly because most don't know GLMMs (by which I assume you mean methods like logistic regression or probit). The common wisdom I have read is that one should use logs with ceilings and floors (although the common wisdom I have been taught appears to be commonly wrong).

Reading the original post, I think the point is to increase R squared (that is, better predictive capacity).

I found this useful in explaining the use of logs

http://www.ats.ucla.edu/stat/sas/faq/sas_interpret_log.htm
 

jpkelley

TS Contributor
#12
They can cause trouble. For proportional data, the arcsin(sqrt(x)) transformation is normally used. Again, at the boundaries, these do not work. As an example, I generated a highly skewed set of proportional data (i.e. close to 1) and then plotted the raw data, the log-transformed data, and the arcsin-transformed data.

Is this what you were talking about?

[EDIT: The right panel histogram should be labeled "arcsin transformed"]
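For anyone who wants to reproduce the comparison, a sketch along these lines should do it (Python here; I've used a Beta distribution as a stand-in for the skewed proportional data):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
p = rng.beta(8, 1, size=500)  # proportions heavily skewed toward 1

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
panels = ((p, "raw"), (np.log(p), "log transformed"),
          (np.arcsin(np.sqrt(p)), "arcsin transformed"))
for ax, (data, title) in zip(axes, panels):
    ax.hist(data, bins=30)
    ax.set_title(title)
plt.tight_layout()
plt.show()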
 

noetsi

Fortran must die
#13
Actually I was talking about this type of recommendation "Three commonly cited transformations to reduce positive skewness, in order of their impact are as follows: Square root transformation, log transformation, and reflected inverse transformation" from "Data Analysis Using SAS Enterprise Guide." The same book states that transformations are commonly recommended (which is my observation of the literature as well).

This notes that floors commonly cause positive skew.

http://books.google.com/books?id=pU...v=onepage&q=what causes positive skew&f=false

I was wrong to assume both ceilings and floors could be addressed by logs, since the negative skew caused by ceilings is not addressed with a log.
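For reference, those three transformations in code (a Python sketch; x stands for any positive, positively skewed variable, and the reflected inverse is written with a minus sign so the ordering of scores is preserved):

import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # positively skewed, all positive

sqrt_x = np.sqrt(x)  # square root: the mildest correction
log_x = np.log(x)    # log: stronger
inv_x = -1.0 / x     # reflected inverse: strongest of the three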
 
#14
- Is your data unbalanced?
I think so. The wages are higher for the male section than the female section of the data, which suggests men are getting paid more for the same work.

- What does that variation between males and females look like?
- Sample size? It is 70, with 35 of each sex.
- Salaries adjusted for inflation, etc.? Yes.

The relationship does look linear. I see two horizontal lines going lower to higher when the males are looked at. I was wanting to add a log to make it more accurate; is this not needed? Without the log of salary the R-squared is 0.3. Also, one of the t-stats for the independents is negative; the other 2 are positive.
 

noetsi

Fortran must die
#15
With financial data it is not unusual to see heteroskedasticity. That is, the residuals' range (their spread) increases as the values of the data points get larger. Ignoring issues of skew, a log may help address this and (sometimes) make your explained variance better as a result.
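A residuals-versus-fitted plot before and after logging makes the effect easy to see. A sketch with simulated wage-like data (purely an illustration):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated wage-like data: the spread of y grows with x
rng = np.random.default_rng(4)
x = rng.uniform(1, 20, size=300)
y = np.exp(0.1 * x + rng.normal(scale=0.3, size=300)) * 1000

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, resp, title in zip(axes, (y, np.log(y)), ("raw y", "log y")):
    fit = sm.OLS(resp, sm.add_constant(x)).fit()
    ax.scatter(fit.fittedvalues, fit.resid, s=5)
    ax.set_title(title)  # the fan shape should flatten in the log panel
plt.show()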
 

noetsi

Fortran must die
#17
One thing to remember, and it is easy to forget, is that you can get everything right and still have poor R squared values. If the phenomenon is really complex, no small set of variables may in fact explain much.
 
#18
What's considered poor? Ideally I would like 1, but I know that ain't happening. Also, is the Durbin-Watson test worth looking at, in terms of being over 0.8, for autocorrelation?
 

jpkelley

TS Contributor
#20
What's considered poor? Ideally I would like 1
I don't know of anyone who has ever achieved this with empirical datasets. As to what is considered poor, that depends on your system. In physiology, where patterns of covariation are typically governed by tight constraints (e.g. receptor binding affinities, cardiac output, etc.), "good" r-squared values are likely to be high. For ecological data (e.g. the influence of rain on mating probability of frogs), "good" r-squared values might be very low (0.25). This is why people typically don't examine r-squared during the model simplification process. We just look for the best model for the data we have, assuming that the data we collected can adequately address the question at hand. Even adjusted r-squared (which accounts somewhat for model complexity) has been shown to perform worse than AIC for selecting the simplest and best explanatory models, especially for data with nonlinearities.
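To make the AIC comparison concrete, here is roughly how one might rank candidate models in Python (the column names are invented stand-ins for the salary data in this thread):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in for the salary data (n = 70, as in this thread)
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "years_emp": rng.integers(1, 30, size=70),
    "years_edu": rng.integers(10, 20, size=70),
    "sex": rng.integers(0, 2, size=70),
})
df["salary"] = (20000 + 800 * df.years_emp + 1500 * df.years_edu
                + 3000 * df.sex + rng.normal(scale=5000, size=70))

for f in ("salary ~ years_emp + years_edu + sex",
          "salary ~ years_emp + years_edu",
          "salary ~ years_emp"):
    fit = smf.ols(f, data=df).fit()
    print(f"AIC={fit.aic:9.1f}  adj. R2={fit.rsquared_adj:.3f}  {f}")
# Lower AIC is better; like adjusted R2 it penalizes extra terms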

You're smart to look for autocorrelation in your data. I would be more interested to know if you have repeated-measures data (more than one data row or measurement per individual).
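If you do want to check it, the Durbin-Watson statistic is one line with statsmodels. A sketch with simulated data standing in for yours (the statistic ranges from 0 to 4, and values near 2 suggest little autocorrelation):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated stand-in for a 70-row salary regression with 3 predictors
rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(70, 3)))
y = X @ np.array([10.0, 2.0, 1.0, 0.5]) + rng.normal(size=70)

fit = sm.OLS(y, X).fit()
print(durbin_watson(fit.resid))  # near 2 here, since the errors are independent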

Good luck. Sounds like a cool data set!