I have run a regression with 4 variables, 1 being independent. My R-squared is 0.25. Is there any way I can increase this, or is there a better statistic that shows the power of the model? Sorry, I'm new.
It's ok. We'll help wherever we can.
An easy way of increasing your r-square value is to just add more predictors (i.e. independent variables) into your regression.
I have also lagged the independent variable y on the right-hand side of the equation to y-1. My R-squared is now = 1. Is this OK?
Lagging is fine, as long as it fits your question. Could you tell us more about your question (i.e. what your goals are for your modeling)? If your goal is prediction, then you might want to add more variables (some of which may account for very small amounts of the variation in y). If your goal is to find a single model that best balances explanatory power with simplicity, then you will need to take a model simplification/selection approach.
Adding variables will never decrease your R squared (in practice it almost always goes up), but you should only do that if it makes substantive sense. You should use adjusted R square in any case.
If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model. You may also have the wrong functional form, e.g., it should be curvilinear rather than linear.
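To make the R squared versus adjusted R squared point concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the data are simulated, the helper name `r_squared` is made up, and the sample size of 70 just mirrors a small study. It fits the same response with two real predictors and then again with ten pure-noise predictors added.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X plus an intercept."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes each extra predictor through the degrees of freedom.
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 70
x = rng.normal(size=(n, 2))                      # two predictors that matter
y = 1.0 + 0.5 * x[:, 0] + 0.3 * x[:, 1] + rng.normal(size=n)
noise = rng.normal(size=(n, 10))                 # ten predictors unrelated to y

r2_small, adj_small = r_squared(x, y)
r2_big, adj_big = r_squared(np.column_stack([x, noise]), y)

print(f"2 predictors:  R2 = {r2_small:.3f}, adjusted R2 = {adj_small:.3f}")
print(f"12 predictors: R2 = {r2_big:.3f}, adjusted R2 = {adj_big:.3f}")
```

Plain R squared cannot go down when the noise columns are added, because the smaller model is nested in the larger one. Adjusted R squared has no such guarantee and will typically reward only predictors that explain more than chance would.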
"Non-response is only a problem if the non-respondents are a non-random sample of the total sample. Unfortunately, this seems almost always to be the case. "
My model reads y + a1 + a2 + y-1, where y is the salary, a1 = the years employed, and a2 = years of education. The problem I'm having is that some salaries are higher than others according to male and female. I'm trying to see if there is any statistical significance. At the moment the t value for a1 is a lot higher than for a2; does this mean that variable has more influence on salary? Also, do I need a dummy variable for gender to show the difference? Thanks.
How would I work in logs? Would I just need to do this: log(y) + log(a1), etc.?
Last edited by diesel20056; 12-19-2011 at 01:40 PM. Reason: wrong equation
Yes but why is the y value on the right of the equation?
Normally the form is Y = b + b1X1 + b2X2 ... or something like that.
This might help
http://en.wikipedia.org/wiki/Data_tr...on_(statistics)
The scope of this discussion appears to be expanding very rapidly. Maybe it would be good to slow down and go back to your original question.
Your question as I understand it: What factors influence salaries?
As independent variables, you have years employed, years of education, and sex.
Your goal, as it appears to me, is NOT prediction per se (but I could be wrong), but rather to determine how these factors (and their interactions) influence salary (y). At this basic stage AND given only this information, the purpose of including a lag in the analysis is not apparent. If the goal is to determine what factors influence salary, I don't see why you want to increase R-squared. Again, I could be wrong about your approach. Regardless, I think it might be good to step back and think more about your data.
- Do you have many years' data from each individual (that is, do you have data from John in 2002, 2003, 2004, and did he obtain more education in 2004)?
- Is your data unbalanced?
- What does that variation between males and females look like?
- Sample size?
- Salaries adjusted for inflation, etc.?
And, as noetsi suggested, might your data be nonlinear rather than linear?
On the practical side:
- What program are you using for analysis?
Until you ask some of these questions about your data, I would not move towards data-transformation. You might read some of the discussion on this thread:
http://www.talkstats.com/showthread....understandings
As a side note:
"If your Y is bounded on one or both ends (for example percentages that go from 0 to 100) logging your data is one way to improve your model."
I would be careful of log-transforming (or using arcsin transformation) on such bounded data, and this probably would not help improve the normality of your model residuals at all. It has been shown that even the proper arcsin transformation for proportional data does not hold up well at the boundaries (<20% and >80%). Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!
Anyway, my suggestion is to back up and identify your goals for modeling. And, should your goal be what I mentioned above, I would suggest simplifying your model based on some criterion that penalizes extra terms in the model (adjusted R-squared, AIC, etc.). You'll end up with a final model that will help you understand the influence of sex, years of education, and whatnot on current salaries.
In terms of using dummy predictor variables:
http://www.stata.com/statalist/archi.../msg00155.html
"yes, this is a common thing to do. wage regressions are a good example. economists usually take the log of the dependent variable (wages). and then some regressors (such as education) typically enter the equation linearly -- including some dummy variables like race or sex dummies. while other regressors (such as parents' income) typically enter in log form."
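As a rough illustration of that kind of wage regression, here is a sketch in Python with NumPy. It is not the original poster's data or model: the salaries are simulated, the variable names (`years_employed`, `years_education`, `male`) and the coefficient values are assumptions picked for the example, and only the response is logged while the regressors and the dummy enter linearly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 70
years_employed = rng.uniform(0, 30, size=n)
years_education = rng.uniform(10, 20, size=n)
male = np.repeat([0.0, 1.0], n // 2)      # 0/1 dummy, 35 of each sex

# Simulated salaries, built so that log(salary) is linear in the predictors.
log_salary = (10.0 + 0.02 * years_employed + 0.05 * years_education
              + 0.15 * male + rng.normal(scale=0.1, size=n))
salary = np.exp(log_salary)

# OLS of log(salary) on an intercept, the two continuous regressors, and the dummy.
design = np.column_stack([np.ones(n), years_employed, years_education, male])
coef, *_ = np.linalg.lstsq(design, np.log(salary), rcond=None)

names = ["intercept", "years_employed", "years_education", "male"]
for name, b in zip(names, coef):
    print(f"{name:16s} {b: .4f}")
```

With real data you would normally use a statistics package that also reports standard errors and t values; the point here is only the shape of the design matrix, with the dummy as one more column.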
This is useful.
http://davegiles.blogspot.com/2011/0...r-dummies.html
ln(Y) = a + bX + cD + ε    (1)
Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, 100b is the percentage change in Y for a small change in X (up or down), other things held equal.
Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough those same people who go about this the correct way when computing marginal effects in the case of Logit and Probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D gives you the correct interpretation immediately:
If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)
If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
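For a quick numerical check of formulas (2) and (3), here is a small Python sketch. The function name `dummy_percent_effect` and the value c = 0.2 are made up for illustration; the only thing taken from the quoted text is the formula 100[exp(±c) - 1].

```python
import math

def dummy_percent_effect(c, switch_on=True):
    """Percent change in Y for the model ln(Y) = a + bX + cD when D switches
    0 -> 1 (switch_on=True) or 1 -> 0 (switch_on=False)."""
    return 100.0 * (math.exp(c if switch_on else -c) - 1.0)

c = 0.2  # hypothetical estimated coefficient on the dummy
print(f"naive reading 100*c:  {100 * c:.2f}%")                                   # 20.00%
print(f"D from 0 to 1:        {dummy_percent_effect(c):.2f}%")                   # 22.14%
print(f"D from 1 to 0:        {dummy_percent_effect(c, switch_on=False):.2f}%")  # -18.13%
```

The gap between the naive 100c and the exact value is small when c is near zero and grows as c gets larger, and the two directions of the switch are not mirror images of each other.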
I always thought that if you transformed the Y value to a log you had to do the same to all predictor variables. But neither of the comments above suggests this.
"Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!"
But such transformations are common (particularly in economics and business), possibly because most don't know GLMMs (by which I assume you mean methods like logistic regression or probit). The common wisdom I have read is that one should use logs with ceilings and floors (although the common wisdom I have been taught appears to be commonly wrong).
Reading the original post I think the point is to increase R squared (that is better predictive capacity).
I found this useful in explaining the use of logs
http://www.ats.ucla.edu/stat/sas/faq...erpret_log.htm
They can cause trouble. For proportional data, the arcsin(sqrt(x)) transformation is normally used. Again, at the boundaries, these do not work. As an example, I generated a highly skewed set of proportional data (i.e. close to 1) and then plotted the raw data, log-transformed, and the arcsin transformation.
Is this what you were talking about?
[EDIT: The right panel histogram should be labeled "arcsin transformed"]
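Since the plot itself is not reproduced here, the comparison can be sketched roughly in Python with NumPy. This is not the code behind the original figure: the Beta(20, 1) distribution, the sample size, and the `skewness` helper are all assumptions chosen to give proportions piled up near 1. It prints a skewness figure for the raw, log-transformed, and arcsin(sqrt(x))-transformed values instead of drawing histograms.

```python
import numpy as np

def skewness(v):
    """Sample skewness as the mean cubed z-score (population form)."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(2)
p = rng.beta(20, 1, size=1000)    # simulated proportions, mostly close to 1

transforms = {
    "raw": p,
    "log": np.log(p),
    "arcsin(sqrt)": np.arcsin(np.sqrt(p)),
}
for name, values in transforms.items():
    print(f"{name:13s} skewness = {skewness(values): .2f}")
```

A skewness near zero would mean the transformation had made the data roughly symmetric; values that stay well away from zero indicate the transformation has not fixed the problem near the boundary.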
Last edited by jpkelley; 12-19-2011 at 02:41 PM. Reason: histogram title
Actually I was talking about this type of recommendation "Three commonly cited transformations to reduce positive skewness, in order of their impact are as follows: Square root transformation, log transformation, and reflected inverse transformation" from "Data Analysis Using SAS Enterprise Guide." The same book states that transformations are commonly recommended (which is my observation of the literature as well).
This notes that floors commonly cause positive skew.
http://books.google.com/books?id=pU6...20skew&f=false
I was wrong to assume both ceilings and floors could be addressed by logs, since a log is not used for the negative skew that ceilings cause.
- Is your data unbalanced?
I think so. The wages are higher for the male section than the female section of the data, which suggests men are getting paid more for the same work.
- What does that variation between males and females look like?
- Sample size? 70, 35 of each sex.
- Salaries adjusted for inflation, etc.? Yes.
The relationship does look linear. I see two horizontal lines going from lower to higher when the males are looked at. I was wanting to add a log to make it more accurate; is this not needed? Without the log of salary the R-squared is 0.3. Also, one of the t stats for the independents is negative; the other 2 are positive.
With financial data it is not unusual to see heteroskedasticity. That is, the range of the residuals (their spread) increases as the values of the data points get larger. Ignoring issues of skew, a log may help address this and (sometimes) make your explained variance better as a result.
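A rough way to see the heteroskedasticity point is the Python/NumPy sketch below. The data are simulated with multiplicative errors (an assumption, as are the coefficient values, the large sample size used to keep the comparison stable, and the helper name `residual_spread_ratio`). It fits a straight line to salary and to log(salary) and compares the residual spread in the upper half of x with the lower half.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0, 30, size=n)    # e.g. years employed (simulated)
# Multiplicative errors: the spread of salary grows with its level.
salary = np.exp(10.0 + 0.05 * x + rng.normal(scale=0.3, size=n))

def residual_spread_ratio(x, y):
    """Fit y ~ x by OLS and return the sd of residuals for x above its
    median divided by the sd of residuals for x at or below its median."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    upper = x > np.median(x)
    return float(resid[upper].std() / resid[~upper].std())

raw_ratio = residual_spread_ratio(x, salary)
log_ratio = residual_spread_ratio(x, np.log(salary))
print(f"spread ratio, raw salary: {raw_ratio:.2f}")
print(f"spread ratio, log salary: {log_ratio:.2f}")
```

A ratio well above 1 means the residuals fan out as x increases; a ratio close to 1 means roughly constant spread. With real data, a plot of residuals against fitted values is the usual first check.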