Thread: increasing the r squared

  1. #1

    I have run a regression with 4 variables, 1 being independent. My R-squared is 0.25. Is there any way I can increase this, or is there a better statistic that shows the power of the model? Sorry, I'm new.

  2. #2
    Link

    It's ok. We'll help wherever we can.

    An easy way of increasing your r-square value is to just add more predictors (i.e. independent variables) into your regression.

  3. #3

    I have also put the lagged independent variable y on the right-hand side of the equation, as y-1. My R-squared is now 1. Is this OK?

  4. #4
    jpkelley

    Lagging is fine, as long as it fits your question. Could you tell us more about your question (i.e. what your goals are for your modeling)? If your goal is prediction, then you might want to add more variables (some of which may account for very small amounts of the variation in y). If your goal is to find a single model that best balances explanatory power with simplicity, then you will need to take a model simplification/selection approach.

  5. #5
    noetsi

    Adding variables will never decrease your R-squared (and will almost always increase it), but you should only do that if it makes substantive sense. You should use adjusted R-squared in any case.
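
    As a rough sketch of why adjusted R-squared is the safer number to report, the usual formula can be written in a few lines of Python. The sample size of 70 and the predictor counts below are hypothetical values chosen for illustration, not figures taken from the original regression.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for k predictors (intercept not counted) and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical: the same raw R-squared of 0.25 reached with 1, 3 or 10 predictors.
for k in (1, 3, 10):
    print(k, round(adjusted_r2(0.25, n=70, k=k), 3))
```

    For a fixed raw R-squared, the adjusted value falls as predictors are added, so an extra variable only raises it when it explains enough variation to pay for the lost degree of freedom.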

    If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model. You may also have the wrong functional form, e.g., the relationship should be curvilinear rather than linear.
    "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." John Tukey

  6. #6

    My model reads y + a1 + a2 + y-1, where y is the salary, a1 = years employed, and a2 = years of education. The problem I'm having is that some salaries are higher than others according to male and female, and I'm trying to see if there is any statistical significance. At the moment the t value for a1 is a lot higher than the one for a2. Does this mean that variable has more influence on salary? Also, do I need a dummy variable for gender to show the difference? Thanks.

  7. #7

    How would I work in logs? Would I just need to do this: log(y) + log(a1), etc.?
    Last edited by diesel20056; 12-19-2011 at 01:40 PM. Reason: wrong equation

  8. #8
    noetsi

    Yes, but why is the y value on the right of the equation?

    Normally the form is Y = b + b1X1 + b2X2 + ..., or something like that.

    This might help

    http://en.wikipedia.org/wiki/Data_tr...on_(statistics)

  9. #9
    jpkelley

    The scope of this discussion appears to be expanding very rapidly. Maybe it would be good to slow down and go back to your original question.

    Your question as I understand it: What factors influence salaries?
    As independent variables, you have years employed, years of education, and sex.

    Your goal, as it appears to me is NOT prediction, per se (but I could be wrong), but rather to determine how these factors (and their interactions) influence salary (y). At this basic stage AND given only this information, the purpose of including a lag into the analysis is not apparent. If the goal is to determine what factors influence salary, I don't see why you are wanting to increase R-squared. Again, I could be wrong about your approach. Regardless, I think it might be good to step back and think more about your data.

    - Do you have many years' data from each individual (that is, do you have data from John in 2002, 2003, 2004, and did he obtain more education in 2004)?
    - Is your data unbalanced?
    - What does that variation between males and females look like?
    - Sample size?
    - Salaries adjusted for inflation, etc.?

    And, as noetsi suggested, might your data be nonlinear rather than linear?

    On the practical side:
    - What program are you using for analysis?

    Until you ask some of these questions about your data, I would not move towards data-transformation. You might read some of the discussion on this thread:
    http://www.talkstats.com/showthread....understandings

    As a side note:
    "If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model."
    I would be careful of log-transforming (or using arcsin transformation) on such bounded data, and this probably would not help improve the normality of your model residuals at all. It has been shown that even the proper arcsin transformation for proportional data does not hold up well at the boundaries (<20% and >80%). Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!

    Anyway, my suggestion is to back up and identify your goals for modeling. And, should your goal be what I mentioned above, I would suggest simplifying your model based on some criterion that penalizes extra terms in the model (adjusted R-squared, AIC, etc.). You'll end up with a final model that will help you understand the influence of sex, years of education, and whatnot on current salaries.
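
    To make the idea of a penalized criterion concrete, here is a minimal sketch of AIC in its common least-squares form, n*ln(RSS/n) + 2p, which holds up to an additive constant that cancels when models fitted to the same data are compared. The residual sums of squares and parameter counts are made-up numbers, and counting p as the number of fitted coefficients including the intercept is an assumption of this sketch.

```python
import math


def gaussian_aic(rss: float, n: int, n_params: int) -> float:
    """AIC for a least-squares fit, up to a constant shared by models on the same data."""
    return n * math.log(rss / n) + 2 * n_params


# Hypothetical comparison on n = 70 observations:
# the larger model lowers the residual sum of squares only slightly.
small = gaussian_aic(rss=120.0, n=70, n_params=3)
large = gaussian_aic(rss=118.0, n=70, n_params=5)
print(round(small, 2), round(large, 2))  # the lower value is the preferred model
```

    In this made-up case the two extra terms do not reduce the residual variation enough to offset their penalty, so the smaller model has the lower AIC.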

  10. #10
    noetsi

    In terms of using dummy predictor variables

    yes, this is a common thing to do. wage regressions are a good example. economists usually take the log of the dependent variable (wages). and then some regressors (such as education) typically enter the equation linearly -- including some dummy variables like race or sex dummies. while other regressors (such as parents' income) typically enter in log form.
    http://www.stata.com/statalist/archi.../msg00155.html
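
    As a small illustration of what a 0/1 dummy does in the simplest possible case (one dummy, no other regressors, no logs), the sketch below fits a least-squares line by hand. The salaries are invented and the names `female` and `fit_simple_ols` are hypothetical. The intercept comes out as the mean of the group coded 0, and the dummy's coefficient as the difference between the two group means.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical salaries in thousands; female = 1 for women, 0 for men.
female = [0, 0, 0, 1, 1, 1]
salary = [52.0, 48.0, 50.0, 44.0, 46.0, 45.0]

intercept, slope = fit_simple_ols(female, salary)
print(intercept, slope)  # mean salary of the men, and the women-minus-men gap
```

    With other regressors in the model (years employed, years of education), the dummy's coefficient becomes the gap between the groups after holding those regressors fixed, which needs a multiple regression routine rather than this one-variable helper.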

    This is useful.

    ln(Y) = a + bX + cD + ε . (1)


    Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, 100b is the percentage change in Y for a small change in X (up or down), other things held equal.


    Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough those same people who go about this the correct way when computing marginal effects in the case of Logit and Probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D gives you the correct interpretation immediately:


    If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)


    If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
    http://davegiles.blogspot.com/2011/0...r-dummies.html
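
    Formulas (2) and (3) from the quoted passage are easy to check numerically. The sketch below assumes a hypothetical dummy coefficient of c = 0.20 in a ln(Y) regression; the function name is made up for illustration.

```python
import math


def dummy_pct_effect(c: float, switch_on: bool = True) -> float:
    """Percentage change in Y when a 0/1 dummy with coefficient c in a ln(Y) model switches."""
    exponent = c if switch_on else -c
    return 100.0 * (math.exp(exponent) - 1.0)


c = 0.20  # hypothetical estimated coefficient on the dummy
print(round(dummy_pct_effect(c), 2))                   # D goes from 0 to 1
print(round(dummy_pct_effect(c, switch_on=False), 2))  # D goes from 1 to 0
print(100 * c)                                         # the naive reading, for comparison
```

    The two directions are not mirror images of each other, and neither equals the naive 100c; the gap between the naive and correct figures grows as c moves away from zero.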

    I always thought that if you transformed the Y value to a log you had to do the same to all predictor variables. But neither of the comments above suggests this.

  11. #11
    noetsi

    "Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!"
    But such transformations are common (particularly in economics and business), possibly because most people don't know GLMMs (by which I assume you mean methods like logistic regression or probit). The common wisdom I have read is that one should use logs with ceilings and floors (although the common wisdom I have been taught appears to be commonly wrong).

    Reading the original post I think the point is to increase R squared (that is better predictive capacity).

    I found this useful in explaining the use of logs

    http://www.ats.ucla.edu/stat/sas/faq...erpret_log.htm

  12. #12
    jpkelley

    They can cause trouble. For proportional data, the arcsin(sqrt(x)) transformation is normally used. Again, at the boundaries, these do not work. As an example, I generated a highly skewed set of proportional data (i.e. close to 1) and then plotted the raw data, log-transformed, and the arcsin transformation.
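
    For readers who want to try this themselves, here is a minimal sketch of the two transformations applied to proportions close to 1. The values are invented for illustration and are not the data behind the attached plot.

```python
import math


def arcsine_sqrt(p: float) -> float:
    """Arcsine square-root transform of a proportion in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


# Hypothetical proportions piled up near the upper boundary.
proportions = [0.90, 0.95, 0.98, 0.99, 1.00]
for p in proportions:
    print(p, round(math.log(p), 4), round(arcsine_sqrt(p), 4))
```

    The arcsine square-root values are still bounded (between 0 and pi/2), so observations sitting at or near the boundary stay bunched together after transformation.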

    Is this what you were talking about?

    [EDIT: The right panel histogram should be labeled "arcsin transformed"]
    [Attached image: histograms of the raw, log-transformed, and arcsin-transformed data]
    Last edited by jpkelley; 12-19-2011 at 02:41 PM. Reason: histogram title

  13. #13
    noetsi

    Actually I was talking about this type of recommendation "Three commonly cited transformations to reduce positive skewness, in order of their impact are as follows: Square root transformation, log transformation, and reflected inverse transformation" from "Data Analysis Using SAS Enterprise Guide." The same book states that transformations are commonly recommended (which is my observation of the literature as well).

    This notes that floors commonly cause positive skew.

    http://books.google.com/books?id=pU6...20skew&f=false

    I was wrong to assume that both ceilings and floors could be addressed by logs, since the negative skew caused by ceilings is not something a log transformation is used for.

  14. #14

    - Is your data unbalanced?
    I think so. The wages are higher for the male section than for the female section of the data, which suggests men are getting paid more for the same work.

    - What does that variation between males and females look like?
    - Sample size? 70, with 35 of each sex.
    - Salaries adjusted for inflation, etc.? Yes.

    The relationship does look linear. I see two horizontal lines going from lower to higher when the males are looked at. I was wanting to add a log to make it more accurate; is this not needed? Without the log of salary the R-squared is 0.3. Also, one of the t stats for the independents is negative; the other 2 are positive.

  15. #15
    noetsi

    With financial data it is not unusual to see heteroskedasticity. That is, the spread of the residuals increases as the values of the data points get larger. Ignoring issues of skew, a log may help address this and (sometimes) make your explained variance better as a result.
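
    A toy example of the effect, under the assumption that the noise is multiplicative: y is x times a factor that alternates between 0.8 and 1.25. The data are made up, and the deviations are measured from the known noise-free line y = x rather than from a fitted regression.

```python
import math

# Hypothetical data with multiplicative noise: the spread in y grows with x.
xs = [float(v) for v in range(10, 110, 10)]  # 10, 20, ..., 100
factors = [0.8, 1.25] * 5
ys = [x * f for x, f in zip(xs, factors)]


def spread(values):
    """Range of a list of deviations (largest minus smallest)."""
    return max(values) - min(values)


# Deviations from the noise-free line y = x, in levels and in logs.
level_dev = [y - x for x, y in zip(xs, ys)]
log_dev = [math.log(y) - math.log(x) for x, y in zip(xs, ys)]

low, high = slice(0, 4), slice(6, 10)  # smallest and largest x values
print(spread(level_dev[low]), spread(level_dev[high]))  # spread grows with x
print(spread(log_dev[low]), spread(log_dev[high]))      # roughly equal in logs
```

    In levels the deviations fan out as x increases, while on the log scale they have about the same spread at both ends, which is the situation in which a log of the response tends to help.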
