- Thread starter diesel20056

Lagging is fine, as long as it fits your question. Could you tell us more about your question (i.e., what your goals are for your modeling)? If your goal is prediction, then you might want to add more variables (some of which may account for very small amounts of the variation in y). If your goal is to find a single model that best balances explanatory power with simplicity, then you will need to take a model simplification/selection approach.

Adding variables will always increase your R-squared, but you should only do that if it makes substantive sense. You should use adjusted R-squared in any case.
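The adjusted R-squared penalty is easy to compute directly from the standard formula; a minimal sketch in Python (plain formula, no particular stats package assumed, and the n = 70 below just echoes the sample size mentioned later in this thread):

```python
# Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - k - 1),
# where n = sample size and k = number of predictors (excluding the intercept).
def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With the same raw R^2, more predictors means a bigger penalty:
r2 = 0.30
print(adjusted_r_squared(r2, 70, 2))   # two predictors
print(adjusted_r_squared(r2, 70, 10))  # ten predictors: lower adjusted R^2
```

This is why adding variables "for free" is an illusion: the raw R-squared can only go up, but the adjusted version goes down unless the new term earns its keep.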

If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model. You may also have the wrong functional form, e.g., it should be curvilinear rather than linear.

My model reads y = a1 + a2 + y(t-1), where y is salary, a1 = years employed, and a2 = years of education. The problem I'm having is that some salaries are higher than others according to male and female, and I'm trying to see if there is any statistical significance. At the moment the t value for a1 is a lot higher than for a2. Does this mean that variable has more influence on salary? Also, do I need a dummy variable for gender to show the difference? Thanks.

Yes but why is the y value on the right of the equation?

Normally the form is Y = b + b1X1 + b2X2 ..., or something like that.
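As a sketch of that form, here is one way to fit Y = b + b1X1 + b2X2 (plus a 0/1 sex dummy, since that came up) by ordinary least squares with NumPy. The data below are simulated for illustration only; the variable names and coefficient values are assumptions, not numbers from the thread:

```python
import numpy as np

# Illustrative data: salary as a function of years employed,
# years of education, and a 0/1 sex dummy.
rng = np.random.default_rng(0)
n = 70
years_employed = rng.uniform(1, 30, n)
years_education = rng.uniform(10, 20, n)
male = rng.integers(0, 2, n)
salary = (20_000 + 900 * years_employed + 1_500 * years_education
          + 4_000 * male + rng.normal(0, 3_000, n))

# Design matrix: a column of ones for the intercept b, then X1, X2, dummy.
X = np.column_stack([np.ones(n), years_employed, years_education, male])
coefs, *_ = np.linalg.lstsq(X, salary, rcond=None)
b, b1, b2, c = coefs
print(f"intercept={b:.0f}, years={b1:.0f}, education={b2:.0f}, male={c:.0f}")
```

The dummy's coefficient c is then the estimated salary gap between the two groups, holding the other predictors constant.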

This might help

http://en.wikipedia.org/wiki/Data_transformation_(statistics)

The scope of this discussion appears to be expanding very rapidly. Maybe it would be good to slow down and go back to your original question.

As independent variables, you have years employed, years of education, and sex.

Your goal, as it appears to me, is NOT prediction per se (but I could be wrong), but rather to determine how these factors (and their interactions) influence salary (y). At this basic stage, and given only this information, the purpose of including a lag in the analysis is not apparent. If the goal is to determine what factors influence salary, I don't see why you want to increase R-squared. Again, I could be wrong about your approach. Regardless, I think it might be good to step back and think more about your data.

- Do you have many years' data from each individual (that is, do you have data from John in 2002, 2003, 2004, and did he obtain more education in 2004)?

- Is your data unbalanced?

- What does that variation between males and females look like?

- Sample size?

- Salaries adjusted for inflation, etc.?

And, as noetsi suggested, might your data be nonlinear rather than linear?

On the practical side:

- What program are you using for analysis?

Until you ask some of these questions about your data, I would not move towards data-transformation. You might read some of the discussion on this thread:

http://www.talkstats.com/showthread.php/18573-Frequent-Statistical-Misunderstandings

As a side note:

If your Y is bounded on one or both ends (for example, percentages that go from 0 to 100), logging your data is one way to improve your model.

Anyway, my suggestion is to back up and identify your goals for modeling. And, should your goal be what I mentioned above, I would suggest simplifying your model based on some criterion that penalizes extra terms in the model (adjusted R-squared, AIC, etc.). You'll end up with a final model that will help you understand the influence of sex, years of education, and whatnot on current salaries.
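For a least-squares fit, AIC can be computed from the residual sum of squares; a minimal sketch using the common form AIC = n·ln(RSS/n) + 2k (equivalent up to an additive constant; the numbers below are made up for illustration):

```python
import math

# AIC for a least-squares model: n * ln(RSS / n) + 2k,
# where k counts estimated parameters. Lower is better.
def aic_ls(rss, n, k):
    return n * math.log(rss / n) + 2 * k

# A bigger model must reduce RSS enough to pay for its extra terms:
print(aic_ls(1.0e9, 70, 3))  # simpler model
print(aic_ls(0.9e9, 70, 6))  # better fit, but three more parameters
```

The point of the criterion is exactly the trade-off described above: extra terms are only kept when the improvement in fit outweighs the 2k penalty.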

In terms of using dummy predictor variables

Yes, this is a common thing to do. Wage regressions are a good example: economists usually take the log of the dependent variable (wages). Some regressors (such as education) typically enter the equation linearly, along with dummy variables like race or sex dummies, while other regressors (such as parents' income) typically enter in log form.

This is useful.

ln(Y) = a + bX + cD + ε . (1)

Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, 100b is the percentage change in Y for a small change in X (up or down), other things held equal.

Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course, that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough, those same people who go about this the correct way when computing marginal effects in logit and probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D, gives the correct interpretation immediately:

If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)

If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
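Equations (2) and (3) are easy to check numerically; a minimal sketch (the coefficient value c = 0.20 is just an example):

```python
import math

# Percentage impact of a 0/1 dummy D on Y in the log-linear model
# ln(Y) = a + bX + cD: exponentiate c, don't read it as a percentage.
def dummy_pct_effect(c, from_zero_to_one=True):
    return 100 * (math.exp(c if from_zero_to_one else -c) - 1)

c = 0.20
print(dummy_pct_effect(c))         # D: 0 -> 1, eq. (2)
print(dummy_pct_effect(c, False))  # D: 1 -> 0, eq. (3)
# Both differ from the naive reading "100c = 20%", and note the two
# directions are not mirror images of each other.
```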

I always thought that if you transformed the Y value to a log you had to do the same to all predictor variables. But neither of the comments above suggests this.

Given the availability of models with link functions (e.g. GLMMs and their extensions) that allow the user to specify the distribution of their response variable, transformations generally aren't necessary. In my opinion, they just cause trouble. Trouble, I say!

Reading the original post, I think the point is to increase R-squared (that is, better predictive capacity).

I found this useful in explaining the use of logs

http://www.ats.ucla.edu/stat/sas/faq/sas_interpret_log.htm

They can cause trouble. For proportional data, the arcsin(sqrt(x)) transformation is normally used. Again, at the boundaries, these do not work. As an example, I generated a highly skewed set of proportional data (i.e. close to 1) and then plotted the raw data, log-transformed, and the arcsin transformation.
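The arcsine square-root transformation mentioned above, sketched in Python (it is only defined for proportions in [0, 1], and as noted it does not help at the exact boundaries, which map to fixed points):

```python
import math

# Arcsine square-root transform for proportions p in [0, 1].
def arcsin_sqrt(p):
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(p))

# Values near 1 (highly skewed proportional data) get stretched out
# relative to the middle of the range, which reduces the skew:
for p in (0.5, 0.9, 0.99, 1.0):
    print(p, round(arcsin_sqrt(p), 4))
```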

Is this what you were talking about?

[EDIT: The right panel histogram should be labeled "arcsin transformed"]

This notes that floors commonly cause positive skew.

http://books.google.com/books?id=pU...v=onepage&q=what causes positive skew&f=false

I was wrong to assume both ceilings and floors could be addressed by logs, since negative skew caused by ceilings does not use them.

I think so. The wages are higher for the male section than the female section of the data, which suggests men are getting paid more for the same work.

- What does that variation between males and females look like?

- Sample size? It is 70; 35 of each sex.

- Salaries adjusted for inflation, etc.? Yes.

The relationship does look linear; I see two horizontal lines going lower to higher when the males are looked at. I was wanting to add a log to make it more accurate. Is this not needed? Without the log of salary the R-squared is 0.3. Also, one of the t stats for the independent variables is negative; the other two are positive.

What's considered poor? Ideally I would like 1.

You're smart to look for autocorrelation in your data. I would be more interested to know if you have repeated-measures data (more than one data row or measurement per individual).

Good luck. Sounds like a cool data set!