# Relative impact of regressors on Y.

#### noetsi

##### Fortran must die
A question I get asked a lot is, if we have these three predictors of Y, which of the 3 has the most, next most and least impact. I have tried various ways and never come up with an approach I am really happy with.

I need to do this for both interval and binary DV.

#### ondansetron

##### TS Contributor
> A question I get asked a lot is, if we have these three predictors of Y, which of the 3 has the most, next most and least impact. I have tried various ways and never come up with an approach I am really happy with.
>
> I need to do this for both interval and binary DV.
The problem with doing this is that it's usually hard to justify a ranking based purely on the size of the estimated beta coefficients. Assume we regress the price of a used car (Y) on mileage, number of previous owners, and transmission type (X1, X2, X3).

The classic slope interpretation would be: for every 1-unit increase in X(n), we expect Y to increase or decrease by |beta(n)|, holding all else constant.

The issue arises because you can't easily say that increasing mileage by 1 mile is equivalent to a 1 person increase in previous owners. The units are different, so it doesn't really make sense to say which has the "most impact" on the DV. Sure, one may elicit a larger change in the DV, but that comes from a given change in X(n), which might not be equal to that same change in another X variable.

I think one (partial) solution is to standardize (at least) the predictors. This way you can say that a 1 SD change in X1 produces a larger change in Y than a 1 SD change in X2. The standard deviations still carry units of measure, so it's not a perfect solution, but it does help: it puts those 1-unit increases on a common scale of statistical "unusualness" within their respective distributions.
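As a concrete sketch of the standardization idea, here is the used-car example with synthetic data (all coefficients, sample size, and units are invented for illustration):

```python
import numpy as np

# Synthetic used-car data; every number here is made up for illustration.
rng = np.random.default_rng(0)
n = 500
mileage = rng.normal(60_000, 20_000, n)
owners = rng.poisson(2, n).astype(float)
auto = rng.integers(0, 2, n).astype(float)      # 1 = automatic transmission
price = (20_000 - 0.10 * mileage - 1_000 * owners
         + 1_500 * auto + rng.normal(0, 1_000, n))

X = np.column_stack([mileage, owners, auto])
# Standardize predictors and response, so each slope reads "SDs of Y per SD of X".
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (price - price.mean()) / price.std()

A = np.column_stack([np.ones(n), Xz])
beta = np.linalg.lstsq(A, yz, rcond=None)[0][1:]    # drop the intercept

names = np.array(["mileage", "owners", "transmission"])
order = np.argsort(-np.abs(beta))
print(dict(zip(names, np.round(beta, 3))))
print("ranked by |standardized beta|:", names[order].tolist())
```

On the raw scale the mileage slope looks tiny next to the others; after standardizing, the per-SD slopes become directly comparable, which is exactly the (partial) fix described above.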

Thoughts?

#### spunky

##### King of all Drama
If by "impact" you mean something like "which predictor contributes the most to the R-squared measure" you could use something like Pratt's relative importance measure or Budescu's dominance analysis. Unless you have something weird going on (e.g. suppression, multicollinearity, etc.) they usually agree quite a bit and they break down the R-squared into the percentage of explained variance that each predictor contributes towards the overall model fit.
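For readers unfamiliar with Pratt's measure, here is a minimal numpy sketch on synthetic data (names and numbers invented): each predictor's share is its standardized beta times its simple correlation with Y, and the shares sum to R-squared.

```python
import numpy as np

# Synthetic data with mild collinearity between the first two predictors.
rng = np.random.default_rng(1)
n = 1_000
X = rng.normal(size=(n, 3))
X[:, 1] += 0.3 * X[:, 0]
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

# Standardize so the OLS slopes are standardized betas.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

A = np.column_stack([np.ones(n), Xz])
coef, *_ = np.linalg.lstsq(A, yz, rcond=None)
beta = coef[1:]
resid = yz - A @ coef
r2 = 1 - resid @ resid / (yz @ yz)      # yz is centered, so TSS = yz @ yz

r_xy = Xz.T @ yz / n                    # simple correlations of each X with Y
pratt = beta * r_xy                     # Pratt's measure: one R^2 share per predictor
print("Pratt shares:", np.round(pratt, 3), "sum:", round(float(pratt.sum()), 3))
```

The decomposition is exact for OLS, though (as noted above) suppression can make individual shares negative, which is one of the standard criticisms of the index.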

#### hlsmith

##### Omega Contributor
Look up partial R² and omega squared. That is what I usually use. A more intensive way would be to see which variable has the highest variable importance using LASSO regression or elastic net, but these aren't really accessible in base SAS.
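To make the LASSO variable-importance idea concrete, here is a toy coordinate-descent LASSO on standardized synthetic predictors. This is a bare-bones sketch, not a substitute for a production implementation such as R's glmnet; the data and penalty value are invented.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Toy coordinate-descent LASSO; assumes columns of X are standardized."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]            # partial residual for x_j
            rho = X[:, j] @ r / n
            denom = X[:, j] @ X[:, j] / n
            # soft-thresholding: small contributions are shrunk exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / denom
    return b

rng = np.random.default_rng(2)
n = 1_000
X = rng.normal(size=(n, 3))
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * Xz[:, 0] + 0.5 * Xz[:, 1] + rng.normal(size=n)   # third column is pure noise

b = lasso_cd(Xz, y, lam=0.3)
print("LASSO coefficients:", np.round(b, 3))    # the noise variable drops to 0
```

The point for importance ranking is that the penalty zeroes out weak predictors first, so the surviving coefficients (and the order in which variables enter as the penalty is relaxed) give a rough importance ordering.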

#### ondansetron

##### TS Contributor
I guess it would depend which way you want to say it "impacts" Y (explained variation vs magnitude of change in Y). The latter is the one I've heard people try to do more commonly, which is why I phrased my response that way, but the former is shown in many stat packages, and I think it's less controversial.

#### hlsmith

##### Omega Contributor
I will throw this out there, just to add to the overall list. There are standardized estimates in linear regression.

Also, much like the LASSO suggestion: if you have a sufficient amount of data, you can run cross-validation and see how well variables perform in other subsamples.
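One simple way to operationalize that cross-validation suggestion (a sketch on synthetic data, with invented coefficients) is to drop one predictor at a time and measure how much the out-of-sample error worsens:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 600, 5
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # third predictor has no signal

folds = np.array_split(np.arange(n), k)     # rows are already in random order

def cv_mse(cols):
    """Mean k-fold out-of-sample MSE for an OLS model using the given columns."""
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        A_tr = np.column_stack([np.ones(len(train_idx)), X[train_idx][:, cols]])
        b, *_ = np.linalg.lstsq(A_tr, y[train_idx], rcond=None)
        A_te = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, cols]])
        errs.append(np.mean((y[test_idx] - A_te @ b) ** 2))
    return float(np.mean(errs))

full = cv_mse([0, 1, 2])
# Importance = how much CV error increases when each variable is left out.
importance = {j: cv_mse([c for c in (0, 1, 2) if c != j]) - full for j in (0, 1, 2)}
print({f"x{j + 1}": round(v, 3) for j, v in importance.items()})
```

A variable whose removal barely moves the cross-validated error is, by this criterion, contributing little beyond the others.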

#### noetsi

##### Fortran must die
For linear models I have used standardized betas, as suggested. The problem with that approach is that there is real question about whether standardized betas make sense when some of the predictors are dummy variables, and there are almost always dummy variables in my models. Commonly, given what I analyze, there are more of them than interval predictors.

Using impact on R squared is an interesting idea, although obviously it does not work with a categorical DV. For categorical DVs I have used, based on suggestions here years ago, the magnitude of each predictor's Wald statistic to rank impact. SAS does something very similar with one of its built-in options for binary DVs.
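The Wald-statistic ranking can be sketched from scratch like this (synthetic data, invented coefficients; in practice you would read the Wald chi-squares straight off PROC LOGISTIC output rather than hand-rolling the fit). Note the ranking is most defensible when the predictors are on comparable scales, as they are here:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
X = rng.normal(size=(n, 3))                          # unit-variance predictors
eta = 1.2 * X[:, 0] - 0.6 * X[:, 1] + 0.2 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

A = np.column_stack([np.ones(n), X])
b = np.zeros(A.shape[1])
for _ in range(25):                                  # Newton-Raphson for the logistic MLE
    p = 1 / (1 + np.exp(-A @ b))
    H = A.T @ (A * (p * (1 - p))[:, None])           # information matrix
    step = np.linalg.solve(H, A.T @ (y - p))
    b += step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(H)))              # asymptotic standard errors
wald_z = b / se
order = np.argsort(-np.abs(wald_z[1:]))              # rank predictors, skip intercept
print("|Wald z| ranking (0-based predictor index):", order.tolist())
```

One caveat: the Wald statistic mixes effect size with estimation precision, so it ranks "strength of evidence" rather than substantive impact per se.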

I don't know much about LASSO, although I will look into it. I also don't understand what this means (what do you do to do this)?

> If you have a sufficient amount of data, you can run cross-validation and see how good variables perform in other subsamples

#### hlsmith

##### Omega Contributor
Yes noetsi, I was just throwing STB out there to add to possible options. It too has weaknesses.

I think you can still use partial R² with categorical variables. It would be intuitive with binary variables; with more groups you would just have to make sure you mention what the reference group is when explaining.
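Here is what that looks like mechanically, on synthetic data (all numbers invented): partial R² for a continuous predictor and for a 3-level categorical predictor entered as a dummy-coded block, with level 0 as the reference group.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 800
x1 = rng.normal(size=n)
g = rng.integers(0, 3, n)                              # categorical with levels 0/1/2
D = np.column_stack([g == 1, g == 2]).astype(float)    # dummies vs. reference level 0
y = 1.0 * x1 + 0.8 * D[:, 0] + 1.6 * D[:, 1] + rng.normal(size=n)

def sse(*blocks):
    """Residual sum of squares from an OLS fit on the given predictor blocks."""
    A = np.column_stack([np.ones(n), *blocks])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    return r @ r

full = sse(x1, D)
# Partial R^2: fraction of the reduced model's residual SS explained by the added block.
partial_x1 = (sse(D) - full) / sse(D)
partial_g = (sse(x1) - full) / sse(x1)
print("partial R^2, x1:", round(float(partial_x1), 3),
      " categorical block:", round(float(partial_g), 3))
```

Treating the whole dummy set as one block sidesteps the reference-coding arbitrariness: the categorical variable gets a single partial R², regardless of which level is the reference.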

#### spunky

##### King of all Drama
> Using impact on R squared is an interesting idea although obviously it does not work with categorical DV.
Well... not exactly. I mean, if you're willing to make a few assumptions about the categorical nature of your DV the Pratt index has been extended to logistic regression. And Azen extended dominance analysis for logistic regression as well. I'm almost sure they even have a SAS macro somewhere, but then again I don't use SAS so me doesn't know.

The 'relaimpo' package in R implements all of these R-squared partition measures and a few more, but I don't like the other ones.
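For intuition about what relaimpo's main measure (LMG, via its `calc.relimp` function) is doing, here is a rough Python sketch on synthetic data: average each predictor's R² increment over all possible orders of entry. With p predictors this enumerates p! orderings, so it is only practical for small p.

```python
import math
from itertools import permutations

import numpy as np

rng = np.random.default_rng(6)
n = 1_000
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]                    # correlated predictors (the hard case)
y = 1.0 * X[:, 0] + 0.7 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=n)

def r2(cols):
    """R^2 of an OLS fit using the listed predictor columns."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(n), X[:, cols]])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    yc = y - y.mean()
    return 1 - resid @ resid / (yc @ yc)

p = X.shape[1]
lmg = np.zeros(p)
for perm in permutations(range(p)):
    entered = []
    for j in perm:
        lmg[j] += r2(entered + [j]) - r2(entered)   # marginal R^2 gain at this position
        entered.append(j)
lmg /= math.factorial(p)

print("LMG shares:", np.round(lmg, 3), " sum (= model R^2):", round(float(lmg.sum()), 3))
```

Because the increments telescope within each ordering, the shares sum exactly to the full model's R², and averaging over orderings is what makes the split fair under collinearity.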

#### noetsi

##### Fortran must die
I am going to look those approaches up, spunky; I know neither. My comment on R squared is that there is no generally accepted pseudo R squared for logistic models. Last time I looked there were something like 33 of them, which differed significantly from each other.

#### spunky

##### King of all Drama
> I am going to look those approaches up spunky, I know neither. My comment on R square is that there is no generally accepted pseudo R square for logistic models, last time I looked there were like 33 of them which differed significantly from each other
You are absolutely right, which is why I covered my bases by saying "if you're willing to make a few assumptions about the categorical nature of your DV": the R-squared-type measure proposed in that article does require you to buy into a few things, or else it is nonsensical. I do not remember all of them, but a big one for me is that it requires you to assume the binary observed variable arose from the discretization of a continuous, latent variable. Now, if your observed DV is something like "correct/incorrect answer to a test," then sure, I'm willing to believe that maybe there is a latent aptitude that can only be measured through responses to a test. But if your variable is something more... concrete, like, oh I dunno, "man/woman, dead/alive, etc.," then I'd have trouble buying into the latent variable model, in which case the R-squared is nonsensical and the Pratt index is not appropriate.

#### noetsi

##### Fortran must die
> I do not quite remember all of them but a big one for me is that it requires you to assume that the binary observed variable arose from the discretization of a continuous, latent variable.
Some, but by no means all, interpretations of logistic regression assume exactly that.

BTW when you get your PHD are you going to continue to be humble or become an arrogant jerk ...

#### hlsmith

##### Omega Contributor
I just stumbled across this thread again. LASSO is available in SAS for models with a continuous outcome. It is one of the newer HP (high-performance) procedures.