# Thread: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method?

1. ## DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method?

Hello dear forum members!

I want to do a robustness check of the estimates that I have obtained previously using OLS estimation. In that case the DV was a rating score with min 0 and max 100. In this case the DV is a rank (I created a rank for 899 entities in the data set based on the corresponding rating score).

Shall I treat the new DV as a "count" and use something like Poisson regression? Or is it an ordinal outcome and shall be estimated with rank-ordered logistic regression? Or neither of the above?

Thank you for the comments and suggestions.

2. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

i say you're just fine with doing regular OLS multiple regression and treat your DV as 'continuous'. usually discrete variables with 5 options (think a likert scale that goes from 1=completely disagree to 5=completely agree) is OKish if the variables are reasonably symmetric. once it hits 7 points you can handle some non-zero skewness/kurtosis estimates... after 10 points you can pretty much treat it as continuous. you say you have 100 discretization points? i say go for it, run regular OLS
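A minimal sketch of that suggestion on simulated data. The single predictor `x`, the coefficients, and the noise level are made up for illustration, since the thread does not show the real data. It builds a rank from a score so that rank 1 is the highest score, then fits plain OLS to both the score and the rank:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 899  # same N as in the thread; everything else is simulated

x = rng.normal(size=n)                              # hypothetical predictor
score = 50 + 10 * x + rng.normal(scale=8, size=n)   # hypothetical rating score

# Rank 1 = highest score, rank n = lowest score.
order = np.argsort(-score)
rank = np.empty(n, dtype=int)
rank[order] = np.arange(1, n + 1)


def ols(y, X):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


beta_score, r2_score = ols(score, x)
beta_rank, r2_rank = ols(rank.astype(float), x)

print("slope on score:", beta_score[1], "R^2:", r2_score)
print("slope on rank: ", beta_rank[1], "R^2:", r2_rank)
```

Because rank 1 is the best score here, the slope on the rank should come out with the opposite sign to the slope on the score.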

3. ## The Following User Says Thank You to spunky For This Useful Post:

kiton (01-25-2015)

4. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by spunky
i say you're just fine with doing regular OLS multiple regression and treat your DV as 'continuous'. usually discrete variables with 5 options (think a likert scale that goes from 1=completely disagree to 5=completely agree) is OKish if the variables are reasonably symmetric. once it hits 7 points you can handle some non-zero skewness/kurtosis estimates... after 10 points you can pretty much treat it as continuous. you say you have 100 discretization points? i say go for it, run regular OLS

Thank you for prompt and useful reply, Spunky. I appreciate it greatly.

I also came across such reasoning and it looks to make sense. However, I have a couple questions:

- Do you think normalization of the DV is required/reasonable? It gives a higher R^2 compared to non-normalized rank (almost the same as my major analysis).
- Distribution of the residuals becomes a little messed up (does not pass the normality test) - is it a problem for a robustness check then?

Thanks!

5. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by kiton
- Do you think normalization of the DV is required/reasonable? It gives a higher R^2 compared to non-normalized rank (almost the same as my major analysis).
when you say 'normalization' do you mean 'standardization'? i'm just a little bit thrown off here because the R2 should be the same regardless of whether variables are standardized or not.... so maybe what you mean by normalization is something else.
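The point about standardization can be checked numerically. The sketch below uses simulated data with made-up coefficients and one predictor. It fits the same simple regression before and after z-scoring both variables and compares the two R^2 values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 899
x = rng.normal(loc=5.0, scale=2.0, size=n)          # hypothetical predictor
y = 3.0 + 1.5 * x + rng.normal(scale=4.0, size=n)   # hypothetical outcome


def r_squared(y, x):
    """R^2 of an OLS fit of y on x with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def standardize(v):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)


r2_raw = r_squared(y, x)
r2_std = r_squared(standardize(y), standardize(x))
print(r2_raw, r2_std)  # the two values agree up to floating-point error
```

Standardizing only shifts and rescales the variables, so the fit is unchanged. A nonlinear transformation such as a log is a different matter, because it changes the shape of the relationship and not just its units.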

Originally Posted by kiton
Distribution of the residuals becomes a little messed up (does not pass the normality test) - is it a problem for a robustness check then?
is this the same dataset from that other thread you asked about where you have 100s of sample units? if that is the case, then i wouldn't worry too much. i'm not sure if it's as clearly stated in that regression article as we mention here, but lack of normality of the residuals is the least of your problems in terms of violation of the assumptions if your sample size is very big. 100s of sample units = very big.
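A rough simulation of why a large N helps, under assumed conditions: one predictor, a made-up true slope of 2, and deliberately skewed errors. It is only an illustration of the large-sample argument, not a check on the actual data in this thread:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 899, 2000
true_slope = 2.0

x = rng.normal(size=n)   # predictor held fixed across simulated samples
xc = x - x.mean()

# Errors are exponential minus 1: mean zero, strongly right-skewed.
errors = rng.exponential(scale=1.0, size=(reps, n)) - 1.0
Y = 1.0 + true_slope * x + errors   # shape (reps, n), one sample per row

# OLS slope for one predictor plus intercept: sum(xc * y) / sum(xc^2).
slopes = (Y @ xc) / (xc @ xc)


def skewness(v):
    """Sample skewness (third standardized moment)."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)


print("skewness of the errors:         ", skewness(errors.ravel()))
print("skewness of the slope estimates:", skewness(slopes))
print("mean slope estimate:            ", slopes.mean())
```

The errors are far from normal, yet the slope estimates across the simulated samples are close to symmetric around the true value, which is what the usual t-tests rely on.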

6. ## The Following User Says Thank You to spunky For This Useful Post:

kiton (01-26-2015)

7. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by spunky
when you say 'normalization' do you mean 'standardization'? i'm just a little bit thrown off here because the R2 should be the same regardless of whether variables are standardized or not.... so maybe what you mean by normalization is something else.

Sorry for the confusion. When I say normalization, I mean taking the natural log of the DV. The R^2 with the raw rank (1-899) is approximately 36%, and with ln(rank) it is 56%.

Originally Posted by spunky
is this the same dataset from that other thread you asked about where you have 100s of sample units? if that is the case, then i wouldn't worry too much. i'm not sure if it's as clearly stated in that regression article as we mention here, but lack of normality of the residuals is the least of your problems in terms of violation of the assumptions if your sample size is very big. 100s of sample units = very big.
The data set is the same, N=899. So, in this case I assume that the normality of the residuals for a robustness check should not matter that much. All other assumptions are met (except the fact that data is heteroskedastic, but I am addressing that).
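The post does not say how the heteroskedasticity is being addressed. One common option, assumed here purely for illustration, is heteroskedasticity-robust (White, HC1) standard errors. A sketch on simulated data with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 899
x = rng.normal(size=n)   # hypothetical predictor
# Error spread grows with |x|, so the errors are heteroskedastic by construction.
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: assume one common error variance.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 standard errors: sandwich estimator with an n / (n - k) correction.
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = XtX_inv @ meat @ XtX_inv * n / (n - k)
se_hc1 = np.sqrt(np.diag(cov_hc1))

print("coefficients:", beta)
print("classical SE:", se_classical)
print("HC1 SE:      ", se_hc1)
```

The coefficient estimates are the same either way; only the standard errors (and so the significance tests) change.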

Once again, thank you for feedback.

8. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

and what does the model look like? i think i remember you mentioning something about squaring a term or something right?

either way, R-squared is a measure of the linear fit of the model and i think you said there was some non-linearities there. if the relationship between the dependent variable and the predictors is non-linear then, yeah, i could see how adding some sort of transformation to linearize them or a non-linear term would improve your fit.

9. ## The Following User Says Thank You to spunky For This Useful Post:

kiton (01-26-2015)

10. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

When you get this many levels, 1) the data is interval-like and 2) methods such as multinomial regression become dysfunctional because interpretation is nearly impossible. So linear regression is reasonable. This is generally true once you get to 12 distinct levels. If R square goes up that much, my guess is that some assumption of the linear model was wrong. Logs usually correct for heteroskedasticity, but it makes sense that if the original model is non-linear and you transform it to make it linear, R square would go up, because R square measures the linear fit of a model only.
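A small illustration of that last point, on simulated data where ln(y) is linear in x by construction (the coefficients and noise level are made up). It compares the R^2 of a straight-line fit to y with the R^2 of the same fit to ln(y):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 899
x = rng.normal(size=n)   # hypothetical predictor
# ln(y) is linear in x by construction, so y itself is not.
y = np.exp(1.0 + 1.0 * x + rng.normal(scale=0.5, size=n))


def r_squared(y, x):
    """R^2 of an OLS fit of y on x with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


r2_raw = r_squared(y, x)
r2_log = r_squared(np.log(y), x)
print("R^2 with y:    ", r2_raw)
print("R^2 with ln(y):", r2_log)
```

Keep in mind that the two R^2 values describe variance explained on different scales (y versus ln(y)), so a higher value after the log does not by itself show that the logged model is the better one.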

11. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by spunky
and what does the model look like? i think i remember you mentioning something about squaring a term or something right?

either way, R-squared is a measure of the linear fit of the model and i think you said there was some non-linearities there. if the relationship between the dependent variable and the predictors is non-linear then, yeah, i could see how adding some sort of transformation to linearize them or a non-linear term would improve your fit.
The model is:

y1 = x1 + x1^2 + x2 + x3 + x4 + x1x2 + x1x3

OLS estimation with continuous DV (y1 - rating score 0-100, N=899) gives a perfect distribution of the residuals and passes the normality test. But OLS estimation with DV that I am trying to use for robustness check (i.e. y2 - a newly generated RANK based on the rating score 1-899, N=899) gives a "slightly messed up" distribution of the residuals. The significance of the estimated coefficients is consistent in both cases (y2 and ln(y2)), but the R^2's are different. So, I am trying to find relevant support for using ln(y2) in case of robustness check.

Also, all assumptions (based on a number of specification tests) are met for the original estimated model (i.e. y1). However, they are not completely met (again, based on specification tests) for the robustness check model (i.e. the one with y2 or ln(y2)). That concerns me.
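For reference, the model above can be written out as a design matrix and fitted to all three versions of the DV. This is only a sketch: the data are simulated, `true_beta` and the noise level are invented, and only the model form, N, and the construction of y2 come from the thread.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 899
x1, x2, x3, x4 = rng.normal(size=(4, n))   # hypothetical predictors


def design(x1, x2, x3, x4):
    """Columns: intercept, x1, x1^2, x2, x3, x4, x1*x2, x1*x3."""
    return np.column_stack(
        [np.ones(len(x1)), x1, x1 ** 2, x2, x3, x4, x1 * x2, x1 * x3]
    )


X = design(x1, x2, x3, x4)
true_beta = np.array([50.0, 4.0, -1.5, 3.0, 2.0, -2.0, 0.5, -0.5])  # made up
y1 = X @ true_beta + rng.normal(scale=5.0, size=n)   # simulated rating score

# y2: rank built from y1, with rank 1 = highest score.
order = np.argsort(-y1)
y2 = np.empty(n)
y2[order] = np.arange(1, n + 1)


def fit(y):
    """OLS coefficients for the design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b_y1, b_y2, b_lny2 = fit(y1), fit(y2), fit(np.log(y2))

names = ["const", "x1", "x1^2", "x2", "x3", "x4", "x1*x2", "x1*x3"]
for name, a, b, c in zip(names, b_y1, b_y2, b_lny2):
    print(f"{name:6s} y1: {a:8.3f}  y2: {b:9.3f}  ln(y2): {c:7.3f}")
```

Since a high score means a low rank, coefficients in the y2 and ln(y2) fits should generally carry the opposite sign to those in the y1 fit. The magnitudes are on different scales, so it is the direction and significance that are worth comparing.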

12. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

I don't think I have ever seen a "perfect" distribution of the residuals whatever that is

Nor am I sure what a "slightly messed up" distribution of the residuals means...

What is a robustness check? And why do it if the original model met the assumptions of the model?

13. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by noetsi
I don't think I have ever seen a "perfect" distribution of the residuals whatever that is

Nor am I sure what a "slightly messed up" distribution of the residuals means...
Attached is an example of a slightly messed up one )

Originally Posted by noetsi
What is a robustness check? And why do it if the original model met the assumptions of the model?
Good question... I just want to double-check (and "double-prove") that my results are correct (robust).

14. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Usually robust is used to refer to how well a method works when you violate the assumptions of the method, not to whether your conclusions are correct.

If the residuals were fine in the first model, and the model you ran to make sure your conclusions are right had worse residuals, I don't think you will improve the validity of your analysis. [Personally I suspect that no regression diagnostics are ever perfect, because you are always going to violate the assumptions to some extent with real data. The question then is really how robust the method is to the violation. Obviously you can correct for discovered problems through transformations, robust SE, etc. But if the initial diagnostics are OK, correcting, especially if it produces worse diagnostics, does not seem the way to go to me.]

But then I am hardly an expert in this

15. ## The Following User Says Thank You to noetsi For This Useful Post:

kiton (01-26-2015)

16. ## Re: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method

Originally Posted by kiton
The model is:

y1 = x1 + x1^2 + x2 + x3 + x4 + x1x2 + x1x3

OLS estimation with continuous DV (y1 - rating score 0-100, N=899) gives a perfect distribution of the residuals and passes the normality test. But OLS estimation with DV that I am trying to use for robustness check (i.e. y2 - a newly generated RANK based on the rating score 1-899, N=899) gives a "slightly messed up" distribution of the residuals. The significance of the estimated coefficients is consistent in both cases (y2 and ln(y2)), but the R^2's are different. So, I am trying to find relevant support for using ln(y2) in case of robustness check.
that's one heck of a model to be honest with you. two interaction terms and a quadratic effect? eeew! the interpretation will be complicated. but anyhoo... statistically speaking i still don't see any problems. yeah, the tails of the distribution of the residuals could be better... but your N is almost 1000!!! i mean that by itself puts you in a safe place when it comes to the distributional assumptions.

and, as i mentioned, if you're using a different DV that happens to not look as linear as the original one, i really don't see any issues to sticking with it. just like noetsi, i'm also wary of this "robustness" word you're throwing around because it seems you're just changing the original variable that you're modelling... in which case i don't think there should be any surprises that you're getting different results from the original analysis. but then again if you're worried about using the natural log of the dependent variable and the normality of the residuals i don't think that, in this case, there is anything to worry about.

17. ## The Following User Says Thank You to spunky For This Useful Post:

kiton (01-26-2015)
