
Thread: DV is a RANK with highest 1 and lowest 899 - what is the proper estimation method?

  #1 kiton

    Hello dear forum members!

    I want to do a robustness check of the estimates that I have obtained previously using OLS estimation. In that case the DV was a rating score with min 0 and max 100. In this case the DV is a rank (I created a rank for 899 entities in the data set based on the corresponding rating score).

    Shall I treat the new DV as a "count" and use something like Poisson regression? Or is it an ordinal outcome that should be estimated with rank-ordered logistic regression? Or neither of the above?

    Thank you for the comments and suggestions.

  #2 spunky

    i say you're just fine doing regular OLS multiple regression and treating your DV as 'continuous'. usually discrete variables with 5 options (think a likert scale that goes from 1=completely disagree to 5=completely agree) are OKish if the variables are reasonably symmetric. once it hits 7 points you can handle some non-zero skewness/kurtosis estimates... after 10 points you can pretty much treat it as continuous. you say you have 100 discretization points? i say go for it, run regular OLS

    source you can cite: http://psych.colorado.edu/~willcutt/...tulla_2012.pdf
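    As a rough sketch of this suggestion (not code from the thread; the simulated data, variable names, and Python/statsmodels usage below are assumptions for illustration), one could build a 1-899 rank from a rating-like score and fit it directly with OLS:

    Code:
    # Hypothetical illustration: treat a 1-899 rank DV as continuous in OLS.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import rankdata

    rng = np.random.default_rng(0)
    n = 899
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    rating = 50 + 10 * x1 + 5 * x2 + rng.normal(scale=15, size=n)  # rating-like score

    # rank 1 = highest rating, rank 899 = lowest (as in the original post)
    rank = n + 1 - rankdata(rating)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(rank, X).fit()
    print(fit.summary())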


  #3 kiton

    Quote Originally Posted by spunky View Post
    i say you're just fine doing regular OLS multiple regression and treating your DV as 'continuous'. usually discrete variables with 5 options (think a likert scale that goes from 1=completely disagree to 5=completely agree) are OKish if the variables are reasonably symmetric. once it hits 7 points you can handle some non-zero skewness/kurtosis estimates... after 10 points you can pretty much treat it as continuous. you say you have 100 discretization points? i say go for it, run regular OLS

    source you can cite: http://psych.colorado.edu/~willcutt/...tulla_2012.pdf
    Thank you for the prompt and useful reply, Spunky. I appreciate it greatly.

    I have also come across this reasoning and it seems to make sense. However, I have a couple of questions:

    - Do you think normalization of the DV is required/reasonable? It gives a higher R^2 compared to the non-normalized rank (almost the same as in my main analysis).
    - The distribution of the residuals becomes a little messed up (it does not pass the normality test; one way to run such a check is sketched below) - is that a problem for a robustness check then?

    Thanks!
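    A sketch of one way to run such a residual-normality check (the specific tests and the simulated data below are assumptions; the thread does not say which test kiton actually used):

    Code:
    # Checking OLS residuals for normality with two common tests.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 899
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.standard_t(df=4, size=n)  # heavy-tailed errors

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    resid = fit.resid

    print(stats.shapiro(resid))      # Shapiro-Wilk W statistic and p-value
    print(stats.jarque_bera(resid))  # Jarque-Bera statistic and p-value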

  #4 spunky

    Quote Originally Posted by kiton View Post
    - Do you think normalization of the DV is required/reasonable? It gives a higher R^2 compared to the non-normalized rank (almost the same as in my main analysis).
    when you say 'normalization' do you mean 'standardization'? i'm just a little bit thrown off here because the R2 should be the same regardless of whether variables are standardized or not.... so maybe what you mean by normalization is something else.
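    A small numerical illustration of this point (simulated data; the variable names and the log example are assumptions, not kiton's actual model): R^2 is unchanged by standardizing the DV, because standardization is a linear rescaling, but it does change under a non-linear transformation such as the natural log.

    Code:
    # R^2 is invariant to linear rescaling of the DV, but not to log-transforming it.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 899
    x = rng.normal(size=n)
    y = np.exp(1 + 0.8 * x + rng.normal(scale=0.5, size=n))  # positive, skewed DV

    X = sm.add_constant(x)
    r2_raw = sm.OLS(y, X).fit().rsquared
    r2_std = sm.OLS((y - y.mean()) / y.std(), X).fit().rsquared  # standardized DV
    r2_log = sm.OLS(np.log(y), X).fit().rsquared                 # log-transformed DV

    print(r2_raw, r2_std, r2_log)  # r2_raw == r2_std; r2_log differs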

    Quote Originally Posted by kiton View Post
    Distribution of the residuals becomes a little messed up (does not pass the normality test) - is it a problem for a robustness check then?
    is this the same dataset from that other thread you asked about, where you have hundreds of sample units? if that is the case, then i wouldn't worry too much. i'm not sure if it's stated as clearly in that regression article as we put it here, but lack of normality of the residuals is the least of your problems in terms of assumption violations if your sample size is very big. hundreds of sample units = very big.


  #5 kiton

    Quote Originally Posted by spunky View Post
    when you say 'normalization' do you mean 'standardization'? i'm just a little bit thrown off here because the R2 should be the same regardless of whether variables are standardized or not.... so maybe what you mean by normalization is something else.

    Sorry for the confusion. When I say normalization, I mean taking the natural log. The R^2 with the raw rank (1-899) is approximately 36%, and with ln(rank) it is 56%.



    Quote Originally Posted by spunky View Post
    is this the same dataset from that other thread you asked about, where you have hundreds of sample units? if that is the case, then i wouldn't worry too much. i'm not sure if it's stated as clearly in that regression article as we put it here, but lack of normality of the residuals is the least of your problems in terms of assumption violations if your sample size is very big. hundreds of sample units = very big.
    The data set is the same, N=899. So, in this case I assume that the normality of the residuals should not matter that much for a robustness check. All other assumptions are met (except that the data are heteroskedastic, but I am addressing that; one common fix is sketched below).

    Once again, thank you for the feedback.
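    A minimal sketch of one common way to address the heteroskedasticity mentioned above (the data are simulated and the choice of HC3 robust standard errors is an assumption, not necessarily what kiton did):

    Code:
    # Heteroskedasticity-consistent (HC3) standard errors in statsmodels.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 899
    x = rng.uniform(0, 10, size=n)
    y = 2 + 3 * x + rng.normal(scale=0.5 + 0.5 * x, size=n)  # error variance grows with x

    X = sm.add_constant(x)
    classical = sm.OLS(y, X).fit()              # classical (homoskedastic) SEs
    robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-robust SEs

    print(classical.bse)
    print(robust.bse)  # same coefficients, different standard errors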

  #6 spunky

    and what does the model look like? i think i remember you mentioning something about squaring a term, right?

    either way, R-squared is a measure of the linear fit of the model, and i think you said there were some non-linearities there. if the relationship between the dependent variable and the predictors is non-linear then, yeah, i could see how adding some sort of transformation to linearize it, or a non-linear term, would improve your fit.


  #7 noetsi

    When you get this many levels, 1) the data are interval-like and 2) methods such as multinomial regression become dysfunctional because interpretation is nearly impossible. So linear regression is reasonable. This is generally true once you get to about 12 distinct levels. If R square goes up that much, my guess is that some assumption of the linear model was violated. Logs usually correct for heterogeneity of variance, but it also makes sense that if the original model is non-linear and you transform it to make it linear, R square would go up, because R square measures only the linear fit of a model.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  #8 kiton

    Quote Originally Posted by spunky View Post
    and what does the model look like? i think i remember you mentioning something about squaring a term, right?

    either way, R-squared is a measure of the linear fit of the model, and i think you said there were some non-linearities there. if the relationship between the dependent variable and the predictors is non-linear then, yeah, i could see how adding some sort of transformation to linearize it, or a non-linear term, would improve your fit.
    The model is:

    y1 = b0 + b1*x1 + b2*x1^2 + b3*x2 + b4*x3 + b5*x4 + b6*x1*x2 + b7*x1*x3 + e

    OLS estimation with the continuous DV (y1, the rating score 0-100, N=899) gives a perfect distribution of the residuals and passes the normality test. But OLS estimation with the DV that I am trying to use for the robustness check (i.e., y2, a newly generated RANK based on the rating score, 1-899, N=899) gives a "slightly messed up" distribution of the residuals. The significance of the estimated coefficients is consistent in both cases (y2 and ln(y2)), but the R^2's are different. So, I am trying to find relevant support for using ln(y2) in the robustness check.

    Also, all assumptions (based on a number of specification tests) are met for the original estimated model (i.e., the one with y1). However, they are not completely met (again, based on specification tests) for the robustness-check model (i.e., the one with y2 or ln(y2)). That concerns me.
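    For concreteness, a hedged sketch of this y2-versus-ln(y2) comparison on simulated data (the column names, the data-generating process, and the use of the statsmodels formula interface are illustrative assumptions, not kiton's actual data or code):

    Code:
    # Fit the quadratic-plus-interactions model on the rank y2 and on ln(y2).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 899
    df = pd.DataFrame({c: rng.normal(size=n) for c in ["x1", "x2", "x3", "x4"]})
    score = 50 + 5*df.x1 - 2*df.x1**2 + 3*df.x2 + df.x3 + df.x4 + rng.normal(scale=5, size=n)
    df["y2"] = score.rank(ascending=False)  # rank 1 = highest score, 899 = lowest

    formula = "{dv} ~ x1 + I(x1**2) + x2 + x3 + x4 + x1:x2 + x1:x3"
    fit_rank = smf.ols(formula.format(dv="y2"), data=df).fit()
    fit_log = smf.ols(formula.format(dv="np.log(y2)"), data=df).fit()

    print(fit_rank.rsquared, fit_log.rsquared)  # compare fit of the two specifications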

  #9 noetsi

    I don't think I have ever seen a "perfect" distribution of the residuals, whatever that is.

    Nor am I sure what a "slightly messed up" distribution of the residuals means...

    What is a robustness check? And why do it if the original model met the assumptions of the model?
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  #10 kiton

    Quote Originally Posted by noetsi View Post
    I don't think I have ever seen a "perfect" distribution of the residuals whatever that is

    Nor am I sure what a "slightly messed up" distribution of the residuals means...
    Attached is an example of a slightly messed up one )

    Quote Originally Posted by noetsi View Post
    What is a robustness check? And why do it if the original model met the assumptions of the model?
    Good question... I just want to double-check (and "double-prove") that my results are correct (robust).
    Attached Images: [plot of the "slightly messed up" residual distribution]

  #11 noetsi

    Usually "robust" is used to refer to how well a method works when you violate the assumptions of the method, not to whether your conclusions are correct.

    If the residuals were fine in the first model and the model you ran to make sure your conclusions are right has worse residuals, I don't think you will improve the validity of your analysis. [Personally, I suspect that no regression diagnostics are ever perfect, because you are always going to violate the assumptions to some extent with real data. The question then is really how robust the method is to the violation. Obviously you can correct for discovered problems through transformations, robust SEs, etc. But if the initial diagnostics are OK, correcting, especially if it produces worse diagnostics, does not seem the way to go to me.]

    But then I am hardly an expert in this.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995


  #12 spunky


    Quote Originally Posted by kiton View Post
    The model is:

    y1 = b0 + b1*x1 + b2*x1^2 + b3*x2 + b4*x3 + b5*x4 + b6*x1*x2 + b7*x1*x3 + e

    OLS estimation with the continuous DV (y1, the rating score 0-100, N=899) gives a perfect distribution of the residuals and passes the normality test. But OLS estimation with the DV that I am trying to use for the robustness check (i.e., y2, a newly generated RANK based on the rating score, 1-899, N=899) gives a "slightly messed up" distribution of the residuals. The significance of the estimated coefficients is consistent in both cases (y2 and ln(y2)), but the R^2's are different. So, I am trying to find relevant support for using ln(y2) in the robustness check.
    that's one heck of a model to be honest with you. two interaction terms and a quadratic effect? eeew! the interpretation will be complicated. but anyhoo... statistically speaking i still don't see any problems. yeah, the tails of the distribution of the residuals could be better... but your N is almost 1000!!! i mean that by itself puts you in a safe place when it comes to the distributional assumptions.

    and, as i mentioned, if you're using a different DV that happens to not look as linear as the original one, i really don't see any issue with sticking with it. just like noetsi, i'm also wary of this "robustness" word you're throwing around, because it seems you're just changing the original variable that you're modelling... in which case i don't think there should be any surprise that you're getting different results from the original analysis. but then again, if you're worried about using the natural log of the dependent variable and the normality of the residuals, i don't think that, in this case, there is anything to worry about.
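    A small simulation sketch of this large-N point (everything below, including the simple one-predictor model and the skewed errors, is an assumption chosen for illustration): with n near 900, the sampling distribution of an OLS slope is close to normal even when the residuals clearly are not.

    Code:
    # With n ~ 900, OLS slope estimates are near-normal despite skewed errors.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(5)
    n, reps = 899, 2000
    slopes = np.empty(reps)
    for i in range(reps):
        x = rng.normal(size=n)
        y = 1 + 0.5 * x + rng.exponential(scale=1.0, size=n)  # strongly skewed errors
        slopes[i] = sm.OLS(y, sm.add_constant(x)).fit().params[1]

    print(stats.skew(slopes), stats.kurtosis(slopes))  # both close to 0, i.e. near-normal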

