Regression when outcome is a rank?

zeloc

New Member
#1
Is there a particular kind of regression if the outcome is a ranking? I'm wondering if my best bet would be to use the central limit theorem and use a linear regression. I've been using SAS for my data analysis.

Thanks!
 

trinker

ggplot2orBust
#2
zeloc,

I see you're a pretty new member. Welcome aboard :) Well, you haven't given us much to go on. Since you're new, I suggest having a look at the Posting Guidelines (especially 5 and 6). This will help you get better responses.

You've limited us to regression, but if this isn't necessary perhaps a simple Spearman rank correlation would be useful (or Kendall's tau). But we don't know enough about your data yet to move forward.
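For reference, a Spearman rank correlation is just a Pearson correlation computed on ranks. The sketch below is a minimal pure-Python illustration, not SAS output; the variable names and the five data points are made up for the example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: an admissions test score and the final rank (1 = best).
test_score = [88, 92, 75, 60, 81]
final_rank = [2, 1, 4, 5, 3]
rho = spearman(test_score, final_rank)
```

Because rank 1 is the best candidate here, a strong predictor shows up as a strongly negative rho.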
 

zeloc

New Member
#4
Interesting, I see that there are possibilities besides regression which I didn't consider.

I have data on 100 subjects that I was planning on ranking 1-100 (graduates of an educational program). There are a variety of predictors including ordinal and continuous. I am considering building a predictive regression model to predict desirable candidates based on information that was available before admittance. It is necessary to rank all prospective candidates so I cannot change the outcome although if something else was produced I suppose I could convert the information to a ranking, but there is no real quantitative outcome that I could invent because there are a lot of intangibles. I didn't consider possibilities other than regression although I would be interested in knowing what they are, but at least conceptually I am still curious what is the best way to work on this from within regression. Linear seems like an option but since the data is obviously nonnormal I would have to invoke the central limit theorem. Thanks for your feedback, hope this clarifies.
 

trinker

ggplot2orBust
#5
I have data on 100 subjects that I was planning on ranking 1-100 (graduates of an educational program).
It is necessary to rank all prospective candidates so I cannot change the outcome although if something else was produced I suppose I could convert the information to a ranking, but there is no real quantitative outcome that I could invent because there are a lot of intangibles.
Where is the information coming from to rank these students? Are you going on intuition? Why generate one rank score? The model would be a bit bigger, but by including all these pieces of your DV you get more information about how each of the predictors relates to the outcome(s). Generalized linear models could work, including the possibility of logistic regression, and you'd lose less information about the relationships between predictors and outcomes. Are you set on generating one rank score?
 

zeloc

New Member
#6
Yes, unfortunately set on a ranking because we are part of a national association and must submit rank lists of all prospective candidates.

I have worked extensively (as in 10-hr workdays for several yrs) with the majority of individuals and have observed all aspects of their performance. There are some who are truly phenomenal and some that are okay so it will not be difficult to come up with a fairly good ranking although toward the middle and bottom there might be some more ties. I suppose another possibility would be to just identify the really stellar candidates (definitely fewer than 10 out of the 100), and if a model could do this then someone could manually go through however many are selected in the prediction and manually rank them.

Since it will be used as a predictive model I am not really interested in knowing the relationship between the various predictors and the outcome, rather the overall model is more important. If using a regression this will also be much more convenient because I don't have to worry about confounding, etc.

Logistic regression is an option, but no easier than doing a linear, and since the outcome is going to be a ranking, linear would be better. So if within the bounds of regression, I am still wondering if there is a procedure for a ranking to be the outcome or if linear regression invoking the central limit theorem is the best.
 

trinker

ggplot2orBust
#7
No, logistic is not any easier than linear, but a ranking is most likely not appropriate for linear. When I suggested GLMs, it was because they give you a lot of appropriate tools to choose from. Simple logistic regression is most likely not appropriate either, but ordered logistic regression would be (LINK).

May I ask why you want linear regression? Is it because of unfamiliarity with other techniques or lack of access to programs that could do the analysis? If this is the case we could point you to free resources for both direction and programs to run the test.
 

noetsi

Fortran must die
#8
If you have a hundred distinct ranks (that is, a hundred different levels the dependent variable can take on) then logistic regression is not a good idea. While it can be run (probably with ordered logistic regression), it would be incredibly difficult to analyze the results. It is commonly suggested that when you have an ordered variable with more than 5 distinct levels you treat it as interval-like (some suggest 9 levels is the minimum; advice differs).

But perhaps I misunderstood how the DV is going to be coded.
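To make the point about a hundred levels concrete: under one common parametrization of the ordered logit (proportional odds) model, an outcome with J ordered levels needs J - 1 intercepts. The symbols below are generic textbook notation, not anything taken from the poster's data.

```latex
\operatorname{logit} P(Y \le j \mid x) = \alpha_j - x^\top \beta,
\qquad j = 1, \dots, J - 1,
\qquad \alpha_1 \le \alpha_2 \le \dots \le \alpha_{J-1}
```

With J = 100 distinct ranks that is 99 thresholds, plus the slopes, to estimate from only 100 observations.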
 

spunky

Super Moderator
#9
If you have a hundred distinct ranks (that is a hundred different levels the dependent variable can take on) then logistic regression is not a good idea.
that is an excellent point... besides, the demands on the sample size would be tremendous if you need to distinguish among 100 ranks (i'm gonna say if you get half of the population of the PR of China you may be ok).

if i may, i would look at this problem in a different way. so you are ranking people... to rank people you need to have assessed their performance in some way. what are these performance assessments? are they tests like you'd do an exam and get a grade? reaction times, maybe (whoever finishes first gets the highest grade or something)? a more historical performance? i would work on the results before the rankings and, once i get a prediction on the performance, i'd rank accordingly.

ps- linear regression on ranks is not a great idea. for instance, what's gonna happen when people get predicted rankings that are either below 1 or above 100? it could well happen in linear regression. or if you get someone whose predicted ranking is 50.001 and another one is 50.0015... who gets 50 and who gets 51? i think a lot of things stop being meaningful and the ranking would rely more on your judgement than on your analysis... in which case why run the analysis in the first place, right?
 
#11
Thanks for all the responses, some answers to the questions:

May I ask why you want linear regression? Is it because of unfamiliarity with other techniques or lack of access to programs that could do the analysis? If this is the case we could point you to free resources for both direction and programs to run the test.
It's not that I want to do linear regression; my original question was what kind of regression to use. Since there are so many kinds (linear, quantile, logistic, Cox, negative binomial, Poisson, etc.), I thought maybe there was a type for rank as an outcome. In the absence of this it seemed to me that linear regression would be the closest fit to my outcome, and it seems that I could invoke the CLT, although I'm not convinced that a sample size of 100 is enough. I agree that ordinal logistic is not an option.

I'm not familiar with generalized linear models, if this would be appropriate I can look into it more.

if i may, i would look at this problem in a different way. so you are ranking people... to rank people you need to have assessed their performance in some way. what are these performance assessments? are they tests like you'd do an exam and get a grade? reaction times, maybe (whoever finishes first gets the highest grade or something)? a more historical performance? i would work on the results before the rankings and, once i get a prediction on the performance, i'd rank accordingly.

ps- linear regression on ranks is not a great idea. for instance, what's gonna happen when people get predicted rankings that are either below 1 or above 100? it could well happen in linear regression. or if you get someone whose predicted ranking is 50.001 and another one is 50.0015... who gets 50 and who gets 51? i think a lot of things stop being meaningful and the ranking
I don't quite understand your first point. The results have already been completed. I can't reduce it to a single factor (like taking a test or reaction time as you mention) because I guarantee that the ranking is not going to be in order of test scores, or education, or personal qualities, etc. It is a combination of factors that are being looked for. And the ranking I produce should be easy to reproduce: any of several others doing the ranking would arrive at much the same order.

To the second point, it doesn't matter if one is 50.001 and the other 50.0015; this can all be converted into a ranking. I don't think the results are going to rely more on judgment than on the regression. If the 2 candidates are that similar it's not going to be any easier for a person to make a decision, and especially considering there are thousands of applicants it would be much easier for a regression. Of course I could look at the r-squared or another measure to see whether the regression can figure the problem out.
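The conversion described here can be sketched as follows. This is only an illustration under made-up assumptions: a single hypothetical predictor, five training subjects, and ties broken by input order. It shows that predictions outside the 1-to-n range (or very close together) still sort into a well-defined ranking; it says nothing about whether the resulting ranking is any good.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope


def scores_to_ranks(scores):
    """Rank 1 goes to the lowest predicted score; ties keep input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


# Hypothetical training data: one predictor and the observed rank (1 = best).
train_x = [90, 80, 70, 60, 50]
train_rank = [1, 2, 3, 4, 5]
intercept, slope = fit_simple_ols(train_x, train_rank)

# Hypothetical new applicants. Some predictions fall below 1 or above 5,
# and two are nearly identical, yet sorting still yields a ranking.
new_x = [99, 62, 45, 61.5]
predicted = [intercept + slope * v for v in new_x]
new_ranks = scores_to_ranks(predicted)
```

Only the order of the predicted scores is used, so their raw values never have to be valid ranks.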

I still don't see why OLS would be ruled out, are you saying that the distribution is just too nonnormal for the CLT?

Is CART an option?

The other question is, if there is no optimal approach, what would be the best? Thanks for everyone's feedback.
 
#12
A thought just occurred to me. The outcome is probably going to be roughly normally distributed in the sense that there are a few phenomenal candidates, a few bad candidates, and most in-between. After I create a ranking of the outcome variable, can I convert it into a normal distribution and then run a regression? This should give me maximum power in a regression analysis. I can then do a linear regression and convert the data back into rankings.

How would I do this? It would seem that I would need to choose a mean and SD, the mean is arbitrary but how would I choose a SD?
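One standard way to do this is a rank-based normal scores transform: map each rank to the standard normal quantile of its approximate percentile. The sketch below uses Blom's constants (3/8 and 1/4), which are one common choice among several; in SAS, PROC RANK with the NORMAL=BLOM option computes the same kind of scores. On the SD question, the scores come out on a standard normal scale (mean 0, SD about 1), and the choice does not affect the final ranking, because rescaling the outcome by a positive constant rescales the OLS predictions the same way and leaves their order unchanged.

```python
from statistics import NormalDist


def blom_normal_scores(ranks):
    """Map ranks 1..n to normal scores via Blom's formula:
    score = inverse_normal_cdf((rank - 3/8) / (n + 1/4))."""
    n = len(ranks)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Ranks for 100 subjects, with no ties (an assumption for this example).
ranks = list(range(1, 101))
scores = blom_normal_scores(ranks)
```

Note that with rank 1 as the best candidate, the best candidates get the most negative scores; flip the sign if a higher score should mean a better candidate.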
 

noetsi

Fortran must die
#13
You should start by testing if it is (roughly) normally distributed. Look at QQ plots in SAS or run skewness and kurtosis analysis.
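As a rough illustration of what such a check would show for untransformed ranks, the sketch below computes moment-based skewness and excess kurtosis in plain Python. It uses the simple population formulas, so statistical packages that apply small-sample corrections will report slightly different numbers. Ranks 1 to 100 without ties are uniformly spread, so the skewness is zero but the excess kurtosis is about -1.2 (a normal distribution has 0).

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Ranks 1..100 with no ties (an assumption for this example).
ranks = list(range(1, 101))
skew, excess_kurt = skewness_and_excess_kurtosis(ranks)
```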
 
#14
Thanks for the message. It seems to me that the outcome of a ranking is incorrect. The only reason I was going to do a ranking was because it would be easy to do so, but certainly it doesn't fit the data because there isn't an equal amount of distance between each outcome. So rather than run a regression and then check for normality in the residuals, I think I will do a transformation first and then run a linear regression. I found the following page:

http://www.psych.cornell.edu/Darlington/transfrm.htm#median

Does anyone know how I would apply one of these procedures in SAS?
 

noetsi

Fortran must die
#15
What data are you comparing your ranking to in the above statement? To the independent variables, or to some other measure(s) of the dependent variable that you created the rankings from? If you have a set of measures of the dependent variable, then the first thing you should consider is whether you can use these to create an interval dependent variable. If you can, this is the best solution. For example, you can often collapse separate Likert scales (which are not interval) into one variable that is effectively interval. Then you could run OLS.
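A minimal sketch of the collapsing step, assuming each subject has been rated on several Likert items that all point in the same direction. The item values are invented for illustration, and averaging is only one simple way to form a composite.

```python
def composite_scores(item_responses):
    """Average each subject's Likert item responses into one composite score."""
    return [sum(items) / len(items) for items in item_responses]


# Hypothetical ratings: three subjects, four 1-5 Likert items each.
responses = [
    [5, 4, 5, 4],
    [3, 3, 2, 4],
    [1, 2, 2, 1],
]
composites = composite_scores(responses)
```

The composite can then serve as the dependent variable in an ordinary regression, with any ranking derived afterwards from the predicted composites.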
 
#16
The ranking is the outcome variable (DV) and would be compared to the independent variables. The outcome was not created from the independent variables in any sense.

In order to get more power, I think the simplest would be to just convert the rankings into a normal distribution and run the regression on this. The link I posted above sounds like it describes exactly what I want to do but I'm not sure how to do this in SAS.