I have a related question. I am testing a personal theory for my agency: that the median wage in a given county determines how high an income you earn when we place you (we are a state agency that finds people jobs in the various counties).
I have an ordinal scale that ranks the counties from highest to lowest median income (there are 67 in my state). I am trying to decide how to use that measure. One possibility is to just use the 67 values, but the regression will treat this as an interval measure when it is not (here it is clearly ordinal, unlike in my other question). Another is to build a dummy or a set of dummies, but I have never seen a discussion of how to build dummies in this case. Do I split at the bottom half and top half of counties (one dummy)? Top ten percent versus bottom ninety percent?...
Nothing in the literature I have seen addresses how you should split the data if you build dummies, not in Vocational Administration (or in anything else I have read, actually).
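For what it's worth, here is a minimal pandas sketch of the encodings under discussion. Everything here is hypothetical (simulated ranks, made-up column names); it just shows the mechanics of rank-as-numeric, a single median-split dummy, and a set of quartile dummies:

```python
# Three ways to encode a county's ordinal income rank
# (1 = highest median income ... 67 = lowest) for regression.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical placements: one county rank per placed client.
df = pd.DataFrame({"county_rank": rng.integers(1, 68, size=500)})

# Option 1: treat the rank as numeric (regression assumes equal spacing).
df["rank_numeric"] = df["county_rank"]

# Option 2: one dummy -- top half vs. bottom half of counties.
df["top_half"] = (df["county_rank"] <= 33).astype(int)

# Option 3: a set of dummies from quartiles of the rank
# (drop the first quartile as the baseline category).
df["rank_quartile"] = pd.qcut(df["county_rank"], q=4,
                              labels=["Q1", "Q2", "Q3", "Q4"])
dummies = pd.get_dummies(df["rank_quartile"], prefix="rank", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```

Option 3 is usually the safer compromise when you suspect the rank-to-wage relationship is non-linear: the quartile (or decile) dummies let each group have its own effect without imposing equal spacing between ranks.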
You can still analyze it as continuous data, but it may show up as "chunky" on normality or residual plots. This can throw off the p-values in a formal normality test even though the points fall along a straight line on the QQ plot. It may violate all sorts of assumptions and prevent you from publishing in a journal, but I have built many perfectly useful models with such data.
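A toy illustration of that point (simulated data, nothing to do with the wage data): values drawn from a normal distribution but coarsened to a handful of levels still trace a straight line on a QQ plot, yet a Shapiro-Wilk test rejects normality purely because of the chunkiness:

```python
# Demonstrate that discretizing ("chunking") normal data
# tanks the p-value of a formal normality test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
smooth = rng.normal(loc=0.0, scale=1.0, size=500)
chunky = np.round(smooth)  # collapse ~500 values to a handful of levels

p_smooth = stats.shapiro(smooth).pvalue
p_chunky = stats.shapiro(chunky).pvalue
print(f"distinct levels after rounding: {np.unique(chunky).size}")
print(f"Shapiro p, smooth: {p_smooth:.3f}  chunky: {p_chunky:.2e}")
```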
I thought that to be continuous, data had to actually take on a certain number of values, not just theoretically allow an infinite number of values. So if you only had, say, 40 distinct levels, the data could not be continuous. Which now sounds like a bad assumption on my part.
Miner, this is not for publication. It is for work. Thanks for your comments about building useful models with this. I have 20,000 or so data points, so I doubt it will have a huge impact on the p-values. Technically I have the whole population, not a sample (although one could argue it is a subsample of what could occur in the future). It is doubtful that p-values even apply in this analysis, although I will use White standard errors in any case, given your comments. I will look at the residuals and run other tests for violations of the assumptions.
But with so much data, non-linearity is generally the only assumption I really worry about.