Suppose you're a meteorologist and want to predict whether tomorrow the temperature will exceed 20 degrees. Your boss gives you data [t, x] where t is temperature and x is a matrix of relevant temperature-predicting variables.
One approach would be to simply use OLS to predict t. If the predicted t is above 20, then you at least know the model puts the chance of exceeding 20 tomorrow at 50% or more. Post-estimation you could probably construct a prediction interval around your predicted t value and turn that into a probability estimate.
Another approach is to generate a variable g which is 0 if t < 20 and 1 if t > 20, then run a probit of g on x. That will directly give you the probability estimate you're looking for.
While the second approach seems easiest, it feels a bit clumsy to me and I wonder how reliable the estimates are.
Any thoughts on this?
Thanks!
aboluk
The first way seems clumsy to me too. Why not just calculate the probability the temperature exceeds 20 degrees under the assumption of normally distributed errors? Why only use the point estimate?
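To make that suggestion concrete, here is a minimal sketch of turning an OLS fit into a probability that t exceeds a threshold, assuming normally distributed errors with constant variance. The data, the single predictor, and the helper names `fit_simple_ols` and `prob_exceeds` are made up for illustration; the calculation also ignores uncertainty in the estimated coefficients, so treat it as a rough approximation.

```python
from statistics import NormalDist, mean


def fit_simple_ols(x, t):
    """Least-squares fit of t = a + b*x + e; returns (a, b, residual std dev)."""
    n = len(x)
    x_bar, t_bar = mean(x), mean(t)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    b = sxy / sxx
    a = t_bar - b * x_bar
    residuals = [ti - (a + b * xi) for xi, ti in zip(x, t)]
    # n - 2 degrees of freedom: two estimated coefficients
    sigma = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return a, b, sigma


def prob_exceeds(a, b, sigma, x_new, threshold=20.0):
    """P(t > threshold | x_new) under normal errors (coefficient uncertainty ignored)."""
    t_hat = a + b * x_new
    return 1.0 - NormalDist(mu=t_hat, sigma=sigma).cdf(threshold)


# Made-up example data: one predictor x and observed temperatures t.
x = [1, 2, 3, 4, 5, 6]
t = [14.1, 15.8, 18.2, 19.9, 22.1, 23.9]

a, b, sigma = fit_simple_ols(x, t)
print("P(t > 20 | x = 4):", prob_exceeds(a, b, sigma, x_new=4.0))
```

When the predicted value equals the threshold exactly, this gives 50%, which is all the point estimate alone can tell you. The probability moves away from 50% as the prediction moves away from the threshold, at a rate set by the residual spread.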
I don't have emotions and sometimes that makes me very sad.
I forgot to mention that my data is somewhat right skewed, which is why I am leaning towards probit.
... how exactly does having skew lean you toward probit?
Because I don't care how far above the threshold the temperature is, I just want to know if it's above it.
When I use OLS I am getting estimates that are too high.
I have one explanatory variable "x1" that is by far the main determinant of temperature. My OLS predictions are consistently higher than the median temperature when I break it down by x1.
For example, here are some raw numbers from the data
x1, threshold, % above threshold for given x1, OLS prediction
8, 27, 45%, 27.4
9, 28, 46%, 28.3
10, 29, 45%, 29.6
11, 30, 46%, 30.3
So the OLS predictions are saying that, given x1=8, the temperature is more likely than not to exceed 27, but simply looking at the raw data, for x1=8 the temperature only exceeds 27 about 45% of the time. As you can see, I am not very confident in OLS here.
Some points that people might want to know:
1. Sample sizes for each x1 are over 2000 so sample size is not an issue
2. The other x's are basically insignificant so there isn't some x2 that's pushing the OLS predictions up so far
Maybe saying the data is right skewed isn't quite accurate -- I have some extreme outliers on the upper end of the temperature scale that don't exist on the lower end.
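One possible reading of the numbers above: OLS estimates the conditional mean, and with extreme values only on the upper end the mean sits above the median, so fewer than half of the observations exceed the OLS prediction. The sketch below illustrates this with simulated data; the distribution (symmetric noise plus occasional large upward outliers) is an assumption chosen for illustration, not the poster's actual data.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the illustration is reproducible


def simulated_temp():
    """Hypothetical temperature: symmetric noise plus rare large upward outliers."""
    base = 26.5 + random.gauss(0, 1)
    # About 10% of days get an extra positive shock; none get a negative one.
    outlier = random.expovariate(0.25) if random.random() < 0.1 else 0.0
    return base + outlier


temps = [simulated_temp() for _ in range(20000)]

m = mean(temps)      # what an OLS fit within a single x1 cell would target
med = median(temps)  # the value exceeded exactly half the time
share_above_mean = sum(temp > m for temp in temps) / len(temps)

print("mean:", round(m, 2))
print("median:", round(med, 2))
print("share of observations above the mean:", round(share_above_mean, 3))
```

If that is what is going on, the OLS predictions are not "too high" as estimates of the mean; they are just answering a different question from "which value is exceeded 50% of the time".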
Can you provide a histogram for the response variable given x1=8 or something like that? It might make sense to use a generalized linear model (which probit regression is a special case of) to model the actual response variable - then using that you could calculate probabilities.
But if you don't like OLS then I'm not sure probit is the best route either. It might work, but there is a certain interpretation of probit regression that leads me to believe it isn't appropriate here. One way to think of probit regression is that there is some latent variable that, conditional on the predictors, follows a normal distribution. We don't get to see that variable; we only get to see whether or not it exceeds a certain cutoff. This interpretation fits pretty darn well with how you would actually go about fitting the probit model here. OLS doesn't strictly rely on normally distributed error terms, but we get the same estimates if we do assume normal residuals, and that is exactly the latent model probit assumes. So given your concerns about OLS, I'd say probit isn't the route you want to go.
But like I said a generalized linear model might make sense using some other response distribution.
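The latent-variable reading of probit described above can be checked numerically. If t given x is normal with mean a + b*x and standard deviation sigma, then the probability that t exceeds a cutoff c is a probit curve in x with intercept (a - c)/sigma and slope b/sigma. The parameter values and function names below are hypothetical, chosen only to show the two calculations agree.

```python
from statistics import NormalDist


def probit_prob(x, intercept, slope):
    """Probit model: P(g = 1 | x) = Phi(intercept + slope * x)."""
    return NormalDist().cdf(intercept + slope * x)


def normal_linear_prob(x, a, b, sigma, cutoff):
    """P(t > cutoff | x) when t | x ~ Normal(a + b*x, sigma^2)."""
    return 1.0 - NormalDist(mu=a + b * x, sigma=sigma).cdf(cutoff)


# Hypothetical parameters of the latent (here: observed) temperature model.
a, b, sigma, cutoff = 12.0, 2.0, 1.5, 20.0

# Probit coefficients implied by thresholding that model at the cutoff.
intercept = (a - cutoff) / sigma
slope = b / sigma

for x in (3.0, 4.0, 5.0):
    p_probit = probit_prob(x, intercept, slope)
    p_linear = normal_linear_prob(x, a, b, sigma, cutoff)
    print(x, round(p_probit, 6), round(p_linear, 6))
```

The two columns match, which is the sense in which probit on the 0/1 indicator and OLS with normal errors describe the same underlying model. Skewness or one-sided outliers in t would violate the assumption behind both.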
aboluk (07-04-2012)
Thank you guys for your replies