Comparing fit of GLM to OLS regression

#1
Hello talkstat.

First post in this magnificent forum!

In my master's thesis I'm estimating health care expenses for patients who have experienced an occupational injury, using administrative data.

The dependent variable is total health care expenses for a given individual in the year after the injury.

Because the dependent variable is typically right-skewed, I have estimated two models. The first is a GLM with a log link and a gamma distribution for the dependent variable. The mean expenses are estimated at around DKK 16,000 (approximately $2,200) for the treatment group (people who have experienced an occupational injury) and around DKK 5,000 for the control group.
As an alternative to the GLM, I have also estimated the expenses using an OLS regression with a log-transformed dependent variable. After applying a Duan smearing factor and "retransforming" the estimate, I obtain predicted expenses of DKK 15,000 and DKK 4,000 for the treated and control groups, respectively.
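For readers unfamiliar with the Duan smearing step mentioned above, here is a minimal sketch. It assumes the log-scale fitted values are already available; the function name and the three expense figures are hypothetical and only illustrate the arithmetic.

```python
import math

def duan_smearing_predictions(log_y, fitted_log):
    """Retransform log-scale OLS fitted values to the original scale.

    The smearing factor is the mean of the exponentiated log-scale residuals.
    """
    residuals = [ly - f for ly, f in zip(log_y, fitted_log)]
    smear = sum(math.exp(r) for r in residuals) / len(residuals)
    predictions = [math.exp(f) * smear for f in fitted_log]
    return smear, predictions

# Hypothetical expenses (DKK) for one group. With an intercept-only log model,
# every fitted value is simply the mean of the logged expenses.
expenses = [1000.0, 4000.0, 16000.0]
log_y = [math.log(v) for v in expenses]
mean_log = sum(log_y) / len(log_y)

smear, preds = duan_smearing_predictions(log_y, [mean_log] * len(log_y))
print(smear, preds[0])
```

In this intercept-only case the smeared prediction reproduces the arithmetic mean of the raw expenses, which is the bias that a naive exp(fitted value) would miss.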

My question is: how can I compare the fit of the models and choose the "best" one, or at least the better one? I'm interested in both graphical methods and formal tests.

Steinberg
 

kiton

New Member
#2
There are several ways to explore the model fit. Easiest would be to compare the AIC and BIC -- smaller values indicate a "better" fit. Next, you can examine the R^2 -- higher values are desirable. Finally, you can predict the residuals and run a Q-Q plot to compare their distribution (you can top it with some formal test, say, Jarque-Bera and see which model's residuals have a smaller chi-square statistic). Additionally, I'd consider comparing the standard errors to see which model provides more efficient ones.
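As a rough illustration of the Jarque-Bera statistic mentioned above, the sketch below computes it from a list of residuals using the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. The function name and the toy residuals are hypothetical. Note that the statistic measures departure from normality, which is the assumption behind the log-scale OLS; raw residuals from a gamma model are not expected to be normal.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic based on sample skewness and kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments with the 1/n convention
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

# Hypothetical residuals, symmetric around zero
jb = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb)
```

Under normality the statistic is approximately chi-squared with 2 degrees of freedom in large samples, so larger values point to a stronger departure from normality.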

On a side note, have you checked whether your models satisfy the required assumptions? I am asking because if, say, you have not met the OLS assumptions (at minimum: normality of residuals, no heteroskedasticity and no multicollinearity), then what is the purpose of comparing its estimates with those of other estimators?
 
#6
All right.

So I'm left with AIC?

Yet I can't just pick a model by looking only at the AIC, right? Using the "relative likelihood", i.e. the exponential of half the AIC difference between the two "best" (lowest-AIC) models, I get a really (I mean REALLY) low value.
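A small sketch of the relative likelihood calculation, exp((AIC_best - AIC_candidate) / 2), may help here. It also includes a helper for one caveat worth checking: an AIC computed for a model of log(y) is not on the same scale as an AIC computed for a model of y. Adding 2 * sum(log(y_i)), the Jacobian term of the log transformation, puts the log-model AIC on the original scale. Both function names and all numbers are hypothetical.

```python
import math

def relative_likelihood(aic_candidate, aic_best):
    """Relative likelihood of a candidate model versus the lowest-AIC model."""
    return math.exp((aic_best - aic_candidate) / 2.0)

def aic_on_original_scale(aic_log_model, y):
    """Shift the AIC of a model fitted to log(y) onto the scale of y.

    The density of y is the density of log(y) divided by y, so the
    log-likelihood drops by sum(log(y)) and the AIC rises by twice that.
    """
    return aic_log_model + 2.0 * sum(math.log(v) for v in y)

# Hypothetical AIC values: a 2-point gap gives exp(-1)
print(relative_likelihood(102.0, 100.0))

# Hypothetical outcomes and log-scale AIC
print(aic_on_original_scale(10.0, [1.0, math.e]))
```

A very small relative likelihood can simply reflect a large sample, since AIC differences grow with the number of observations.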

What about the assumption of the variance being equal to the squared mean? When I square the predicted mean, I don't even get close to the variance. Is that an argument against the GLM?
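On the variance question: the gamma family implies Var(y|x) = phi * mu^2, that is, a variance proportional to the squared mean with a dispersion factor phi, rather than strictly equal to it. One way to probe this is a Park-type check: regress the log of the squared raw residuals on the log of the predicted mean and look at the slope. A slope near 2 is consistent with a gamma-type variance, near 1 with a Poisson-type variance, and near 0 with a constant variance. The sketch below is only an illustration under stated assumptions: the data are simulated, the true means stand in for model predictions, and the function name is hypothetical.

```python
import math
import random

def park_slope(y, mu):
    """OLS slope of log((y - mu)^2) on log(mu); zero residuals are skipped."""
    pairs = [(math.log(m), math.log((yi - m) ** 2))
             for yi, m in zip(y, mu) if yi != m]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_z = sum(z for _, z in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in pairs)
    return sxz / sxx

# Simulated gamma outcomes (assumption: shape 2, so phi = 0.5)
random.seed(0)
shape = 2.0
mu = [math.exp(random.uniform(math.log(1000.0), math.log(50000.0)))
      for _ in range(5000)]
y = [random.gammavariate(shape, m / shape) for m in mu]

slope = park_slope(y, mu)
print(round(slope, 2))
```

With real data, the predicted means from the fitted GLM would replace `mu`; a slope far from 2 would suggest trying a different variance function rather than abandoning the GLM framework altogether.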
 

kiton

New Member
#7
I may be missing something here with the distributions' nuances, but let me elaborate on the following. So, I estimated two simple models using (A) Gaussian family and identity link (default), and (B) Gamma family and log-link -- while the estimated coefficients differ substantially, the residual distribution seems to be identical. The JB test results capture only a minor difference in the chi-squared statistic between the two. As such, is it not plausible to approximate the model fit of a GLM with gamma-log with its residual distribution?
 
#8
But I haven't used a Gaussian regression; I've used an ordinary least squares regression. From what I understand, the reason there is no R-squared statistic for a GLM is that the model may be non-linear.
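One way to put both models on a common footing, sketched below, is to score their predictions on the original expense scale: the squared correlation between observed and predicted expenses (a raw-scale analogue of R-squared) and the root mean squared error. The same function can be applied to the GLM predictions and to the smeared OLS predictions. The function name and the numbers are hypothetical.

```python
import math

def raw_scale_fit(y, pred):
    """Squared correlation and RMSE between observed and predicted values."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(pred) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, pred))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in pred)
    r_squared = cov ** 2 / (var_y * var_p)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / n)
    return r_squared, rmse

# Hypothetical observed and predicted values
r2, rmse = raw_scale_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(r2, rmse)
```

The toy example shows why both numbers are worth reporting: the predictions are perfectly correlated with the outcomes yet systematically too high, which only the RMSE reveals.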

What about the assumption that the variance is equal to the predicted value squared? What if it does not hold?
 
#9
Shouldn't something be said about the assumption that the variance is equal to the predicted value squared? Consider the case where it doesn't hold.