# Thread: GLM: Assumptions and choosing a model

1. ## GLM: Assumptions and choosing a model

Hello all, I have a question that I would be interested in hearing others' opinions/feedback on.

I have monthly cost data from 2006 to 2011 (72 total months), and my ultimate goal is to predict monthly costs for 2012. I was originally going to try some sort of time series method, but since I'm having difficulty fully understanding that topic, I'm going to pursue an alternative route via GLM.

Preferably, I would simply use COST as my variable of interest. However, I'm not sure I can argue that the QQ-plot is truly linear (it's not horrible, but it's not as linear as I would normally like), and when I looked at a plot of the residuals vs. predicted values, there was definitely some kind of fan-shaped pattern. Also, the Shapiro-Wilk test on the residuals came back with a p-value < 0.0001. Ultimately, it would seem that I should perform some type of transformation.

Next, I did a log-transformation of the cost data. The QQ-plot is much more linear, the plot of the residuals vs. predicted values still has a slight fan shape but is considerably better, and the Shapiro-Wilk test yields a p-value = 0.8546. This would seem to suggest that building my model on log(cost) is the best option, and I can just "un-log" the estimated amounts to get estimated costs.
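One note on the "un-log" step: if the log-scale errors are roughly normal, the mean on the original cost scale is exp(mu + sigma^2/2), not exp(mu), so naively exponentiating the predictions systematically understates mean cost. A minimal sketch with made-up numbers (all values are illustrative, not from the actual model):

```python
import math

# Hypothetical predictions on the log scale and the residual variance
# (sigma2) from a log(cost) regression; illustrative numbers only.
log_preds = [10.2, 10.5, 10.1]   # predicted log(cost) for three future months
sigma2 = 0.04                    # estimated variance of the log-scale residuals

# Naive back-transform: exp(mu) underestimates the mean on the cost scale
naive = [math.exp(p) for p in log_preds]

# Lognormal correction: E[cost] = exp(mu + sigma^2 / 2)
corrected = [math.exp(p + sigma2 / 2) for p in log_preds]

for n, c in zip(naive, corrected):
    print(f"naive: {n:,.0f}   corrected: {c:,.0f}")
```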

However, as I stated, my ultimate goal is to forecast for future dates. Although the log(cost) model does a better job of modeling the actual data, it yields predicted costs that are extremely unlikely and, I would guess, impossible. If I proceed with the original (untransformed) cost data, the predicted amounts are far more plausible.

My question is this: although all signs point to the log(cost) model - i.e., checking assumptions indicates I should use the transformed model - does anyone think it unreasonable to simply use the original data rather than the log(cost) data?

I mean, in school I would likely have used the transformed data, since that's what profs/textbooks always said. But in real life things are not necessarily so black and white, and I'm curious whether anyone has an opinion on whether it's fair to simply "ignore" the assumptions in favor of the model that will likely produce a more accurate cost forecast.

Thanks and sorry for the long question!

2. ## Re: GLM: Assumptions and choosing a model

QQ plots don't get at linearity; they assess normality. Most commonly you use scatterplots to see if a relationship is linear, although in practice only bivariate linearity is usually checked, rarely multivariate linearity (if that even exists). It is common to log-transform financial data to make it more normal (and to address heteroskedasticity).

I think you are confusing normality with linearity.

I think predicting future values from existing data outside of time series models is pretty dangerous, because you violate the assumptions. It is especially dangerous if there is a trend in the data, that is, if the relationship of cost to whatever you are predicting it with is changing over time.

You might want to at least look at exponential smoothing which is pretty straightforward (but assumes past patterns continue).
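For concreteness, simple exponential smoothing boils down to a one-line recursion. A bare-bones sketch (the data and the smoothing weight alpha here are illustrative):

```python
# Minimal sketch of simple exponential smoothing; `costs` stands in
# for the 72 monthly values, and alpha is a hypothetical smoothing weight.
costs = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]
alpha = 0.3  # closer to 1 = weight recent months more heavily

level = costs[0]                 # initialize with the first observation
for y in costs[1:]:
    level = alpha * y + (1 - alpha) * level

# With simple smoothing, the forecast for every future month is the last level
forecast = level
print(f"one-step-ahead forecast: {forecast:.2f}")
```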

3. ## Re: GLM: Assumptions and choosing a model

Of course you can “simply "ignore" the assumptions”, but then you can also “ignore” the possibility to infer something from the model.

If everything points to a logged model, in terms of good fit and so on then it is reasonable to use that one.

[How do you choose shoes, by the way? Those with good fit or those with bad fit?]

If the predictions based on logged values give unrealistic results, then I would guess that you have made some mistake in the calculations.

(I don’t agree exactly with noetsi above.)


5. ## Re: GLM: Assumptions and choosing a model

Do you think they are talking about linearity, or do you disagree on another point?

6. ## Re: GLM: Assumptions and choosing a model

Originally Posted by GretaGarbo
Of course you can “simply "ignore" the assumptions”, but then you can also “ignore” the possibility to infer something from the model.
You are the champion of great soundbites, Greta.

7. ## Re: GLM: Assumptions and choosing a model

I think that if the values in a QQ-plot fall on a line, then the variable is approximately normally distributed.

I think that a predictive model, especially a time series model, is built for exactly that purpose: predicting values outside the estimation region, often into the future. (And to predict is, of course, not to violate assumptions.)

8. ## Re: GLM: Assumptions and choosing a model

I know QQ plots don't test linearity; they test normality. I just meant that the QQ plot should be linear if it's safe to assume normality.

I have tried to teach myself time series and have posted many questions on here, but I am still uncomfortable proceeding. I would not be able to explain the time series model to anybody, and I don't fully understand how to choose p or q for an AR(), MA() or ARIMA() model. I have no real way to know whether I'm applying any time series methods correctly.

That's why I proceeded with a GLM. I guess my question is this:

If, after checking assumptions, things are a little "iffy" and could really go either way as to whether the assumptions are met, what have others done at that point? Play it really safe and assume the assumptions fail, or be a little more liberal and assume they have been met?

9. ## Re: GLM: Assumptions and choosing a model

The advantage of time series is that no one understands it, so you really don't have to explain it.

When the assumptions fail, you then have to decide how robust the method is. There is no agreement on that at all; different authors take different positions, and many of the comments are vague (what, specifically, is a "moderate" violation of an assumption? how many cases, specifically, do you need for a "large" sample size?). I have never come across specific guidelines that detail, for example, given a particular pattern in the residuals that violates the assumptions, how serious it is and what impact it will have on the results. And I have spent a lot of time looking for diagnostics... In honesty, I don't think there is any agreement on this type of issue.

Time series is worse, however you analyze it, because there is no way to know whether the patterns that exist now will continue. I recently spoke to several economists who said anything beyond five years into the future is totally worthless. I tend not to worry overly much about most of these assumptions, because in practice there is little I can do about them. I just let people know they exist.

10. ## Re: GLM: Assumptions and choosing a model

Yes, I have heard similar things about how pointless it is to try to predict too far in advance (5 years or more). I'm only trying to predict costs for the next 12 months, so I'm not too concerned about any issues there.

Regarding the time series, I don't even feel comfortable enough to say one way or the other what parameters to use for an AR(), MA() or ARIMA() model, so I have little to no clue where to even start building a model.

These two facts together (my lack of knowledge about time series and the fact that I'm only projecting a short time into the future) made me think of using a GLM, as I am much more familiar with this technique and can actually explain/interpret the model and results.

11. ## Re: GLM: Assumptions and choosing a model

That is certainly fine as long as you know that a central assumption of GLM, no autocorrelation, is likely violated. How badly it is violated, and how much that will influence the results, is a question with no certain answer that I have ever seen (which is what has been discussed in this thread). At a minimum I would run a Durbin-Watson statistic and see how serious the problem is. Also, exponential smoothing does not require you to estimate ARIMA parameters or understand time series theory. It is an essentially atheoretical approach to estimating time series data; as far as I know it makes very few assumptions.

You might want to look at that, as an alternative to GLM if you do find serious violations of the assumptions.

12. ## Re: GLM: Assumptions and choosing a model

I found the residuals then found the lag(residuals) in SAS.

I calculated the numerator as the sum squared difference in residuals at times t and t-1. I calculated the denominator as the sum squared residuals.

I obtained a test statistic of 2.29. I have n = 72 time points and k = 4 variables. The interval provided in a DW table was (1.54, 1.71). What I read was that since my test statistic is higher than the upper bound, I can simply conclude that autocorrelation = 0.
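The same calculation can be sketched in a few lines of Python (hypothetical residuals stand in for the SAS output):

```python
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals, divided by the sum of squared residuals. Made-up residuals.
residuals = [1.2, -0.8, 0.5, -1.1, 0.9, -0.4, 0.2, -0.6]

num = sum((residuals[t] - residuals[t - 1]) ** 2
          for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den

# DW is roughly 2 * (1 - r1), where r1 is the lag-1 autocorrelation,
# so values near 2 indicate little first-order autocorrelation.
print(f"DW = {dw:.2f}")
```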

13. ## Re: GLM: Assumptions and choosing a model

Well, you can conclude that first-order autocorrelation is not significant (not necessarily 0). That is the only type of autocorrelation Durbin-Watson tests for, but that is normally enough.

14. ## Re: GLM: Assumptions and choosing a model

Originally Posted by lancearmstrong1313
a DW table was (1.54, 1.71)
This points to a serious overall misspecification of the model, and also to an autocorrelation of roughly -0.15 (since DW is approximately 2(1 - r)).

Maybe you could try a Holt-Winters model with smoothing of the trend, seasonal and irregular components.

@noetsi, there are more advanced variants of the smoothing models, Andrew Harvey's models with the Kalman filter for example.
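To make the smoothing idea concrete, here is the trend part of Holt-Winters (Holt's linear method, leaving out the seasonal term, which needs a known period). All numbers are illustrative, not from the actual cost data:

```python
# Sketch of Holt's linear-trend smoothing (Holt-Winters without the
# seasonal component). Data and smoothing weights are hypothetical.
costs = [100.0, 104.0, 107.0, 111.0, 116.0, 119.0]
alpha, beta = 0.5, 0.3   # level and trend smoothing weights

level, trend = costs[0], costs[1] - costs[0]   # simple initialization
for y in costs[2:]:
    prev_level = level
    level = alpha * y + (1 - alpha) * (level + trend)
    trend = beta * (level - prev_level) + (1 - beta) * trend

# The h-step-ahead forecast extrapolates the last level and trend
h = 12
forecast = level + h * trend
print(f"forecast {h} months ahead: {forecast:.1f}")
```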

15. ## Re: GLM: Assumptions and choosing a model

Originally Posted by lancearmstrong1313
I know QQ plots don't test linearity they test normality - I just meant the QQ plot should be linear if it's safe to assume normality.
Could you explain why you think this is the case? (I'm not saying you're wrong, I just don't follow this)

I don't fully understand how exactly to choose p or q for an AR(), MA() or ARIMA() model.
If you're feeling particularly empiricist, you could also use the auto.arima function in the forecast package in R to select the "best" model per the AIC or BIC.

No significant first-order autocorrelation ≠ no first-order autocorrelation.
No first-order autocorrelation ≠ no autocorrelation.

Maybe something like the Ljung-Box test would help as a more omnibus test for the presence of autocorrelation up to a limited number of lags - or, maybe even better, look carefully at an ACF and PACF.
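For reference, the Ljung-Box Q statistic is straightforward to compute by hand. A sketch with made-up residuals (the 18.31 cutoff is the 95% chi-square point for 10 degrees of freedom; in practice you would reduce the df by the number of fitted ARMA parameters):

```python
# Ljung-Box Q statistic for autocorrelation up to m lags, on made-up
# residuals; illustrative only.
residuals = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.3, -0.4, 0.1,
             0.2, -0.2, 0.3, -0.1, 0.1, -0.3, 0.4, -0.2, 0.1, 0.2]
n = len(residuals)
mean = sum(residuals) / n
c0 = sum((e - mean) ** 2 for e in residuals)

def acf(k):
    # sample autocorrelation at lag k
    return sum((residuals[t] - mean) * (residuals[t - k] - mean)
               for t in range(k, n)) / c0

m = 10  # test the first 10 lags jointly
q = n * (n + 2) * sum(acf(k) ** 2 / (n - k) for k in range(1, m + 1))
print(f"Q = {q:.2f}; compare with chi-square critical value 18.31 (df = 10)")
```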

16. ## Re: GLM: Assumptions and choosing a model

Originally Posted by CowboyBear
Could you explain why you think this is the case? (I'm not saying you're wrong, I just don't follow this).
One of the assumptions of regression-type analyses is that the errors are normally distributed. If the QQ plot of the residuals is linear (i.e., a diagonal line), then you can assume normality. If it looks like a hockey stick (or some other curved shape), then perhaps the data came from a different distribution (like an exponential).
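For what it's worth, what a QQ plot computes can be written out directly: sort the sample and pair it with theoretical normal quantiles; if the pairs are highly correlated, the plot is close to a straight line. A sketch with illustrative data (Python standard library only):

```python
from statistics import NormalDist, fmean

# Illustrative sample; sorted values are the sample quantiles
sample = sorted([2.1, 1.8, 2.5, 1.9, 2.2, 2.0, 2.4, 1.7, 2.3, 2.0])
n = len(sample)

# Theoretical normal quantiles at plotting positions (i + 0.5) / n
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Quick linearity check: correlation between the two quantile sets;
# values very close to 1 support the normality assumption.
mx, my = fmean(theoretical), fmean(sample)
sxy = sum((x - mx) * (y - my) for x, y in zip(theoretical, sample))
sxx = sum((x - mx) ** 2 for x in theoretical)
syy = sum((y - my) ** 2 for y in sample)
r = sxy / (sxx * syy) ** 0.5
print(f"quantile correlation r = {r:.4f}")
```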

One of the problems I'm having is that I have found many different ways to read and interpret ACF/PACF charts. Also, I'm not sure whether seasonality exists in my data (there is no obvious pattern that would suggest it), and I'm still not sure of a formal way to test for this - so I'm not sure whether Holt-Winters would be justifiable (isn't it for seasonal data?).