Hello all, I have a question on which I would be interested in hearing others' opinions and feedback.

I have monthly cost data from 2006 to 2011 (72 months in total), and my ultimate goal is to predict monthly costs for 2012. I was originally going to use some sort of time series method, but since I'm having difficulty fully understanding that topic, I'm going to pursue an alternative route via a GLM.

Preferably, I would simply use COST as my variable of interest. However, I'm not sure I can argue that the QQ-plot is truly linear (it's not horrible, but it's not as linear as I would normally like), and a plot of the residuals vs. predicted values shows a definite fan-shaped pattern. Also, the Shapiro-Wilk test of the residuals came back with a p-value < 0.0001. All of this suggests that I should apply some type of transformation.
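To make the fan-shape check concrete, here is a rough sketch using only the Python standard library. The series is synthetic (the real cost data isn't shown here), the straight-line trend model and the names `fit_line` and `spread_ratio` are assumptions for illustration, and comparing residual spread between the lower and upper halves of the fitted values is only a crude stand-in for eyeballing the residuals vs. predicted plot.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic stand-in for 72 months of costs (NOT the real data):
# multiplicative noise, so the spread grows with the level.
random.seed(0)
months = list(range(72))
cost = [1000.0 * math.exp(0.02 * t + random.gauss(0, 0.15)) for t in months]

# Fit a straight-line trend on the raw cost scale.
intercept, slope = fit_line(months, cost)
fitted = [intercept + slope * t for t in months]
resid = [y - f for y, f in zip(cost, fitted)]

# Order residuals by fitted value and compare their spread in each half.
pairs = sorted(zip(fitted, resid))
half = len(pairs) // 2
low_sd = statistics.stdev([r for _, r in pairs[:half]])
high_sd = statistics.stdev([r for _, r in pairs[half:]])
spread_ratio = high_sd / low_sd

print(f"residual SD, low fitted values:  {low_sd:.1f}")
print(f"residual SD, high fitted values: {high_sd:.1f}")
print(f"ratio (high / low):              {spread_ratio:.2f}")
```

A ratio well above 1 is the numeric counterpart of residuals that fan out as the predicted value increases; a ratio near 1 would suggest roughly constant spread.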

Next, I log-transformed the cost data. The QQ-plot is much more linear, the plot of the residuals vs. predicted values still has a slight fan shape but is considerably better, and the Shapiro-Wilk test yields a p-value of 0.8546. This would seem to suggest that building my model on log(cost) is the best option, and that I can just "un-log" (exponentiate) the estimated amounts to get estimated costs.
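As a minimal sketch of the "fit on log(cost), then un-log" step, assuming a simple straight-line trend in time (the actual model terms aren't stated here) and a made-up, noise-free series, the idea looks roughly like this. The helper `fit_line` and all variable names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up series for illustration only: exactly 1% growth per month.
months = list(range(72))            # Jan 2006 .. Dec 2011
cost = [1000.0 * 1.01 ** t for t in months]

# Fit the trend on the log scale.
log_cost = [math.log(c) for c in cost]
b0, b1 = fit_line(months, log_cost)

# Forecast the 12 months of 2012 on the log scale, then exponentiate.
future = list(range(72, 84))
pred_log = [b0 + b1 * t for t in future]
pred_cost = [math.exp(p) for p in pred_log]

# A slope of b1 on the log scale is a growth factor of exp(b1) per month.
monthly_growth = math.exp(b1) - 1
print(f"implied monthly growth: {monthly_growth:.2%}")
print(f"forecast for Jan 2012:  {pred_cost[0]:.2f}")
print(f"forecast for Dec 2012:  {pred_cost[-1]:.2f}")
```

One thing this makes visible: a linear trend on the log scale turns into a compounding trend once un-logged, so the forecasts keep accelerating the further out they go.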

However, as I stated, my ultimate goal is to forecast for future dates. Although the log(cost) model does a better job of modeling the actual data, it yields predicted costs that are extremely unlikely and, I would guess, impossible. If I proceed with the original (untransformed) cost data instead, the predicted amounts are more plausible and likely closer to the truth.

My question is this: although all signs point to the log(cost) model (i.e., checking the assumptions indicates that I should use the transformed model), does anyone think it unreasonable to simply use the original data rather than the log(cost) data?

I mean, in school I would likely have used the transformed data, since that's what profs and textbooks always said. But in real life things are not necessarily so black and white, and I'm curious whether anyone has an opinion on whether it's fair to simply "ignore" the assumptions in favor of the model that will likely produce a more accurate cost forecast.
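One way to put a number on "more accurate" would be a holdout check: fit both models on 2006 to 2010, forecast 2011, and compare the errors on the original cost scale. The sketch below is only an illustration under assumptions: the series is synthetic and deliberately built with an additive linear trend (so the raw-scale model is favored by construction and the outcome says nothing about the real data), both models are plain straight-line trends, and `fit_line` and `mae` are hypothetical helpers.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mae(actual, predicted):
    """Mean absolute error, measured on the original cost scale."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic stand-in for the 72 monthly costs (NOT the real data).
random.seed(1)
months = list(range(72))
cost = [1000.0 + 25.0 * t + random.gauss(0, 60.0) for t in months]

# Hold out the last 12 months (2011) and train on the first 60.
train_t, test_t = months[:60], months[60:]
train_y, test_y = cost[:60], cost[60:]

# Model A: straight-line trend on the raw cost.
a0, a1 = fit_line(train_t, train_y)
raw_forecast = [a0 + a1 * t for t in test_t]

# Model B: straight-line trend on log(cost), un-logged for comparison.
b0, b1 = fit_line(train_t, [math.log(y) for y in train_y])
log_forecast = [math.exp(b0 + b1 * t) for t in test_t]

mae_raw = mae(test_y, raw_forecast)
mae_log = mae(test_y, log_forecast)
print(f"holdout MAE, raw-cost model: {mae_raw:.1f}")
print(f"holdout MAE, log-cost model: {mae_log:.1f}")
```

Run on the real series, a comparison like this would at least show which model forecasts better out of sample, separately from which one passes the residual diagnostics.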

Thanks and sorry for the long question!
