Multicollinearity in Time Series Data

seanodlum

New Member
Hello all, this is my first post here -- hoping to find some helpful advice, and hope to dispense some in the future.

Here is my situation: I am building a model to attribute registrations on a web site driven by television advertising. I am comfortable with parsing the outcome variable (i.e., how many registrations are driven by TV, as opposed to online advertising, word of mouth, etc.) -- where I'm struggling a bit is with the attribution "within" TV.

Currently I have my data in time series format, with each 15-minute period constituting an observation and the registrations in those 15 minutes as the outcome. In addition to controls (day of week, etc.), the predictors are impressions (in thousands) on different television stations. To allow for some latency, I'm also including several lag terms, so for example the (much simplified) regression equation would look like:

$$\ln(\text{TV Registrations}) = \alpha + \beta_1\,\text{STA1} + \beta_2\,\text{STA1\_lagged} + \beta_3\,\text{STA2} + \beta_4\,\text{STA2\_lagged} + \epsilon$$
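For readers who want to reproduce this setup, here is a minimal sketch of building the lag terms and fitting the simplified equation by ordinary least squares. Everything in it is an assumption for illustration: the data are synthetic, `lag_matrix` is a hypothetical helper, and the coefficient values are made up rather than taken from the actual model.

```python
import numpy as np

def lag_matrix(x, n_lags):
    """Return columns [x_t, x_{t-1}, ..., x_{t-n_lags}].

    The first n_lags rows are dropped because they lack a full lag history.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    cols = [x[n_lags - k : n - k] for k in range(n_lags + 1)]
    return np.column_stack(cols)

# Synthetic impressions (in thousands) per 15-minute period for two stations.
rng = np.random.default_rng(0)
n = 2000
sta1 = rng.exponential(1.0, n)
sta2 = rng.exponential(1.0, n)

n_lags = 1  # one lag per station, as in the simplified equation
X = np.column_stack([lag_matrix(sta1, n_lags), lag_matrix(sta2, n_lags)])

# Made-up "true" coefficients: STA1, STA1_lagged, STA2, STA2_lagged.
true_beta = np.array([0.30, 0.15, 0.20, 0.10])
alpha = 1.0
# y plays the role of ln(TV registrations).
y = alpha + X @ true_beta + rng.normal(0.0, 0.05, len(X))

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # intercept first, then the four slopes
```

With more lags, only `n_lags` changes; each station then contributes `n_lags + 1` columns.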

This produces a decent, usable model for understanding the response driven by different TV stations. However I'd like to improve the model by accounting for accumulated impressions over a longer period. For example, when we go dark on TV for a week, we still see a significant baseline of TV registrations coming in, I presume as a result of having been on air in the preceding weeks.

To incorporate this into my model I've tried adding terms to the above equation representing the total cumulative impressions on the station over the preceding two weeks, coming up to (not overlapping with) the oldest lag term. The problem I'm running into is that this introduces serious multicollinearity problems (VIFs in the 100-200s), which means I can't trust my t-stats.
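For reference, a VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it directly from that definition; the `vif` function and the synthetic columns are illustrative assumptions, not the code used for the figures quoted above.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    Each column is regressed (with an intercept) on all the other columns,
    and VIF_j = 1 / (1 - R_j^2) = SS_total / SS_residual.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, y, rcond=None)[0]
        ss_res = np.sum((y - fitted) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        out[j] = ss_tot / ss_res
    return out

# Sanity check on synthetic data.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b

vif_independent = vif(np.column_stack([a, b]))     # both close to 1
vif_collinear = vif(np.column_stack([a, b, c]))    # all three strongly inflated
print(vif_independent)
print(vif_collinear)
```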

An added complexity -- which I can't say I completely understand -- is that there are no very high pairwise correlations among these predictors. Still, multicollinearity is a clear problem.
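One possible explanation, sketched here on synthetic data rather than the actual station data: a VIF measures how well a predictor is explained by all the other predictors jointly, so a predictor that is close to a linear combination of many others can have a very high VIF even though no single pairwise correlation is large. The example uses the fact that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 5000, 10

parts = rng.normal(size=(n, k))  # k unrelated predictors
# An extra predictor that is close to the sum of the other k.
total = parts.sum(axis=1) + rng.normal(scale=0.3, size=n)
X = np.column_stack([parts, total])

corr = np.corrcoef(X, rowvar=False)

# Largest absolute pairwise correlation, ignoring the diagonal.
off_diag = np.abs(corr[~np.eye(k + 1, dtype=bool)])

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
vifs = np.diag(np.linalg.inv(corr))

print("max |pairwise correlation|:", off_diag.max())
print("VIFs:", vifs)
```

Here every pairwise correlation stays modest, yet the VIF of the near-sum column is very large and the other columns are inflated as well. Whether something similar is happening with the cumulative-impression terms would need to be checked against the real data.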

I have thought about possibly combining some of these stations (there are about 50) so there are about 8-10 predictors rather than 50, but I'm not even sure this will solve the problem.

Any ideas? I would greatly appreciate any guidance.

Thanks,
Sean

chetan.apa

Member
Your lagged variable corresponds to impressions 15 minutes ago, right? If that is the case, you should introduce more lag variables.
I can tell you from my experience in a slightly different scenario, where the customer physically buys the product in a shop: we consider 1-day, 1-week, and 2-week lag effects of a TV advertisement on sales of the product.

seanodlum

New Member
Thanks for your reply. Yes, my lagged variable corresponds to impressions 15 minutes ago, and in fact I have 15 lag terms so I'm covering 4 hours in total.

I tried a different model specification where I have these lag terms, and then I have total impressions from 1 day prior (day n-1), total impressions from one week prior to that (n-2 through n-8), and one week prior to that (n-9 through n-15). My multicollinearity is less severe now, but still problematic: VIFs for my prior week variables in the 20s and 30s.
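A minimal sketch of how the three aggregate terms described above could be built from a series of daily impression totals. The `trailing_sum` helper and the toy input are assumptions for illustration; days without a full window of history are left as NaN.

```python
import numpy as np

def trailing_sum(daily, first_back, last_back):
    """For each day n, sum daily[n - last_back] through daily[n - first_back], inclusive.

    Days that do not have last_back days of history are set to NaN.
    """
    daily = np.asarray(daily, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(daily)])  # csum[i] = sum of daily[:i]
    out = np.full(len(daily), np.nan)
    for n in range(last_back, len(daily)):
        out[n] = csum[n - first_back + 1] - csum[n - last_back]
    return out

# Toy daily totals: day 0 has 1, day 1 has 2, ..., day 19 has 20.
daily = np.arange(1, 21)

prior_day = trailing_sum(daily, 1, 1)      # day n-1
prior_week_1 = trailing_sum(daily, 2, 8)   # days n-2 through n-8
prior_week_2 = trailing_sum(daily, 9, 15)  # days n-9 through n-15

print(prior_day[5], prior_week_1[10], prior_week_2[15])
```

Each daily value would then be repeated across that day's 15-minute observations before joining it to the lag terms.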

Any suggestions?