Hello everyone. I have quite a few questions and would like some help, advice, or a general pointer in the right direction.
I have a dataset that contains every home sold over the last 3 years with its sale price, its location (MSA), and its number of beds, baths, and sqft.
I want to build a model that would 'forecast' the house price given this information. Back in school, if this were cross-sectional data I would simply run an OLS regression. However, because of the time component, I don't believe a pure regression on just MSA, beds, baths, and sqft would make sense, because a house with the same characteristics will vary in price over time (e.g. a 2 bedroom, 1 bath, 1,000 sqft house in the same neighborhood was worth much less 10 years ago than it is today). In other words, we would have significant omitted variable bias from leaving out the "time" variable.
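To make the concern concrete, here is a small simulated sketch. Everything in it is an assumption made for illustration: the data-generating process, the $100 per sqft effect, the $2,000 per month trend, and the idea that later sales tend to be larger homes. It compares the sqft coefficient from OLS with and without a time regressor. Note that the coefficient is only biased when sale date is correlated with the included regressors; if it is not, omitting time mainly inflates the residual variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Sale month index, 0..35 (three years of monthly data).
month = rng.integers(0, 36, size=n)

# Assumption: homes sold later tend to be larger, so sqft is correlated with time.
sqft = 1000 + 10 * month + rng.normal(0, 200, size=n)

# Assumed data-generating process: $100 per sqft plus a $2,000/month market trend.
price = 50_000 + 100 * sqft + 2_000 * month + rng.normal(0, 20_000, size=n)


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones(n)
beta_no_time = ols(np.column_stack([ones, sqft]), price)
beta_with_time = ols(np.column_stack([ones, sqft, month]), price)

print("sqft coefficient, time omitted: ", round(beta_no_time[1], 1))
print("sqft coefficient, time included:", round(beta_with_time[1], 1))
```

In this setup the regression without `month` attributes part of the market trend to sqft, while the regression with `month` recovers a value close to the assumed 100.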
**Question #1:** Am I wrong in this assumption? Can one do a regular regression on data collected over time? (Under what conditions can one do this? What if the time series data is stationary?)
Because of this I am not sure what to do. I have had the following thoughts, but have some concerns with each. Could anyone help me determine if my thoughts are correct, and any problems with them?
**Method #1:** Ignore the other variables and simply build a time series model. I would take the average sale price per month for a given MSA and fit an ARIMA model to forecast future housing prices.
**Why I think this:** Since my dataset is at a lower level than MSA, I don't think an ARIMA model with additional independent X variables would make sense. For example: if I wanted to include sqft at the MSA level, would I model the average square footage? That doesn't really help when I have home-level sqft information.
**Problems with this method:** I am basically throwing away the information on sqft, beds, etc., which seems wasteful.
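For concreteness, Method #1 could be sketched roughly as follows. This is only an illustration on simulated data: the column names (`sale_date`, `msa`, `price`), the MSA labels, and the order ARIMA(1,1,0) with drift are all assumptions. To keep the sketch dependent on numpy and pandas only, that one special case is fitted by conditional least squares on the first differences; for general ARIMA orders a dedicated library (for example the `ARIMA` class in statsmodels) would be the usual choice.

```python
import numpy as np
import pandas as pd

# Hypothetical home-level data: one row per sale (column names are assumptions).
rng = np.random.default_rng(1)
n = 3000
sale_date = pd.to_datetime("2012-01-01") + pd.to_timedelta(
    rng.integers(0, 3 * 365, size=n), unit="D"
)
sales = pd.DataFrame({
    "sale_date": sale_date,
    "msa": rng.choice(["MSA_A", "MSA_B"], size=n),
})
months_elapsed = (sales["sale_date"].dt.year - 2012) * 12 + sales["sale_date"].dt.month - 1
sales["price"] = 200_000 + 1_500 * months_elapsed + rng.normal(0, 30_000, size=n)

# Step 1: collapse to one average sale price per MSA per month.
monthly = (
    sales.assign(month=sales["sale_date"].dt.to_period("M"))
    .groupby(["msa", "month"])["price"]
    .mean()
    .unstack("msa")
    .sort_index()
)


def fit_arima_110(y):
    """Fit ARIMA(1,1,0) with drift by conditional least squares.

    Model on first differences d_t = y_t - y_{t-1}:  d_t = c + phi * d_{t-1} + e_t
    """
    d = np.diff(np.asarray(y, dtype=float))
    X = np.column_stack([np.ones(len(d) - 1), d[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, d[1:], rcond=None)
    return c, phi


def forecast_arima_110(y, c, phi, steps):
    """Iterate the fitted recursion forward and cumulate differences back to levels."""
    y = np.asarray(y, dtype=float)
    last_level, last_diff = y[-1], y[-1] - y[-2]
    out = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        last_level = last_level + last_diff
        out.append(last_level)
    return np.array(out)


# Step 2: fit and forecast the next three monthly averages for one MSA.
series = monthly["MSA_A"].to_numpy()
c, phi = fit_arima_110(series)
forecast = forecast_arima_110(series, c, phi, steps=3)
print(forecast)
```

The forecast here is for the MSA-level monthly average only, which is exactly the limitation noted above: nothing about an individual home's sqft or beds enters the model.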
**Question #2:** Does this make sense as a model for what I am trying to do given the data I have?
**Method #2:** My other idea was to calculate the average home price of each MSA for various lagged time values, and then use these as independent variables in the regression model. For example, if I'm trying to forecast the home price in Jan 2015, I would create variables for "Dec 2014 average MSA house price", "Nov 2014 average MSA house price", etc.
This way I am basically using time series information, similar to Y(t-1) in a typical autoregressive time series model, but applied at a lower level. Doing this, I could also include the other variables: sqft, beds, baths, etc.
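For concreteness, Method #2 could be sketched roughly as follows. Again this is only an illustration on simulated data: the column names, the integer `month` index (months since the start of the sample), the choice of two lags, and the coefficients used to simulate prices are all assumptions. The sketch builds lagged MSA averages, joins them onto the home-level rows, and runs a plain OLS with numpy. Only lagged averages are used as features, so a home's own sale price never feeds into its own regressors.

```python
import numpy as np
import pandas as pd

# Hypothetical home-level data (all column names and coefficients are assumptions).
rng = np.random.default_rng(2)
n = 4000
sales = pd.DataFrame({
    "msa": rng.choice(["MSA_A", "MSA_B"], size=n),
    "month": rng.integers(0, 36, size=n),  # months since start of sample, 0..35
    "sqft": rng.normal(1500, 400, size=n).clip(500),
    "beds": rng.integers(1, 6, size=n),
    "baths": rng.integers(1, 4, size=n),
})
sales["price"] = (
    50_000
    + 100 * sales["sqft"]
    + 10_000 * sales["beds"]
    + 5_000 * sales["baths"]
    + 1_500 * sales["month"]
    + rng.normal(0, 20_000, size=n)
)

# Step 1: average price per MSA per month, then lag it within each MSA.
# shift() lags by rows, so this assumes every MSA has a row for every month.
msa_avg = (
    sales.groupby(["msa", "month"])["price"]
    .mean()
    .rename("msa_avg_price")
    .reset_index()
    .sort_values(["msa", "month"])
)
lag_cols = []
for lag in (1, 2):
    col = f"msa_avg_lag{lag}"
    msa_avg[col] = msa_avg.groupby("msa")["msa_avg_price"].shift(lag)
    lag_cols.append(col)

# Step 2: attach the lagged MSA averages to each home sale.
# Sales in the first two months have no full lag history and are dropped.
model_df = sales.merge(
    msa_avg[["msa", "month"] + lag_cols], on=["msa", "month"], how="left"
).dropna(subset=lag_cols)

# Step 3: OLS of price on home attributes plus the lagged MSA averages.
features = ["sqft", "beds", "baths"] + lag_cols
X = np.column_stack([np.ones(len(model_df)), model_df[features].to_numpy(dtype=float)])
y = model_df["price"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(dict(zip(["intercept"] + features, coef.round(2))))
```

To forecast a sale in a future month, one would plug in the home's attributes together with the most recent observed MSA averages as the lag values.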
**Question #3:** Does my above procedure make sense? Can one take a "time series index" and apply it to lower level (home) data? Basically it would be similar to an AR(p) model, but from data at a higher level. I can't see why this would be a problem.
Thanks all and any help is appreciated!