I have a large dataset with many subjects, each with responses from consecutive years going back 10 years (i.e., about 100,000 persons per year over 10 years, though not necessarily 10 data points per person, since some were not part of the study in earlier years).
I have data on each specific person that varies within each year, and data that varies only by year: every one of the 100,000 persons in a given year has the same value, and that value changes from year to year. Essentially, each such variable has just 10 distinct values.
What complicates this is that I have multiple variables like this (housing values, opinion surveys, etc.) that have a significant impact on how an individual behaved in a given year. I have also created a few lag variables for each of them, each measuring the change in value from a specific time in the past (e.g., change in home value from 1 year ago, 2 years ago, 4 years ago, etc.).
I am running into a linear dependence issue with some of these variables: when I run the regression, the model statistics are the same, but I can get very different results, as well as different predictions on test data.
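A minimal sketch of how this kind of dependence can arise, using made-up numbers rather than the actual data. The three year-level variables, the group size of 50 people per year, and all names (`year_level_block`, `person_var`, and so on) are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_years = 10           # years in the study
people_per_year = 50   # hypothetical; stands in for ~100,000
lags = (1, 2, 4)       # change from 1, 2 and 4 years ago
max_lag = max(lags)

def year_level_block(rng):
    """Level and lagged changes for one made-up year-level variable."""
    # Extra leading years so every study year has a value max_lag years back.
    history = rng.normal(size=n_years + max_lag)
    level = history[max_lag:]
    cols = [level]
    for k in lags:
        lagged = history[max_lag - k : max_lag - k + n_years]
        cols.append(level - lagged)  # change relative to k years ago
    return np.column_stack(cols)     # shape (n_years, 1 + len(lags))

# Three hypothetical year-level variables, each with its level and 3 changes.
blocks = [year_level_block(rng) for _ in range(3)]
year_matrix = np.column_stack([np.ones(n_years)] + blocks)  # intercept + 12 columns

# Every person observed in a year gets that year's row.
year_part = np.repeat(year_matrix, people_per_year, axis=0)
person_var = rng.normal(size=(n_years * people_per_year, 1))  # varies by person
X = np.hstack([year_part, person_var])

rank_year = np.linalg.matrix_rank(year_part)
rank_full = np.linalg.matrix_rank(X)
print(f"year-level columns: {year_part.shape[1]}, rank: {rank_year}")
print(f"all columns: {X.shape[1]}, rank: {rank_full}")
```

In this sketch the 13 year-level columns (intercept included) take only 10 distinct rows, so their rank cannot exceed 10 no matter how many people are observed. The least-squares coefficients on those columns are then not uniquely determined, even though the fitted values on the training data are.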
What is the best way to handle these types of variables: transformations, modeling techniques, or other methods?