# Dealing with linearly dependent variables

#### ccooper

##### New Member
I have a large dataset with many subjects, each with yearly responses going back up to 10 years (i.e., 100,000 persons per year over 10 years, though not necessarily 10 data points per person, as some may not have been part of the study in earlier years).

I have data on each specific person that varies within each year, and data that varies only by year: all 100,000 persons in a given year share the same value, and that value changes from year to year. Essentially, each such variable has just 10 distinct values.

What complicates this is that I have multiple variables like this (housing values, opinion surveys, etc.) that have a significant impact on how an individual behaved in a given year. I have also created a few lag variables for each of them, each measuring the change in value from a specific time in the past (e.g., change in home value from 1 year ago, 2 years ago, 4 years ago, etc.).
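To make the lag variables concrete, here is a minimal sketch of how such change-from-the-past columns might be built for one year-level series. The `home_value` numbers and the `change_from_lag` helper are made up for illustration and are not taken from the actual data.

```python
import numpy as np

def change_from_lag(values, k):
    """Change in a yearly series relative to k years earlier.

    Years with no value k years back are left as NaN.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    out[k:] = values[k:] - values[:-k]
    return out

# Hypothetical home-value index, one number per year (10 years).
home_value = np.array([100., 104., 103., 110., 118., 121., 119., 125., 131., 140.])

# One derived column per lag: change from 1, 2, and 4 years ago.
changes = {k: change_from_lag(home_value, k) for k in (1, 2, 4)}
```

Every derived column is still a function of the year alone, so it adds another variable with at most 10 distinct values.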

I am running into a linear dependence issue with some of these variables: when I run the regression, the model statistics are the same, but I can get very different results as well as very different predictions on test data.
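A small simulation may help show why this dependence is structural rather than accidental. Because every person in a year shares the same value, all year-level columns lie in the span of 10 year indicators, so an intercept plus more than 9 such variables cannot have full column rank, no matter how many persons there are. That would be consistent with identical fit statistics alongside unstable coefficients. The sizes and random values below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years = 10
n_per_year = 1000   # assumed; much smaller than the real data, for speed
n_year_vars = 12    # assumed number of year-level variables, lags included

# One value per year for each year-level variable (hypothetical data).
year_table = rng.normal(size=(n_years, n_year_vars))

# Copy each year's values onto every person observed in that year.
year_idx = np.repeat(np.arange(n_years), n_per_year)
X_year = year_table[year_idx]

# Design matrix: intercept plus the year-level columns.
X = np.column_stack([np.ones(len(year_idx)), X_year])

n_cols = X.shape[1]
rank = np.linalg.matrix_rank(X)
print(f"columns: {n_cols}, rank: {rank}")
```

The rank is capped by the number of distinct years, not by the number of rows, so adding more persons does not help identify these coefficients.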

What is the best way to handle these types of variables: transformations, modeling techniques, or other methods?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
So the variables are multicollinear and/or have mathematical coupling?

#### ccooper

##### New Member
> So the variables are multicollinear and/or have mathematical coupling?

They are multicollinear in that I have only a few unique values for each variable and a large dataset. I would like to treat the variables as continuous, as they are percentages and a change in value has a meaningful effect on the dependent variable. Furthermore, I have several variables like this.

In my case, I am trying to predict whether an individual will vote. I have lots of individual records from 10 years' worth of elections. I also have some public opinion polling and economic data recorded at the time of voting. Because each of these variables has only a handful of distinct values and there are only a finite number of years of data, they tend to be collinear. The next issue I am trying to work through is that, for each of the polling and economic variables, I have calculated a few extra variables capturing the change over time, e.g., change in approval from last year to this year, change in home value from last year to this year, etc.

When I move these variables in or out of the analysis, the predicted values on my test data set tend to increase or decrease substantially. I want to find the variables that give the best predictions, but I am not sure whether I need to reduce/remove these variables or perform dimension reduction on them. If so, what would be the best techniques given how the variables are assembled into the data?
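For reference, one form the dimension-reduction option could take is principal components computed on the 10-row year-level table rather than on the person-level data, with the resulting scores mapped back to persons by year. The sketch below assumes NumPy only; `year_level_pca`, the 95% variance threshold, and the simulated `year_table` are all hypothetical choices for illustration, not a recommendation over dropping variables or other approaches.

```python
import numpy as np

def year_level_pca(year_table, var_explained=0.95):
    """PCA on a (years x variables) table via SVD.

    Returns the component scores (one row per year) and the number
    of components kept to reach the requested share of variance.
    """
    # Standardize each year-level variable across years.
    Z = (year_table - year_table.mean(axis=0)) / year_table.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    share = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(share), var_explained) + 1)
    k = min(k, len(s))
    scores = U[:, :k] * s[:k]
    return scores, k

# Hypothetical year-level table: 6 variables driven by 2 latent factors.
rng = np.random.default_rng(1)
n_years, n_vars = 10, 6
latent = rng.normal(size=(n_years, 2))
loadings = rng.normal(size=(2, n_vars))
year_table = latent @ loadings + 0.05 * rng.normal(size=(n_years, n_vars))

scores, k = year_level_pca(year_table)

# Map the year-level scores back onto person-level rows by year index.
year_idx = np.repeat(np.arange(n_years), 5)  # 5 persons per year, for brevity
X_person = scores[year_idx]
```

The score columns are mutually orthogonal by construction, which removes the collinearity among the year-level predictors at the cost of less directly interpretable coefficients.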

#### ccooper

##### New Member
Any thoughts? Should this be in a different area on the forums?