Violating Linear Regression assumptions

#1
I have a dataset on which I am performing linear regression with multiple covariates. The goal of the analysis is to identify the covariates (categorical and continuous) that have an important effect on the response. This is a natural resource dataset, so it is not as clean-cut as other data I have worked with (lots of noise). My issue has been meeting the assumptions of no autocorrelation and normally distributed residuals.

My first tactic was to eliminate every other observation to remove the autocorrelation (which worked) and then eliminate outliers of the response until the residuals were normally distributed. Unfortunately, the outliers I eliminated contain valuable system relationships that are no longer present in the analysis (i.e., a large increase in the response based on actions taken at a few points). Would it be reasonable to violate the assumption of normally distributed residuals when the outcome makes sense in the system and we are trying to identify important system covariates? Any thoughts about autocorrelation would be great too.
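One common way to check whether thinning the data actually removed the autocorrelation is the Durbin-Watson statistic on the model residuals. The following is only a minimal sketch: it assumes the residuals are already computed and ordered along some meaningful sequence (collection order, a transect, time), and the function name `durbin_watson` and the toy residuals are illustrative, not taken from the dataset discussed here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in their natural order.

    Values near 2 suggest little first-order autocorrelation, values well
    below 2 suggest positive autocorrelation, and values well above 2
    suggest negative autocorrelation.
    """
    # Sum of squared differences between successive residuals
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    # Sum of squared residuals
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical residuals: neighbours share the same sign (positive autocorrelation)
smooth = [1.0, 1.0, -1.0, -1.0]
# Hypothetical residuals: neighbours alternate in sign (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0]

print(durbin_watson(smooth))
print(durbin_watson(alternating))
```

Comparing the statistic on the full data and on the thinned data would show how much the thinning helped, under the ordering assumption above.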
 

hlsmith

Not a robit
#2
Do you know the underlying cause of these concerns? The more you alter the data set, the less transportable your results will be.
 
#3
Are you talking about the concern of eliminating outliers? My goal is to alter the data as little as possible, but given the non-normality of the residuals I was concerned that I would be biasing the results if I did not eliminate a small set of outliers. If not meeting the normality assumption is OK in this situation, I would rather use all the data to make the results more transportable.
 

hlsmith

Not a robit
#4
No, I was referencing the source of autocorrelated errors (e.g., time series data etc.). You can transform data to try and normalize the errors, if that is a concern.
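As a rough illustration of the transformation idea, a log transform of a right-skewed response often pulls in a long upper tail. The sketch below only compares the sample skewness of a made-up response before and after taking logs; it assumes a strictly positive response, and `sample_skewness` and the toy values are hypothetical. In practice the model would be refit on the transformed response and the residuals rechecked.

```python
import math


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical response with one very large value (strictly positive)
response = [1.0, 2.0, 3.0, 4.0, 100.0]
log_response = [math.log(v) for v in response]

# The log transform should reduce, though not necessarily remove, the skew
print(sample_skewness(response))
print(sample_skewness(log_response))
```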


See the following paper, which describes autocorrelated errors and how to correct for them. I have never had this issue, so I do not have firsthand experience with the correction method.


http://www.lexjansen.com/nesug/nesug03/st/st003.pdf
 
#5
The underlying cause of the autocorrelation is the spatial relationship between the locations where the data were collected: data taken from locations closer to each other are more similar than data taken from locations farther apart.
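Spatial similarity of this kind is often quantified with Moran's I computed on the regression residuals. The following is a minimal sketch, not a definitive implementation: it assumes inverse-distance weights, planar coordinates, and no two observations at exactly the same location, and the name `morans_i` and the toy coordinates are illustrative only.

```python
import math


def morans_i(coords, values):
    """Moran's I with inverse-distance weights (assumes no duplicate coords)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    weight_sum = 0.0   # sum of all weights w_ij
    cross = 0.0        # sum of w_ij * dev_i * dev_j
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            weight_sum += w
            cross += w * dev[i] * dev[j]

    return (n / weight_sum) * cross / sum(d * d for d in dev)


# Two hypothetical clusters of sites, 10 units apart
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Nearby sites have similar values: Moran's I comes out positive
print(morans_i(coords, [1.0, 1.0, 5.0, 5.0]))
# Nearby sites have dissimilar values: Moran's I comes out negative
print(morans_i(coords, [1.0, 5.0, 1.0, 5.0]))
```

Under no spatial autocorrelation the expected value is -1/(n-1), so values clearly above that point to the "closer is more similar" pattern described here.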
 

hlsmith

Not a robit
#7
Hint, if you say yes I am going to ask you to think about multi-level modeling which is designed for related data (i.e., clusters).
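Before moving to a multi-level model, one quick check of how strongly observations cluster is the intraclass correlation (ICC) from a one-way ANOVA. This is a sketch under stated assumptions: the groups are balanced (same number of observations per cluster), and the function name `icc_oneway` and the toy clusters are hypothetical rather than anything from the thread.

```python
def icc_oneway(groups):
    """ICC(1) from a balanced one-way ANOVA.

    groups: list of equally sized lists, one list per cluster.
    Values near 1 mean most of the variance lies between clusters.
    """
    g = len(groups)
    k = len(groups[0])
    if any(len(grp) != k for grp in groups):
        raise ValueError("this sketch assumes balanced groups")

    means = [sum(grp) / k for grp in groups]
    grand = sum(means) / g

    # Between-cluster and within-cluster mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (g - 1)
    msw = sum(
        (x - m) ** 2 for grp, m in zip(groups, means) for x in grp
    ) / (g * (k - 1))

    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical clusters whose members are very similar to each other
print(icc_oneway([[1.0, 1.2], [5.0, 5.2], [9.0, 9.2]]))
```

A high ICC suggests that a model with a cluster-level random effect is worth considering, while a value near or below zero suggests the clustering adds little.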
 

hlsmith

Not a robit
#9
kiton,


Thanks, this is a really interesting topic. As I mentioned above, I have not experienced this issue. Though, if I were not so busy, I could spend the whole day reviewing the topic. I actually ran an inventory in my head of my projects to see if I could come up with an excuse to just immerse myself in this area for the day. Nope.


I wasn't thinking of this in particular, but indirectly this is why I asked the OP about the potential mechanism, so that a targeted counter-approach could be employed.