Impact of heteroscedasticity and non-normality on utility of regression result?



I am using least squares 2nd order polynomial regression to determine the effect of a single independent variable (X) on a dependent variable (Y) (i.e. I am interested in the trend between X and Y). Both X and Y represent properties of radar satellite measurements.

Unfortunately, my Y's are not evenly distributed across the range of X: I have more Y's paired with high X-values than with low ones. Also, it is a given that Y is strongly affected by variables other than X, for example air temperature, wind speed and target wetness. I have no data on any of these, so I cannot include them in a multiple regression. As can be expected, my plots of Y versus X show signs of heteroscedasticity; this is confirmed by testing. In addition, tests show that the distribution of the residuals is not fully normal. Both tests were done at the 5% level.
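For illustration, here is a minimal sketch of how a heteroscedasticity check on a 2nd-order polynomial fit can be run. It uses synthetic stand-in data (not the radar measurements), and the choice of test is an assumption: the post does not name the tests used, so Koenker's studentized form of the Breusch-Pagan test is shown as one common option. All variable names and the data-generating numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): more points at high X,
# and noise whose spread grows with X.
n = 1000
x = 10.0 * rng.beta(4.0, 1.5, size=n)
y = 1.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0.0, 0.2 + 0.3 * x, size=n)

# Design matrix for a 2nd-order polynomial: columns 1, X, X^2.
A = np.column_stack([np.ones(n), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Auxiliary regression: squared residuals on the same regressors.
u = resid**2
gamma, *_ = np.linalg.lstsq(A, u, rcond=None)
fitted = A @ gamma
r2_aux = 1.0 - np.sum((u - fitted) ** 2) / np.sum((u - u.mean()) ** 2)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square distribution with 2 degrees of freedom
# (two non-constant regressors: X and X^2).
lm_stat = n * r2_aux
CHI2_CRIT_5PCT_DF2 = 5.991  # 5% critical value, chi-square with 2 df
heteroscedastic = lm_stat > CHI2_CRIT_5PCT_DF2
print(f"LM = {lm_stat:.1f}, reject homoscedasticity at 5%: {heteroscedastic}")
```

A statistic above the critical value means homoscedasticity is rejected at the 5% level, which is the situation described above.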

From what I read, least squares regression is quite robust to the violations present in my data set. The resulting estimates of the regression coefficients apparently are unbiased. In practical terms, I figure this means my regression line reflects the true trend between X and Y and can be used to estimate how Y changes with X (provided all other influencing variables are held constant).
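The unbiasedness claim can be checked with a small simulation. The sketch below is an assumption-laden illustration, not an analysis of the actual data: it repeatedly generates synthetic samples with X skewed toward high values and with skewed, zero-mean noise whose spread grows with X, fits a 2nd-order polynomial each time, and averages the fitted coefficients. The "true" coefficients and noise model are invented for the example, and the noise is generated independently of X, so the sketch says nothing about bias from omitted variables that are correlated with X.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true coefficients, highest power first (np.polyfit ordering).
true_coefs = np.array([-0.03, 0.5, 1.0])

n, n_sims = 200, 2000
estimates = np.empty((n_sims, 3))
for i in range(n_sims):
    x = 10.0 * rng.beta(4.0, 1.5, size=n)      # more samples at high X
    scale = 0.2 + 0.3 * x                      # error spread grows with X
    # Skewed, zero-mean noise: exponential shifted to mean 0, std = scale.
    noise = scale * (rng.exponential(1.0, size=n) - 1.0)
    y = np.polyval(true_coefs, x) + noise
    estimates[i] = np.polyfit(x, y, deg=2)

mean_est = estimates.mean(axis=0)
print("true coefficients:", true_coefs)
print("mean of estimates:", mean_est.round(3))
```

If the averaged estimates sit close to the true values, the fitted curve is centred on the true trend even though the errors are neither homoscedastic nor normal. Any single fit can still be noisy.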

The impact of severe violations of the homoscedasticity and normal-residuals assumptions appears to be limited, in the sense that they only render the standard errors (SE) and significance of the regression coefficients suspect/incorrect. I would think that exactly the same is true for the SE and p-value of the overall regression model/the estimator, but again, I'd appreciate CONFIRMATION!
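To make the point about standard errors concrete, the sketch below compares the classical SEs, which assume a constant error variance, with White's heteroscedasticity-consistent (HC0) "sandwich" SEs for the same fit. It again uses synthetic data with invented parameters, and the robust estimator is brought in only for illustration; the post itself does not mention it. The coefficients are identical in both cases; only the uncertainty attached to them changes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in data (assumption): error spread grows with X.
n = 1000
x = 10.0 * rng.beta(4.0, 1.5, size=n)
y = 1.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0.0, 0.2 + 0.3 * x, size=n)

# Ordinary least squares for Y = b0 + b1*X + b2*X^2.
A = np.column_stack([np.ones(n), x, x**2])
XtX_inv = np.linalg.inv(A.T @ A)
beta = XtX_inv @ A.T @ y
resid = y - A @ beta

# Classical covariance: one common error variance for all observations.
sigma2 = resid @ resid / (n - A.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) covariance: each observation contributes its own
# squared residual instead of the pooled variance.
meat = A.T @ (A * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

for name, b, s_c, s_r in zip(["b0", "b1", "b2"], beta, se_classical, se_robust):
    print(f"{name}: estimate={b:.4f}  classical SE={s_c:.4f}  robust SE={s_r:.4f}")
```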

My main question is this ...
What exactly is the impact of the noted violations in terms of the utility of my regression model?
Some indicate that the model will not be the BEST possible model. However, does this mean 'not the best possible 2nd order polynomial model that uses my X' or 'not as good as a model that would e.g. be 3rd order or include X plus other independent variables'?
Also, what is the implication in terms of being able to use the model produced from the available data set to support the analysis of other similar but different (e.g. later date) data sets? Are there other practical implications? Should I even care if the model proves to be useful?

Clearly lots of questions; I would appreciate any advice/insights you are willing to share!

Thank You