Heteroscedasticity with large number of cases

noetsi

Fortran must die
#1
I'm ignoring here that I commonly have whole populations rather than samples :p

I have heard that heteroscedasticity is not an issue when you have large sample sizes (I normally have ten thousand plus cases) because, even if it exists, the statistical tests will be asymptotically correct. But I have also read that this is not the case: because heteroscedasticity affects the assumed distribution, it will invalidate the statistical tests even with large sample sizes.

It would be nice if statisticians could agree on something :)
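
One way to probe the question numerically is a small simulation. The sketch below is an illustration only, not part of the original post: it assumes a single predictor x and an error SD equal to x^2 (both chosen arbitrarily), and compares the average classical standard error of the slope with the actual sampling spread of the slope at two sample sizes.

```r
set.seed(1)

# One simulated data set: y = 1 + 2*x + error, with error SD equal to x^2.
# The variance function is an assumption made only for this illustration.
slope_and_se <- function(n) {
  x <- runif(n)
  y <- 1 + 2 * x + rnorm(n, sd = x^2)
  s <- summary(lm(y ~ x))$coefficients
  c(slope = s["x", "Estimate"], se = s["x", "Std. Error"])
}

# Repeat the simulation and summarise the slope and its classical SE.
compare <- function(n, reps = 300) {
  sims <- replicate(reps, slope_and_se(n))   # 2 x reps matrix
  c(n = n,
    mean_slope = mean(sims["slope", ]),
    sd_of_slope = sd(sims["slope", ]),        # actual sampling spread
    mean_classical_se = mean(sims["se", ]))   # what lm() reports on average
}

result <- rbind(compare(200), compare(5000))
ratio <- result[, "mean_classical_se"] / result[, "sd_of_slope"]
print(cbind(result, ratio))
```

In this setup the slope is centred on its true value at both sample sizes, but the ratio of the classical SE to the actual spread stays clearly below 1 instead of approaching 1 as n grows. How far off the classical SE is depends on how the error variance relates to the predictors, so other variance patterns can give a smaller or larger discrepancy.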
 
rogojel

TS Contributor
#2
Re: Heteroscedasticity with large number of cases

I am just a practitioner, but ever since I came across generalized least squares I keep wondering why we do not just switch to it and deal with the variance structure directly?
 

noetsi

Fortran must die
#3
Re: Heteroscedasticity with large number of cases

The suggestion I commonly follow is to use White SE. But I am honestly not sure you even need to do that with so many cases - which is really the point of this thread. There is disagreement about whether hetero has an impact with very large sample sizes.
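
For reference, White (HC0) standard errors can be computed by hand in base R. This is a minimal sketch on simulated data, not the poster's own code; the data-generating model (error SD equal to x^2) and all variable names are assumptions for illustration.

```r
set.seed(42)

# Simulated data with heteroscedastic errors (assumed form: SD = x^2).
n <- 10000
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = x^2)
fit <- lm(y ~ x)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # sum over i of e_i^2 * x_i x_i'
vcov_hc0 <- bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_white     <- sqrt(diag(vcov_hc0))
print(cbind(se_classical, se_white))
```

The coefficients are the ordinary least squares ones; only the standard errors change. The sandwich package's vcovHC(fit, type = "HC0") should give the same matrix and is left out here only to keep the sketch free of extra dependencies.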

Incidentally R. A. Fisher should be shot for using this word, one of the hardest to spell in the entire English language. I usually just call it hetero.

Of course he is dead and a statistical legend so it's probably moot.
 

rogojel

TS Contributor
#4
Re: Heteroscedasticity with large number of cases

The suggestion I commonly follow is to use White SE. But I am honestly not sure you even need to do that with so many cases - which is really the point of this thread. There is disagreement about whether hetero has an impact with very large sample sizes.
The problem with this would be the need to explain clearly, every time, why you think heteroskedasticity is not an issue for the particular data set. It probably depends on both the size of the data set and how much the variance changes, so there would rarely be a clear-cut decision.

My guess is that it would be easier and less controversial to have a standard procedure that includes addressing the variance structure, and to always do the analysis this way.

Incidentally R. A. Fisher should be shot for using this word, one of the hardest to spell in the entire English language. I usually just call it hetero.

Of course he is dead and a statistical legend so it's probably moot.
I run some training courses in basic statistics and I always use the word to get some laughs :) like telling trainees to use it if they want to reeeeallly show off.

regards
 

hlsmith

Omega Contributor
#5
A couple of comments: does White's test provide a test statistic that can be directly interpreted? What I am getting at is whether you can look at its effect size per se, to possibly get around the large sample size. Then you could say that the size of the heteroscedasticity is actually fairly small.
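
One possible reading of "effect size" here is the R-squared of the auxiliary regression behind White's test: the squared residuals are regressed on the predictors, their squares and cross-products, and the test statistic is n times that R-squared, compared with a chi-square distribution. The sketch below is an illustration on simulated data with one predictor; the data-generating model (error SD rising by 30% over the range of x) is an assumption.

```r
set.seed(7)

# Simulated data with moderate heteroscedasticity (assumed form).
n <- 10000
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression for White's test with a single predictor:
# squared residuals on x and x^2.
e2  <- resid(fit)^2
aux <- lm(e2 ~ x + I(x^2))

aux_r2  <- summary(aux)$r.squared   # share of variation in e^2 explained
lm_stat <- n * aux_r2               # White's test statistic
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

print(c(aux_r2 = aux_r2, lm_stat = lm_stat, p_value = p_value))
```

With this many cases the test rejects even though the auxiliary R-squared is small, which is the situation the question describes. Whether a small auxiliary R-squared means the standard errors are safe is a separate matter, since the bias in the classical SE depends on how the variance lines up with the predictors.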

I am similar to noetsi in that if I think there is a threat I use sandwich estimators. For me this is because I typically only run about one linear regression model a year. It was my understanding that you need to have some theories about the cause of the heteroskedasticity to use GLS appropriately. Rogojel, what approaches do you usually use when applying GLS?
 

rogojel

TS Contributor
#7
It was my understanding that you need to have some theories about the cause of the heteroskedasticity to use GLS appropriately. Rogojel, what approaches do you usually use when applying GLS?
Hi,
I pretty much follow the recommendations of Zuur et al.:

https://www.amazon.de/Effects-Extensions-Ecology-Statistics-Biology-ebook/dp/B008CLYMQW/ref=sr_1_2?ie=UTF8&qid=1476615358&sr=8-2&keywords=Zuur

check the residuals, have some theory about the source of heteroskedasticity and build a model accordingly, check the new residual pattern and repeat.
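
The loop described above can be sketched in R with nlme. This is an illustration on simulated data, not code from Zuur et al.; the data-generating model (error SD proportional to x), the choice of varPower as the candidate structure and the tercile-based check are all assumptions.

```r
library(nlme)
set.seed(11)

# Simulated data where the error SD grows with x (assumed form).
n <- 2000
d <- data.frame(x = runif(n, 1, 10))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5 * d$x)

# Step 1: fit with constant variance and check the residual spread
# in the lower, middle and upper third of x.
m0 <- gls(y ~ x, data = d)
bins <- cut(d$x, breaks = quantile(d$x, 0:3 / 3), include.lowest = TRUE)
spread_before <- tapply(as.numeric(resid(m0)), bins, sd)

# Step 2: the spread grows with x, so model the variance as a power of x.
m1 <- gls(y ~ x, data = d, weights = varPower(form = ~x))

# Step 3: recheck using normalized residuals, which are scaled by the
# fitted standard deviation and should now have similar spread everywhere.
spread_after <- tapply(as.numeric(resid(m1, type = "normalized")), bins, sd)

print(rbind(spread_before, spread_after))
```

In practice the check is usually graphical, for example plot(fitted(m1), resid(m1, type = "normalized")), and the loop is repeated with another variance structure if a pattern remains.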

regards
 

hlsmith

Omega Contributor
#8
So you initially assume random effects, then try different variance structures, then replot the residuals and see if they look better?

The first part is similar to what I do with mixed models; then I look at AICc.
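
Candidate variance structures fitted with gls can also be compared by information criteria. The sketch below is an illustration on simulated data; the data-generating model and the three candidates are assumptions, and it uses AIC() rather than AICc, since the small-sample correction is negligible at sample sizes like these.

```r
library(nlme)
set.seed(23)

# Simulated data where the error SD is proportional to x (assumed form).
n <- 2000
d <- data.frame(x = runif(n, 1, 10))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5 * d$x)

# Same mean model, three candidate variance structures.
m_const <- gls(y ~ x, data = d)                                 # constant variance
m_fixed <- gls(y ~ x, data = d, weights = varFixed(~x))         # variance proportional to x
m_power <- gls(y ~ x, data = d, weights = varPower(form = ~x))  # variance as an estimated power of x

# All models use the default REML fit with identical fixed effects,
# so their AIC values are comparable.
aic_table <- AIC(m_const, m_fixed, m_power)
print(aic_table[order(aic_table$AIC), ])
```

A lower AIC only says that one variance structure describes the data better than another candidate; the residual plots still need to be checked for the chosen model.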
 
#10
Hi,
I pretty much follow the recommendations of Zuur et al.:

https://www.amazon.de/Effects-Extensions-Ecology-Statistics-Biology-ebook/dp/B008CLYMQW/ref=sr_1_2?ie=UTF8&qid=1476615358&sr=8-2&keywords=Zuur

check the residuals, have some theory about the source of heteroskedasticity and build a model accordingly, check the new residual pattern and repeat.

regards
Does this mean that you use a glm - generalized linear model?

What distribution? And what link function? How do you specify the heteroscedasticity?
 

rogojel

TS Contributor
#11
Hi,
nope, the gls function from the nlme package. It is a weighted linear regression, but it allows easy specification of different variance structures. E.g. if the hypothesis is that the variance increases with the values of one variable, you can specify something like

vmod=varFixed(~MyVar)
and add it to the call to gls, like

gls(...., weights=vmod,..)
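
A complete, runnable version of the two snippets above might look like the following. The data are simulated and the response name y is hypothetical; MyVar and vmod follow the post. varFixed(~MyVar) takes the error variance to be proportional to MyVar, so the covariate must be positive, and the simulation is set up to match that assumption.

```r
library(nlme)
set.seed(5)

# Simulated data: error variance proportional to MyVar (assumed form).
n <- 2000
d <- data.frame(MyVar = runif(n, 1, 10))
d$y <- 1 + 2 * d$MyVar + rnorm(n, sd = sqrt(d$MyVar))

# Variance structure and GLS fit, as in the post.
vmod <- varFixed(~MyVar)
fit_gls <- gls(y ~ MyVar, data = d, weights = vmod)

# The same estimates from an ordinary weighted regression,
# with weights equal to 1 / MyVar.
fit_wls <- lm(y ~ MyVar, data = d, weights = 1 / MyVar)

print(summary(fit_gls)$tTable)
print(coef(fit_wls))
```

With varFixed there is no variance parameter to estimate, which is why the result coincides with a weighted regression. Structures such as varPower or varExp estimate the shape of the variance function from the data instead.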

regards