When the central limit theorem works...and when it doesn't?

noetsi

Fortran must die
#1
I had always thought that given, say, 100 cases the central limit theorem always worked. But today I read (in a book about multilevel analysis) that...

The central limit theorem holds in practice ... if the individual variances are small compared to the total variance. For example, the heights of women in the United States follow an approximate normal distribution. The central limit theorem applies here because height is affected by many small additive factors. In contrast, the distribution of heights of all adults in the United States is not so close to normality. The central limit theorem does not apply here because there is a single large factor, sex, that represents much of the total variation.
I actually did not think the Central Limit Theorem ever applied to raw data. I thought it applied to the distribution of the statistic.
 

Dason

Ambassador to the humans
#2
In that example they're considering the single variable to be a sum of a lot of different factors. Like how you might consider the amount of time it takes to drive to work to be the amount of time it takes to drive from home to point A, from point A to point B, from point B to point C, and from point C to work.

The amount of time it takes me to write this reply can be thought of as the sum of the amount of time it takes to write each word. Sometimes people like to think about their data as derived from other variables in this fashion.
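A quick simulation of that idea (the numbers are made up for illustration): a total built as the sum of many small independent pieces ends up looking roughly normal, even though each piece on its own is not normal at all.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

n_trips = 10_000     # simulated "drives to work"
n_segments = 50      # small independent legs per trip

# each leg takes between 0.5 and 1.5 minutes, independently (invented numbers)
segment_times = rng.uniform(0.5, 1.5, size=(n_trips, n_segments))
trip_times = segment_times.sum(axis=1)

# the sum of many small independent terms is close to normal (CLT);
# skewness near 0 is one quick indication of that
print(trip_times.mean(), trip_times.std(), skew(trip_times))
```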
 

rogojel

TS Contributor
#3
hi,
the CLT is talking about the sum of many independent random variables. This can be raw data if it results from the additive effects of many small influences or a statistic such as the mean.

I bet the mean height of inhabitants in the US would follow a normal distribution, e.g. if you took random samples of 100 people and calculated the mean height of each group. The individual heights would not, because they are not the result of the small effects of many random variables; there is one variable that has a large effect: sex.
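A rough simulation of this (the height figures below are invented, not real survey values): individual heights drawn from two groups form a mixture that is not well described by one normal curve, while the means of random groups of 100 are approximately normal with a much smaller spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # hypothetical population size

# invented parameters: roughly half female (~162 cm), half male (~176 cm)
sex = rng.integers(0, 2, size=n)
heights = np.where(sex == 0,
                   rng.normal(162, 6, size=n),
                   rng.normal(176, 7, size=n))

# raw individual heights: a two-group mixture, not a single normal curve;
# means of random groups of 100: approximately normal, much tighter spread
group_means = heights.reshape(-1, 100).mean(axis=1)
print(heights.std(), group_means.std())  # spread of the means shrinks ~10x
```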

1 minute too late :)

regards
 

noetsi

Fortran must die
#4
In that example they're considering the single variable to be a sum of a lot of different factors. Like how you might consider the amount of time it takes to drive to work to be the amount of time it takes to drive from home to point A, from point A to point B, from point B to point C, and from point C to work.

The amount of time it takes me to write this reply can be thought of as the sum of the amount of time it takes to write each word. Sometimes people like to think about their data as derived from other variables in this fashion.
I usually don't think about variables like height this way, although I can see that it makes sense. But I also did not realize that one variable being influenced by many others had anything to do with the central limit theorem. :p
 

noetsi

Fortran must die
#5
The CLT is talking about the sum of many independent random variables. This can be raw data if it results from the additive effects of many small influences or a statistic such as the mean.
Given that nearly everything is the sum of, or is influenced by, many factors, wouldn't this make the CLT apply generally? And clearly many variables have highly non-normal distributions.
 
#6
I had always thought that given, say, 100 cases the central limit theorem always worked. But today I read (in a book about multilevel analysis) that...

I actually did not think the Central Limit Theorem ever applied to raw data. I thought it applied to the distribution of the statistic.
Since when does the central limit theorem not apply to bimodal population distributions?

I am not understanding what the authors of that book are trying to imply. If I were to take samples of size n from a bimodally distributed population, the distribution of the sample mean would most certainly converge to a normal distribution as n approached infinity.
 
#7
I had always thought that given, say, 100 cases the central limit theorem always worked.
The central limit theorem has various forms and "works" (or not) under varying circumstances. For example, there are some cases where the CLT doesn't hold, irrespective of the sample size, as in the case of a standard Cauchy random variable. The determination of a "large enough" sample will also depend on how much the underlying distribution deviates from normality. Sometimes 20-30 observations are enough, and sometimes thousands of observations are needed.
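A small sketch of the Cauchy caveat (my own illustration, not from the book): because the Cauchy distribution has no finite mean or variance, the sample mean of standard Cauchy draws never settles down, no matter how large the sample gets.

```python
import numpy as np

rng = np.random.default_rng(1)

for n in (10, 1_000, 10_000):
    # 1,000 replicated samples of size n from a standard Cauchy
    means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)
    # unlike the normal case, the spread of the sample means does not shrink with n
    print(n, np.percentile(means, [2.5, 97.5]))
```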
 

CowboyBear

Super Moderator
#8
This idea - of some variables being the result of the small additive effects of many other random variables - underlies the assumption of normally distributed measurement error in classical test theory.
 

CowboyBear

Super Moderator
#9
The central limit theorem has various forms and "works" (or not) under varying circumstances. For example, there are some cases where the CLT doesn't hold, irrespective of the sample size, as in the case of a standard Cauchy random variable. The determination of a "large enough" sample will also depend on how much the underlying distribution deviates from normality. Sometimes 20-30 observations are enough, and sometimes thousands of observations are needed.
Yep. I'd add that the idea of a minimum number of observations isn't about making sure the CLT will "work". Take a case where you are calculating the mean of a set of independent random variables. What the CLT says is that as the number of variables you're averaging increases, the sampling distribution of the mean converges towards a normal distribution. When people say you can invoke the CLT with X number of cases, what they mean is that with this number of cases you can be reasonably sure the sampling distribution of the statistic will be approximately normal, due to the CLT. It's not a case of the CLT being invalid with small sample sizes and valid with large ones.
 
#10
Yep. I'd add that the idea of a minimum number of observations isn't about making sure the CLT will "work". Take a case where you are calculating the mean of a set of independent random variables. What the CLT says is that as the number of variables you're averaging increases, the sampling distribution of the mean converges towards a normal distribution. When people say you can invoke the CLT with X number of cases, what they mean is that with this number of cases you can be reasonably sure the sampling distribution of the statistic will be approximately normal, due to the CLT. It's not a case of the CLT being invalid with small sample sizes and valid with large ones.
Definitely good to mention. That's mainly why I used "works" to imply a fast and loose, but likely more common and less correct, interpretation of it. I think people tend to miss the idea that it's not black and white, but rather has to do with the appropriate use of a normal distribution for inferences. In other words, is the sampling distribution you're working with for that fixed sample size reasonably approximated with a normal distribution? If so, we can use some more familiar approaches. If not, we lose some nicer properties and need to look elsewhere for some support. It seems to me like a lot of this gets lost on many people and they boil it down to black-and-white thinking as they've done with the rest of their stats knowledge (if some of it is lost on me, I'm currently unaware, so feel free to point it out!).
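One rough way to act on that in practice (a sketch only, using an invented lognormal population as the skewed example): simulate the sampling distribution of the mean at your actual sample size and see how close it is to normal before leaning on the normal approximation.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)

for n in (10, 30, 1_000):
    # 5,000 simulated samples of size n from a strongly skewed population
    sample_means = rng.lognormal(mean=0.0, sigma=1.5, size=(5_000, n)).mean(axis=1)
    # the closer the skewness of the sample means is to 0, the safer it is
    # to treat the sampling distribution as approximately normal at this n
    print(n, round(skew(sample_means), 2))
```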
 

CowboyBear

Super Moderator
#11
It seems to me like a lot of this gets lost on many people and they boil it down to black-and-white thinking as they've done with the rest of their stats knowledge (if some of it is lost on me, I'm currently unaware, so feel free to point it out!).
Yep - some good "psychology of data analysis" in there!