# Thread: Test for multiple means in a dataset

1. ## Test for multiple means in a dataset

In a dataset of arbitrary size what type of test could I apply to say with confidence that there are actually multiple smaller datasets present?

For example, suppose the data come from N means, each with its own standard deviation. How could I determine with confidence what N is?

Does this question make sense?

Thanks

2. ## Re: Test for multiple means in a dataset

there are several methods depending on what mixture of distributions you choose to impose on your data, but my favourite one has always been latent class analysis...

3. ## Re: Test for multiple means in a dataset

Thank you very much for responding. After posting my question yesterday I had been reading about choosing unequal bin widths in histograms to see if that would help me sort my data. Many of the latent class analysis examples seem quite complex. At some level, does latent class analysis essentially do the same thing?

4. ## Re: Test for multiple means in a dataset

Originally Posted by astickel
I had been reading about choosing unequal bin widths in histograms to see if that would help me sort my data.

hello there. before i post a more complete answer to your question i would like to know exactly what you're doing in this case.... so are you basically altering the limits of each class to have different frequency sizes? if this is the case, based on what evidence are you or would you be altering the limits so you get more elements in one class rather than in another one?

5. ## Re: Test for multiple means in a dataset

Just a quick post without thinking too much about this. If you're not wanting to get too complex, you could try any sort of simple cluster analysis. Here's a website with a couple of options:

And an example I just threw together:

Code:

    library(fpc)

    set.seed(1)  # fix the random seed so the simulated data are reproducible
    # A simulated dataset with four overlapping "clusters"
    sim.dat <- c(rgamma(250, 8, 4), rnorm(125, 16, 2), rnorm(200, 28, 3), rnorm(50, 50, 3))
    hist(sim.dat, breaks = 40)
    # Specify a range of hypothesized cluster counts. The wider the range, the longer it takes.
    clusters <- pamk(sim.dat, krange = 2:7)
    clusters  # the chosen number of clusters is stored in clusters$nc

Again, I suggest this without knowing any more information than you originally posted. Others may weigh in about my premature response!

6. ## Re: Test for multiple means in a dataset

well... cluster analysis is a specific instance of latent class analysis... k-means cluster analysis, hierarchical cluster analysis, etc. are all specific instances of the more general latent class analysis method, depending on which parameterisation you assume for your data... that's why i said latent class analysis first

7. ## Re: Test for multiple means in a dataset

Thank you again.

The following are a few of my data points, which in this example clearly don't overlap, so this should be easy. What I am hoping to have is an algorithm which, like you say, will determine how many clusters are most likely in the dataset (in this case two). I would also like the algorithm to calculate the mean value of each of these clusters. Most likely there will be between 1 and 4 clusters in my data, if that helps too. Thank you both again for your thoughts!

4.801472667
3.473225533
-0.425527926
4.759339301
3.993423134
26.24527325
27.81263542
27.68116643
27.4480586
27.17240416
26.66044742
27.8599481
28.69223408
28.08911254
29.03902105
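
A minimal sketch of how the values above could be fed to the `pamk` function from the earlier post, assuming the fpc package is installed. The variable names (`x`, `fit`, `groups`) are only illustrative, and the range 2:4 is an assumption based on the "between 1 and 4 clusters" remark. As far as I understand the fpc documentation, 1 can also be included in `krange`, in which case a Duda-Hart test decides whether a single cluster is enough.

```r
library(fpc)

# The data points posted above
x <- c(4.801472667, 3.473225533, -0.425527926, 4.759339301, 3.993423134,
       26.24527325, 27.81263542, 27.68116643, 27.4480586, 27.17240416,
       26.66044742, 27.8599481, 28.69223408, 28.08911254, 29.03902105)

# Compare 2, 3 and 4 clusters by average silhouette width (pamk's default criterion)
fit <- pamk(x, krange = 2:4)

fit$nc                                # estimated number of clusters
groups <- fit$pamobject$clustering    # cluster label for each data point

centers <- tapply(x, groups, mean)    # mean of each cluster
spreads <- tapply(x, groups, sd)      # standard deviation of each cluster
centers
spreads
```

Note that `pamk` itself reports medoids (actual data points), so the per-cluster means and standard deviations are computed separately here from the cluster labels.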

8. ## Re: Test for multiple means in a dataset

I installed R with the fpc package and executed the sample code from jpkelley, thanks for that. Yes, it looks like this is what I am looking for. I will have to do some reading to make sure I understand it. Thank you both again for your help!

9. ## Re: Test for multiple means in a dataset

ha... i was about to suggest using jpkelly's code, which (s)he was so kind to share with us, because you're right, latent class analysis can get very tricky and sometimes cluster analysis does just fine... anyways, if you ever need to look at stuff like the probability of belonging to one group or another you can use the lca command in the e1071 package or the flexmix package, and it works similarly to the example posted... give the function a set of data and tell it a range of groups for it to find and it'll find the best number of groups that account for the most patterns of differences in your data...
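
A rough sketch of the membership-probability idea using flexmix, assuming that package is installed. The data here are simulated stand-ins rather than the poster's, and the seed, group sizes, and variable names are all assumptions made for illustration. flexmix is used rather than `lca` because, as far as I know, e1071's `lca` expects binary variables.

```r
library(flexmix)

set.seed(42)  # assumption: arbitrary seed, only for reproducibility
# Simulated stand-in data: two well-separated normal groups
d <- data.frame(x = c(rnorm(100, mean = 4, sd = 1),
                      rnorm(150, mean = 27, sd = 1)))

# Fit intercept-only Gaussian mixtures with 1 to 4 components,
# using 3 random restarts for each number of components
fits <- stepFlexmix(x ~ 1, data = d, k = 1:4, nrep = 3, verbose = FALSE)

# Keep the fitted model with the lowest BIC
best <- getModel(fits, which = "BIC")

parameters(best)                # per-component mean (intercept) and sigma
post <- posterior(best)$scaled  # one row per observation, one column per component
head(round(post, 3))            # probability of belonging to each group
table(clusters(best))           # hard assignment: most probable group per point
```

Each row of `post` sums to 1, so it can be read directly as the probability that an observation belongs to each group.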

10. ## Re: Test for multiple means in a dataset

Whew, I'm glad it was useful. I wasn't sure if it would be. Once you understand how the partitioning-around-medoids (k-medoids) analysis works to find the medoids, I think you'll be quite pleased with it. Again, it's simple...that might be good or bad for your purposes.
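
For what it's worth, a medoid is just the observed point with the smallest total distance to the other points in its cluster, so it is always one of the actual data values. A small base-R sketch, using the five low values from the data posted earlier in the thread (the variable names are only illustrative):

```r
# The five low values from the posted data, treated as one cluster
x <- c(4.801472667, 3.473225533, -0.425527926, 4.759339301, 3.993423134)

# Pairwise absolute distances between all points
d <- as.matrix(dist(x))

# The medoid is the point whose summed distance to all others is smallest
medoid <- x[which.min(rowSums(d))]

medoid    # an actual data point
mean(x)   # the mean, pulled toward the low value -0.43
```

Because the medoid has to be an observed value, it is less sensitive to a stray point than the mean is.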

(I'm realizing I should have used a different user name to reduce ambiguity...and to sound cooler. I'll introduce myself in the new user thread soon.)
