# Determining how similar different sets of data are

#### jkaupanger

##### New Member
I hope this is in the right spot:

I have a dataset of ~15,000 unique, nominal, categorical members. From that dataset, I have ~3,000 samples, each of which is a subset of the overall dataset containing 100 unique members. A member's inclusion in a sample is binary: it is either fully included or fully excluded. Samples may overlap; it is even possible for multiple samples to be identical, although that is somewhat unlikely. Each sample is constructed non-randomly; in other words, each member is specifically chosen by the individual making the sample. I also know how many samples each individual member appears in (member1 is in 2,500 samples, member50 is in 100 samples, etc.).
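A minimal sketch of the setup described above, assuming each sample is stored as a Python set of integer member IDs. The toy data and the `jaccard` helper are hypothetical illustrations; the Jaccard index is only one common way to score overlap between two sets, not something the data dictates.

```python
from collections import Counter

# Hypothetical toy data: each sample is a set of member IDs.
# The real data would have ~3,000 samples of 100 members each.
samples = [
    {1, 2, 5},
    {1, 5, 50},
    {1, 2, 5},
    {2, 50, 2000},
]

# Number of samples each member appears in.
member_counts = Counter(member for sample in samples for member in sample)


def jaccard(a, b):
    """Overlap between two samples: |a & b| / |a | b| (1.0 means identical)."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


print(member_counts[1])                  # samples containing member 1
print(jaccard(samples[0], samples[1]))   # partial overlap
print(jaccard(samples[0], samples[2]))   # identical samples
```

Because every real sample has exactly 100 members, the raw intersection size would rank pairs of samples in the same order as the Jaccard index.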

What I'm trying to do is find a quantitative way to measure how similar different samples are to each other. More specifically, I'm trying to identify tendencies across samples. For example: if a sample includes member5020, what's the likelihood that it also includes member15000? If a sample includes members 1, 2, 5, and 2000, what's the likelihood that it also includes member2001?
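The "if a sample includes X, how likely is Y" questions above can be read as conditional relative frequencies: among the samples that contain every member of X, the fraction that also contain Y. A minimal sketch under that reading, with hypothetical toy data and a hypothetical helper name `conditional_frequency`:

```python
def conditional_frequency(samples, given, target):
    """Fraction of samples containing all of `given` that also contain `target`.

    Returns None when no sample contains all of `given`.
    """
    given = set(given)
    matching = [s for s in samples if given <= s]  # samples that include all of `given`
    if not matching:
        return None
    return sum(target in s for s in matching) / len(matching)


# Hypothetical toy data; the real samples would each hold 100 member IDs.
samples = [
    {1, 2, 5, 2000, 2001},
    {1, 2, 5, 2000},
    {1, 2, 5, 2000, 2001},
    {1, 5020, 15000},
    {2, 5020},
]

print(conditional_frequency(samples, {5020}, 15000))
print(conditional_frequency(samples, {1, 2, 5, 2000}, 2001))
```

This is an observed frequency, not a model: the more members are placed in `given`, the fewer samples match, so the estimate gets noisier.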

Which measure(s) of association is/are well-suited for analyzing the individual members of subsets? In my research, the one measure of association that keeps coming up is chi-squared based on a contingency table, but that seems better suited to comparing subsets as a whole than to looking at individual members within subsets.
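For reference, a contingency table can also be built for a single pair of members by counting how many samples contain both, only one, or neither. The sketch below assumes that framing and computes the plain Pearson chi-squared statistic for the resulting 2x2 table (no continuity correction). The toy data and function names are hypothetical, and whether this is the right measure for the problem remains the open question.

```python
def contingency_2x2(samples, a, b):
    """Counts of samples with (both a and b, only a, only b, neither)."""
    both = sum(a in s and b in s for s in samples)
    only_a = sum(a in s and b not in s for s in samples)
    only_b = sum(a not in s and b in s for s in samples)
    neither = len(samples) - both - only_a - only_b
    return both, only_a, only_b, neither


def chi_squared_2x2(both, only_a, only_b, neither):
    """Pearson chi-squared statistic for a 2x2 table, without continuity correction."""
    n = both + only_a + only_b + neither
    denom = ((both + only_a) * (only_b + neither)
             * (both + only_b) * (only_a + neither))
    if denom == 0:
        return None  # a row or column total is zero, so the statistic is undefined
    return n * (both * neither - only_a * only_b) ** 2 / denom


# Hypothetical toy data.
samples = [
    {1, 2, 5, 2000, 2001},
    {1, 2, 5, 2000},
    {1, 2, 5, 2000, 2001},
    {1, 5020, 15000},
    {2, 5020},
]

table = contingency_2x2(samples, 5020, 15000)
print(table)
print(chi_squared_2x2(*table))
```

With counts as small as in this toy table the statistic is not reliable; it is shown only to make the per-pair table concrete.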

Any advice/suggestions would be fantastic. The other difficulty is my unfamiliarity with the subject: I've taken three semesters of calculus and one semester of differential equations, but a lot of the language and notation used on statistics websites is too much for me to noodle through; it's possible I've stumbled on the right method(s) but I wasn't able to understand what I was reading.