# Comparing three sets

#### statQuery

##### New Member
Dear all,

I have three sets, A (495 elements), B (1130 elements) and C (812 elements).

The elements are biological entities bound by proteins, and for each set I have the count of elements occupied by 1, 2, 3, ... proteins.

For instance, I have 226 elements bound by 1 protein in set A vs 258 in Set B...

In Set C, we see more elements bound by many proteins (e.g. 15 elements bound by 11 proteins) than in either Set A or Set B.

My question: I need a statistical test to see whether Set C has significantly more elements bound by several proteins than Set B. What I would like to have is a p-value for each pair of sets, telling me whether the count distributions differ significantly between the sets.

Any help would be appreciated,

Best.

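Set A

| Proteins bound | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Elements | 226 | 143 | 75 | 31 | 12 | 4 | 3 | 1 |

Set B

| Proteins bound | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Elements | 258 | 205 | 181 | 152 | 113 | 77 | 63 | 32 | 30 | 6 | 9 | 2 | 2 |

Set C

| Proteins bound | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Elements | 142 | 168 | 99 | 80 | 73 | 80 | 43 | 44 | 25 | 28 | 15 | 8 | 4 | 3 |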

#### GretaGarbo

##### Human
At first I thought of using the Poisson distribution, conditional on the value being larger than zero. (Just divide the probability mass function by the probability of a value larger than zero, i.e. by 1 − P(0).) Then do a likelihood ratio test between set A and set B, and so on for the other pairs.
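A minimal sketch of that idea, for sets A and B only. It assumes a zero-truncated Poisson with P(X = k) = dpois(k, λ) / (1 − exp(−λ)) for k ≥ 1, fits λ by maximum likelihood with `optimize()`, and compares two separate rates against one pooled rate. The helper names `ztp.loglik` and `ztp.fit` and the search interval are made up for illustration, and the chi-squared reference distribution with 1 degree of freedom is the usual large-sample approximation. Whether a Poisson shape fits these counts at all is not checked here.

```r
# Frequency tables for sets A and B (values = proteins bound, freq = elements)
a.values <- 1:8
a.freq   <- c(226, 143, 75, 31, 12, 4, 3, 1)
b.values <- 1:13
b.freq   <- c(258, 205, 181, 152, 113, 77, 63, 32, 30, 6, 9, 2, 2)

# Log-likelihood of a zero-truncated Poisson, computed from a frequency table
ztp.loglik <- function(lambda, values, freq) {
  sum(freq * (dpois(values, lambda, log = TRUE) - log1p(-exp(-lambda))))
}

# Maximum-likelihood fit of lambda (the search interval is an assumption)
ztp.fit <- function(values, freq) {
  opt <- optimize(ztp.loglik, interval = c(0.01, 30),
                  values = values, freq = freq,
                  maximum = TRUE, tol = 1e-8)
  list(lambda = opt$maximum, loglik = opt$objective)
}

fit.a      <- ztp.fit(a.values, a.freq)
fit.b      <- ztp.fit(b.values, b.freq)
fit.pooled <- ztp.fit(c(a.values, b.values), c(a.freq, b.freq))

# Likelihood ratio statistic: separate rates versus one common rate
lrt.stat <- 2 * (fit.a$loglik + fit.b$loglik - fit.pooled$loglik)
p.value  <- pchisq(lrt.stat, df = 1, lower.tail = FALSE)

c(lambda.a = fit.a$lambda, lambda.b = fit.b$lambda,
  lrt = lrt.stat, p = p.value)
```

The same three steps (fit each set, fit the pooled data, compare log-likelihoods) would apply to the A vs C and B vs C pairs.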

But why make it complicated? The sample sizes are large, so the sample means will be approximately normal by the central limit theorem. An ordinary z-test can be used (it will give essentially the same result as a t-test here, because with samples this large the degrees of freedom are very large).
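As a sketch of that z-test for sets B and C, the means and variances can be computed straight from the frequency tables. The helper `freq.summary` is a made-up name, and the unequal-variance standard error and two-sided normal p-value are my assumptions about which version of the test is meant.

```r
# Frequency tables for sets B and C (values = proteins bound, freq = elements)
b.values <- 1:13
b.freq   <- c(258, 205, 181, 152, 113, 77, 63, 32, 30, 6, 9, 2, 2)
c.values <- 1:14
c.freq   <- c(142, 168, 99, 80, 73, 80, 43, 44, 25, 28, 15, 8, 4, 3)

# Sample size, mean and (n - 1) variance from a frequency table
freq.summary <- function(values, freq) {
  n <- sum(freq)
  m <- sum(values * freq) / n
  v <- sum(freq * (values - m)^2) / (n - 1)
  list(n = n, mean = m, var = v)
}

sb <- freq.summary(b.values, b.freq)
sc <- freq.summary(c.values, c.freq)

# z statistic for the difference in means, allowing unequal variances
z <- (sc$mean - sb$mean) / sqrt(sb$var / sb$n + sc$var / sc$n)
p.two.sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(mean.b = sb$mean, mean.c = sc$mean, z = z, p = p.two.sided)
```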

And of course it will be statistically significant.

(My main problem was going from frequencies to values)

Code:
# This is an R program. Download it and run it!

set.a.values <- c(1, 2, 3, 4, 5, 6, 7, 8)

set.a.freq <- c(226, 143, 75, 31, 12, 4, 3, 1)

set.b.values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

set.b.freq <- c(258, 205, 181, 152, 113, 77, 63, 32, 30, 6, 9, 2, 2)

set.c.values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

set.c.freq <- c(142, 168, 99, 80, 73, 80, 43, 44, 25, 28, 15, 8, 4, 3)

set.c.values
set.c.freq

set.a <- rep(set.a.values, set.a.freq)  # expand counts into one observation per element

set.b <- rep(set.b.values, set.b.freq)

set.c <- rep(set.c.values, set.c.freq)
# set.c

table(set.a)
table(set.b)
table(set.c)

mean(set.a)
#[1] 1.967677

mean(set.b)
#[1] 3.559292

mean(set.c)
#[1] 4.252463

hist(set.a)
hist(set.b)
hist(set.c)

t.test(set.a, set.b, var.equal = FALSE)  # Welch t-test (unequal variances), A vs B
t.test(set.a, set.c, var.equal = FALSE)  # A vs C
t.test(set.b, set.c, var.equal = FALSE)  # B vs C
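To get one p-value per pair, as the question asks, the `p.value` component of each `t.test()` result can be collected. The sketch below rebuilds the three samples so that it runs on its own. The one-sided call for "C larger than B" is my reading of the question, not something the poster has confirmed.

```r
# Rebuild the samples: one observation per element
set.a <- rep(1:8,  c(226, 143, 75, 31, 12, 4, 3, 1))
set.b <- rep(1:13, c(258, 205, 181, 152, 113, 77, 63, 32, 30, 6, 9, 2, 2))
set.c <- rep(1:14, c(142, 168, 99, 80, 73, 80, 43, 44, 25, 28, 15, 8, 4, 3))

# Two-sided Welch tests, one p-value per pair of sets
p.values <- c(
  A.vs.B = t.test(set.a, set.b, var.equal = FALSE)$p.value,
  A.vs.C = t.test(set.a, set.c, var.equal = FALSE)$p.value,
  B.vs.C = t.test(set.b, set.c, var.equal = FALSE)$p.value
)
p.values

# One-sided version: is the mean of set C greater than the mean of set B?
p.c.greater <- t.test(set.c, set.b, alternative = "greater",
                      var.equal = FALSE)$p.value
p.c.greater
```

Note that these tests compare means only; they do not compare the full shape of the count distributions.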

But I am not sure I understand the problem. What are the "entities"?

Suppose the elements correspond to people who are asked how many coins they have in their pockets. Then in group A there would be 226 people with just one coin, 143 people with two coins, and so on.

Are these data statistically independent? Is there any pseudoreplication in this? I could have misunderstood the situation completely.