View Full Version : How to calculate the expected overlab between sets

05-11-2010, 02:05 PM

I wish to do the following:

I have two vectors with N elements in each. All elements are set to zero.

Now, in each vector I select n indices at random and set the corresponding elements to 1.

My goal is to calculate the expected overlap between the vectors as a function of n. That is, how many elements have 1 in both vectors.

Initially I thought about using the binomial distribution with probability p=(n/N)*(n/N), but clearly this is incorrect, since we have the constraint that only n elements should have the value 1. Therefore we have not n independent trails. I also cannot apply the hyper-geometric distribution directly, since we consider two random variables.

Any thoughts? :confused:

05-11-2010, 02:45 PM
Without loss of generality, we consider one of the vector first,
which have n 1s and (N - n) 0s. We label the indexes for these n 1s,
and try to count how many 1s in the second vector fall into these n indexes.

Let K be the number of overlapping indexes.
We can choose k 1s to put in those n indexes and put the remaining n - k
1s into the other N - n indexes. So

Pr\{K = k\} = \frac{\tbinom {n}{k} \tbinom {N-n}{n-k}}{\tbinom {N}{n}},
k = 0, 1, ..., n

i.e. K has a hypergeometric distribution,
with E[K] = n\frac{n}{N} = \frac{n^2}{N}