Sampling with replacement - probability problem

#1
Hi all,

(this is my first post, hopefully I am not breaking any rules).

I have the following problem that I cannot solve:

I have a set of 'n' different elements. I sample 'k' elements at random, then I put them back in. Then I sample again 'k' elements at random, and I put them back in. And so on. I repeat the operation 'r' times. What is the probability that I picked each of the 'n' elements at least once?
I just need a formula that gives the probability as a function of 'n', 'k' and 'r'. If you can prove it, it's a welcome bonus.
Thanks.
 

hlsmith

Omega Contributor
#2
So to dumb it down, you have 10 numbered balls, you grab a sample less then the total number of balls repeatedly. After you select a single ball, you are not returning it to the bag until after the sample is full at k, making it ineligible to be in a single sample more than once?


So you can't have 1,2,2,2,4,10, since 2 was not returned during that sample selection.
 

rogojel

TS Contributor
#4
Hi,
the sort of brute force approach would be to imagine n slots where we are randomly putting a total of kr stones into them. Each slot can have between 0 and kr stones, the only constraint being that the sum of stones across all n slots must always be kr.

So the probability of having each slot occupied would be (number of ways kr can be particioned into n integers, without any of them being 0)/ (number of ways kr can be partitioned into n integers with some possibly being 0)

Partitioning an integer into a given number of sub-sums is a pretty heavy combinatorial problem if I recall correctly

On thinking this over - the probability can be expressed in a formula like:

(p(kr,n) - Sum(p(kr, n-i))/p(kr,n) which is

1-Sum_i(p(kr, n-i)/p(kr,n)

where p(N,r) is the nzmber of ways N can be partitioned into r terms. https://en.m.wikipedia.org/wiki/Partition_(number_theory)#Intermediate_function


Regards
 

BGM

TS Contributor
#5
See if I correctly understand the model.

The probability that a particular group of [math] m [/math] elements are not selected in a particular operation is

[math] p_m = \frac {\displaystyle \binom {n-m} {k} \binom {m} {0}} {\displaystyle \binom {n} {k}} = \frac {(n - m)!(n-k)!} {n!(n-m-k)!}
= \prod_{j=1}^m \frac {n-m-k+j} {n-m+j} [/math] [math]
= \prod_{j=1}^m \left(1 - \frac {k} {n-m+j}\right)[/math]

So if they are not selected in all [math] r [/math] operations, then the probability will be [math] p_m^r [/math]

As what Englund hinted, let [math] A_i [/math] be the event that the [math] i [/math]-th element is selected at least once out of the [math] r [/math] operations, [math] i = 1, 2, \ldots, n [/math]. The required probability is

[math] P\left(\bigcap_{i=1}^n A_i\right) [/math]

[math] = 1 - P\left(\bigcup_{i=1}^n A_i^c\right) [/math]

[math] = 1 - \sum_{m=1}^{n-k}(-1)^{m-1}\binom {n} {m} p_m^r [/math]

where the second inequality is the de Morgan's law and the last equality is given by inclusion-exclusion principle.
 
#6
Hi all,

thank you for your replies, very much appreciated.
I'll give you some background to this problem, to see if we got it right.

There is a popular blog (in Italy), which posts satire jokes about the news. It is crowd sourcing: users send jokes through a forum, and the supposedly best ones are periodically posted on the main page of the blog. Normally a post consists of 25-30 jokes, but sometimes a piece of news is so popular that more jokes that that are selected. This would make the post bulky, and the reader would have to keep on scrolling down.
So the blog's administrator thought he had a 'brilliant' idea: in those cases, he selects say 40-50 jokes (n) but he only posts 8 of them (k), randomly selected. Every time the reader refreshes the page, 8 jokes are again selected at random among those 40-50. So I think this is a sampling with replacement situation.
I'd like to show that administrator that his 'brilliant' idea is actually a dumb one, and that if he used a sampling without replacement algorithm, everyone would be happier.
But to do that I need the formula I asked for.
I'll try BGM's solution to see how many times (r) one has to refresh the page to have a 95% probability (p) of having read all the jokes at least once. I guess the F5 key will be so worn-out that you can see New Zealand through it.

Thanks again!
 

rogojel

TS Contributor
#7
hi,
this is then a rather simple proposition:you only need to simulate the selection a few
thousand times :) and show him the results. In my opinion that would be by far more convincing then any formula - and much easier to do.

regards