Sampling from N data points based on a given probability distribution

I have N data points, and each data points is associated with a probability. The sum of the N probabilities equals to one. I need to sample these N data points based on the given probability vector. Are there any algorithms that can allow me to do this kind of sampling?


TS Contributor

Suppose you want to simulate a discrete distribution with support \( \{x_1, x_2, \ldots, x_N\} \) (can be countably infinitely many) with probabilities \( p_i, i = 1, 2, \ldots, N \) respectively.

You can use the inverse transform sampling as usual (now the CDF is just a step function with jumps):

1. Generate \( U \sim \text{Uniform}(0,1) \)

2. Assign \( \text{Sum} = 0, i = 0 \)
Repeat \( i = i + 1, \text{Sum} = \text{Sum} + p_i \) Until \( \text{Sum} > U \)

3. Output \( x_i \)

Note that if \( p_i \) is sorted in descending order, then this algorithm will be most efficient. Some well known discrete distributions have more efficient algorithm but the above algorithm is applicable to all discrete distributions.