# Estimating a distribution

#### dryguy

##### New Member
Hi,

Suppose I had some bags containing a large number of black and white marbles. Each bag is well mixed, but I don't know in advance what the distribution of colors is. I'm allowed to take samples and use them to estimate the distribution of marbles in each bag. For example, I might take 10 marbles and get 3 black and 7 white. In that case, I might estimate that the distribution in the bag was 30% black and 70% white.

My question concerns cases in which the sample contains only one color (for example, ten white marbles). This would seem to indicate that the bag has mostly white marbles, but how would I use this information to estimate the distribution in the bag? My intuition tells me that the larger the sample that I get that is all white (say 100 white marbles), the closer my estimate of the distribution should be to 100% white. Conversely, for small samples (say 1 white marble), the less I know about the distribution.

My best (very naive) guess is that I would do well to estimate the fraction of white marbles as n/(n+1), where n is the number of marbles in my all-white sample. This at least has the property of approaching 1 as the sample size increases.
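As a sanity check on that guess, here is a small Monte Carlo sketch. It assumes the bags' unknown proportions are themselves uniformly distributed over [0, 1], which is an extra assumption on my part:

```python
import random

random.seed(42)

def mean_true_fraction_given_all_white(n, trials=100_000):
    """Simulate bags whose white fraction p is uniform on [0, 1];
    among bags that yield an all-white sample of size n, average the true p."""
    kept = []
    for _ in range(trials):
        p = random.random()
        if all(random.random() < p for _ in range(n)):
            kept.append(p)
    return sum(kept) / len(kept)

# Compare the simulated conditional average against my guess n/(n+1)
for n in (1, 10):
    est = mean_true_fraction_given_all_white(n)
    print(n, round(est, 3), n / (n + 1))
```

With this uniform prior the conditional average comes out near (n+1)/(n+2) rather than n/(n+1), but both approach 1 as the sample size grows.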

Does anyone know the solution to this problem (or where I should look for the answer)?

#### JohnM

##### TS Contributor
"My best (very naive) guess, is that I would do well to estimate the fraction of white marbles as n/(n+1), where n is the number of marbles in my all white sample. This at least has the property of approaching 1 as the sample size increases."

Hmmm.....personally I don't know the correct answer here, but be careful about assuming that your sample is automatically representative or closely matches the population....what if the actual population distribution is more like 85% white and 15% black, and you just got "lucky" and sampled all white marbles?
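To put a rough number on that "lucky" scenario (a quick sketch; the 85% figure is just my hypothetical):

```python
# Chance that a sample of 10 from a large, well-mixed bag that is
# 85% white happens to contain only white marbles
p_all_white = 0.85 ** 10
print(round(p_all_white, 3))  # about 0.197, i.e. roughly 1 chance in 5
```

So an all-white sample of 10 from an 85/15 bag is far from rare.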

#### dryguy

##### New Member
estimate for small samples

Hmmm.....personally I don't know the correct answer here, but be careful about assuming that your sample is automatically representative or closely matches the population....what if the actual population distribution is more like 85% white and 15% black, and you just got "lucky" and sampled all white marbles?
That's my concern. The proportion of white and black marbles in the bag is unknown. It could be 100% white, 0% white, or any value in between. The problem is to estimate the proportion based on a sample. I would expect a formula such as w/(w+b) to give a good estimate of the proportion of white marbles in the bag, where w and b are the number of white and black marbles in the sample.

In some cases, you might expect the sample to contain only white marbles; for example, if the bag contains only white marbles, or a large fraction of white marbles. If the bag contained a significant fraction of black marbles, the odds of drawing only white marbles should be low for large samples, but somewhat more common for small samples.

The part that puzzles me is that using the formula w/(w+b) to calculate the estimate gives 100% white for any sample containing only white marbles. This seems to be a reasonable conclusion for a sufficiently large sample, but not for small samples. If you have drawn only 1 white marble, isn't an estimate of 100% white a bit optimistic? I can't help thinking there must be a better equation to use for small sample sizes to reflect the greater degree of uncertainty.

#### ssd

##### New Member
Let the proportion of white balls in a bag be p (fixed).
Now you draw a ball from the bag and call that a success if it is white and a failure otherwise. That is a Bernoulli trial with probability of success = p.
Perform the trial n times, each time replacing the drawn ball in the bag. Let the number of white balls drawn be f. Then f ~ Bin(n, p). Therefore f/n is an unbiased estimator of p, and f/n --> p for large n.
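A quick simulation sketch of this convergence (the value p = 0.3 is an arbitrary choice for illustration):

```python
import random

random.seed(1)
p_true = 0.3  # assumed true proportion of white balls

def sample_fraction(n, p):
    """Draw n balls with replacement; return the sample fraction of whites."""
    return sum(random.random() < p for _ in range(n)) / n

# The sample fraction f/n wanders close to p as n grows
for n in (10, 1000, 100_000):
    print(n, sample_fraction(n, p_true))
```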

#### dryguy

##### New Member
Thanks for the reply. I'm interested in the case where n is small and without replacement.

#### ssd

##### New Member
Thanks for the reply. I'm interested in the case where n is small and without replacement.
The distribution of the number of white balls in a single draw of n balls is exactly Hypergeometric.
You mentioned that the total number of white and black balls in the bag is large and n is small. This is exactly the case in which Hypergeometric probabilities converge to Binomial probabilities. Therefore w/(w+b) is the unbiased estimator of the population proportion, where w and b are the numbers of white and black balls, respectively, in the sample of n balls.

The fact that f/n --> p for large n holds the key to your problem. Smallness of n does not mean n = 1; here it means n/N is negligible for all practical purposes, where N is the total number of balls in the bag. You cannot draw an inference from a sample of size 1 in this case. Think of a perfectly unbiased coin with probability of heads = 0.5. In a single toss you are bound to get either heads or tails. Whatever you get, can you infer that the coin is completely biased just because the sample proportion is 1? If you cannot, what is the reason?
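This convergence can be checked numerically. A sketch, with an assumed bag of N = 10,000 balls of which 30% are white, and a sample of n = 10:

```python
from math import comb

N, K, n = 10_000, 3_000, 10   # bag size, whites in bag, sample size
p = K / N

def hyper_pmf(k):
    """P(k whites in a sample of n drawn without replacement)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k):
    """Binomial approximation: P(k whites if drawing with replacement)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# When n << N the two distributions nearly coincide
max_diff = max(abs(hyper_pmf(k) - binom_pmf(k)) for k in range(n + 1))
print(max_diff)
```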


#### dryguy

##### New Member
n=2

You cannot infer on the basis of sample of size 1 in this case. Think of a perfectly unbiased coin with probability of getting head = 0.5. In a single toss of the coin you are bound to get one of either head or tail. Whatever you get, can you infer that the coin is perfectly biased since the sample proportion is 1? If you cannot infer what is the reason for that?
I'm not sure what you are getting at, but in any event, the case for n=2 is no more comforting. Suppose I draw 2 white marbles. w/(w+b) = 1. I know more about the proportion of N than I did when n was equal to 0 or 1, yet making the estimate that the fraction of white marbles is 1 seems unrealistically optimistic. Since making my first post above, I have found error formulas such as erf*sqrt(2p(1-p)/n) (see: http://mathworld.wolfram.com/SampleProportion.html), but in the present example, (1-p)=0, implying that the uncertainty in p = w/(w+b) is zero. Shouldn't my uncertainty be large for n=2 and get smaller as n increases?

#### ssd

##### New Member
I'm not sure what you are getting at, but in any event, the case for n=2 is no more comforting. Suppose I draw 2 white marbles. w/(w+b) = 1. I know more about the proportion of N than I did when n was equal to 0 or 1, yet making the estimate that the fraction of white marbles is 1 seems unrealistically optimistic. Since making my first post above, I have found error formulas such as erf*sqrt(2p(1-p)/n) (see: http://mathworld.wolfram.com/SampleProportion.html), but in the present example, (1-p)=0, implying that the uncertainty in p = w/(w+b) is zero. Shouldn't my uncertainty be large for n=2 and get smaller as n increases?

The standard error (SE) of f/n is sqrt[p(1-p)/n]. As n --> infinity, the SE tends to 0.
This is a desirable property of any estimator. Note: the SE is nothing but the standard deviation of the estimator, which tells you how far the values of the estimator are dispersed from sample to sample on average. For p = 0.5 these are 0.5, 0.25, 0.1666... for n = 1, 4, 9, etc.
Note that unbiasedness and minimum variance are the two "most" desirable properties of an estimator.

Now to your uncertainty part: I hope you know the difference between random and deterministic experiments. The nature of a random experiment is such that you cannot predict its result beforehand with certainty. That is why the concept of probability emerges and leads on to inferential theories. In your question, if (1-p) = 0, what happens to the population? It is full of balls of a single colour. Then drawing a ball from it is no longer a "random experiment". However small or however large a sample you draw, it always repeats exactly the same result, with the variance (or, for that matter, the SE) of the estimator f/n equal to 0 irrespective of your sample size.

Suppose you have 999 white balls and 1 black ball in the bag and you draw a sample of size 10 from it. It is most likely that you get all white balls in the sample. Then your estimate of p from this particular sample is 1...yes, 1. That is how the statistician's philosophy is built..... because you have used an estimator which converges in probability to the parameter. Note that even estimates from the best estimators are not in general exactly equal to the parameters (when they are, that is by chance), but they are built so that their errors are small. In my example, too, the error is |0.999 - 1| = 0.001.
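For the 999-white/1-black example, the chance of an all-white sample of 10 (drawn without replacement) can be computed exactly; it works out to C(999,10)/C(1000,10) = 990/1000:

```python
from math import comb

# Probability that a sample of 10 from a bag of 999 white + 1 black
# ball (drawn without replacement) contains no black ball
p_all_white = comb(999, 10) / comb(1000, 10)
print(p_all_white)  # 0.99
```

So 99% of all possible samples of size 10 report an estimate of exactly 1 for a true p of 0.999.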

If you have any further questions, do not hesitate to post.... but please be specific in describing the question...... I shall try to answer.


#### dryguy

##### New Member
Se=0

The standard error (SE) of f/n=sqrt[p(1-p)/n]. As n-->infinity, the SE tends to 0.
If the sample fraction is equal to either 1 or 0, then by that equation, SE = 0, regardless of sample size.

Suppose you have 999 white balls and 1 black ball in the bag and you are drawing a sample of size 10 at a time from it. It is most likely that you get all white balls in the sample.
Precisely the type of case that bothers me. If I draw 10 white, I would be less confident in my estimate that the fraction of white is 1 than if I drew 998 white. But, by the SE equation, SE = 0 in both cases, because the term (1-p) = 0. Shouldn't SE be large when n is small?

If you have any further question donot hesitate to post.... but please be specific in describing the question...... I shall try to answer.
Thank you!

#### ssd

##### New Member
If the sample fraction is equal to either 1 or 0, then by that equation, SE = 0, regardless of sample size.
No, you are wrong: please note that f/n is the sample fraction and p is the population fraction. SE = 0 if p = 0 or p = 1. A particular value of the sample fraction can be 1 or 0, but that does not make SE = 0. Note that if the total number of (distinguishable) balls in the bag is N, then there are NCn possible samples of size n.

SE^2 of f/n is the variance of f/n. That is, p(1-p)/n is (up to a finite-population correction, negligible when n is much smaller than N) the variance of all the numerical (observed) values of f/n obtained from the NCn possible samples of size n. Only if all of these possible values of f/n are equal is SE = 0 (which, for your problem, tacitly implies p = 1 or 0).
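Under the earlier with-replacement setup, the identity SE^2 = p(1-p)/n can be verified by summing over all possible values of f (a sketch with arbitrary values p = 0.4, n = 3):

```python
from math import comb

p, n = 0.4, 3  # assumed population fraction and sample size

def binom_pmf(f):
    """P(f whites in n with-replacement draws)."""
    return comb(n, f) * p**f * (1 - p) ** (n - f)

# Variance of the sample fraction f/n over all possible samples
var = sum(binom_pmf(f) * (f / n - p) ** 2 for f in range(n + 1))
print(var, p * (1 - p) / n)  # both are 0.08, up to float rounding
```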


#### dryguy

##### New Member
Thanks for clarifying the underlying meaning of the SE equation.

In the time since my first post nearly a month ago, I have refined my thinking about this problem through reading and discussion. Our most recent discussion has convinced me to try and formulate a better statement of the problem. If anything strikes you as poorly defined, please let me know and I will try again.

You are the royal statistician for the King of a small country. Recently, the King’s spies captured a machine from the enemy that manufactures orange and green marbles, along with an instruction manual. The instructions say that the machine can only produce one marble per day, and that the color of the marble will be determined at random. The manual further states that a factory preset determines the fraction of orange marbles that the machine will produce on average. Unfortunately, the spies could not find any evidence to indicate the value of the factory-preset value for the machine they captured.

Since the nation’s GNMO (Gross National Marble Output) is critical to national security, the King has asked you to provide a daily report stating the current best estimate of the factory preset and to give an indication of the uncertainty of the estimate. The first report is due immediately, before the machine has produced any marbles. What do you tell the King? After 1 day, the machine produces an orange marble. How does your report change? How does it change the next day after another orange marble is produced? What do you report on day 1000, by which time the sample consists of 1000 orange marbles and 0 green marbles? What equations are you using to generate the answers?

I’ll attempt to answer. In the first report, I would state that no estimate is possible because no data is available on which to base an estimate. The uncertainty is 100%.

In the second report, I have a sample of 1 orange marble, and I would state that the best estimate of the factory preset is 1 (f = x/n = 1/1 = 1), but that the uncertainty of the estimate is large. What I currently can’t figure out is how to express the uncertainty numerically.

In the third report, I have a sample of 2 orange marbles, and I would still give an estimate of 1 for the factory preset (f = x/n = 2/2 = 1), but now, my uncertainty is slightly lower (but how much lower)?

In the 1001st report, I have a sample of 1000 orange marbles. I am now very confident that the factory preset is 1 or very close to 1 (f = x/n = 1000/1000 = 1). My uncertainty is far lower than it was in the first two reports, but by how much?

#### ssd

##### New Member
Let p be the population proportion of orange balls. The estimates of p (= f/n) which you have stated are correct; f = the number of orange balls when a total of n balls has been produced. Before I move to the main answer, I shall say that when no balls have been produced one cannot have an estimate, and hence no question of uncertainty arises (not that the uncertainty is 100%). You specified nothing as your estimate..... so the uncertainty of 'nothing' is 'nothing'. Your report shall simply be: estimation is not possible.

Now I shall talk of the confidence you have about your decision after the nth ball is produced:
From the proof of Bernoulli's theorem we have
P[|f/n-p|>c] <= 1/(4nc^2), where c is any arbitrary positive number <1, (it can be very small).

or, P[|f/n - p| <= c] >= 1 - 1/(4nc^2) = a, say.
The quantity a = 1 - 1/(4nc^2) serves as the statistician's confidence after the nth ball is produced.

Note on the quantity 'a':

1/ Check that 'a' must be a valid probability. Therefore, for any (arbitrary) c for which 1/(4nc^2) > 1, take a = 0.

2/ 'a' is the smallest probability that the statistician's estimate has a maximum absolute error of 'c'. In other words, the probability that the estimate f/n differs from the truth by more than 'c' is less than 1 - a.
Your confidence is that the decision you make holds (with maximum error 'c') with probability more than 'a'.

3/ As n increases, for any fixed 'c', 'a' also increases, and finally a --> 1 as n --> infinity. Simply put: the larger the value of n, the higher the statistician's confidence that his error is no more than 'c'.
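Plugging some numbers into this bound for the King's reports (a sketch; the tolerance c = 0.1 is an arbitrary choice):

```python
def confidence(n, c):
    """Lower bound a = 1 - 1/(4*n*c**2) on P(|f/n - p| <= c), clamped at 0."""
    return max(0.0, 1 - 1 / (4 * n * c**2))

c = 0.1  # assumed tolerable error in the estimate
for n in (1, 2, 100, 1000):
    print(n, confidence(n, c))
```

With c = 0.1 the bound is vacuous for the first couple of reports (a = 0) and only becomes informative once n reaches the hundreds (a = 0.75 at n = 100, a = 0.975 at n = 1000), matching dryguy's intuition that small all-orange samples carry little certainty.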


#### johafre

##### New Member
Hi,
I have a problem which is similar to the problem discussed in this thread.

Say that instead of only black and white marbles, there are marbles of many different colors in the bag. How many is unknown and so is the distribution of the different colors.
I sample X marbles and I want to use this sample to get information about the marbles in the original bag. I would like to estimate how many different colors there are in the bag and I would also like to know the distribution of the different colors - which colors are most common?

By counting the frequencies of different colors in my sample I can estimate the total number of different colors in the bag. (for example Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11, 265-270.)
I can then say that I have a certain coverage, i.e. that I have discovered, say, 65% of the total number of different colors in the bag.
However, can I be sure that the distribution of different colors in my sample is similar to the distribution in the original bag?
If I have mostly red and green marbles in my sample, can I assume that red and green marbles are the most common in the bag as well?
How do I approach this problem?
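For reference, the Chao (1984) lower-bound estimator cited above can be sketched as follows (the classic form S_obs + f1^2/(2*f2), where f1 and f2 are the numbers of colors seen exactly once and exactly twice; the sample counts here are made up):

```python
from collections import Counter

def chao1(color_counts):
    """Chao (1984) lower-bound estimate of the total number of colors.

    color_counts: mapping color -> number of times it appeared in the sample.
    """
    s_obs = len(color_counts)                 # colors actually observed
    freq = Counter(color_counts.values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)   # singletons and doubletons
    if f2 == 0:
        # bias-corrected variant avoids division by zero
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# Hypothetical sample: 5 colors observed, two seen only once, one seen twice
sample = {"red": 5, "green": 4, "blue": 2, "yellow": 1, "purple": 1}
print(chao1(sample))  # 5 + 2*2/(2*1) = 7.0
```

On the ordering question: with a reasonably large simple random sample, the colors most common in the sample are very likely the most common in the bag as well, but the rarest colors may be missed entirely, which is exactly what estimators like this one try to correct for.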