Suppose I have a random variable X that follows a standard normal distribution. Now suppose I take 1e6 independent trials from this distribution, and define a new random variable Y equal to the peak-to-peak value of that population. What is the pdf of Y?

I was thinking to use the binomial distribution. If p is \(P(x > x_o) \) for a single trial, then the probability of exactly one trial out of n yielding \(x > x_o\) is:

\({{n} \choose {k}} p^k (1-p)^{n-k} = n p (1-p)^{n-1} \)

(This would give the peak value. By symmetry, the peak-to-peak value is twice this number.)

For an example population of 1e6, this gives a normal-like (but asymmetrical) distribution centered around 9.5, which seems reasonable since for a single trial, P(x > 4.75) = 1e-6. But I did a simulation with MATLAB and the empirical results for 10000 trials of Y give a narrower distribution than my computed values, and with a different mean.

I've attached a graph of my empirical (blue) vs. analytical (red) results for the peak-to-peak value when n = 1e6. The vertical scale is obviously the histogram counts for the simulation; the red curve has been scaled arbitrarily so the general shapes of the distributions can be compared.

Where did I go wrong? I’m concerned that the event “exactly one of n trials with \(x > x_o\)” isn’t precisely the same as “finding the peak value”. (For this reason, I also tried computing the probability of “exactly 0 trails with \(x > x_o\)” and then using 1-(this probability). This gave very slightly different results but not enough to explain the difference shown by the graphic.) Another possibility is that my analytical approach is correct, but that MATLAB’s double precision math begins to fail when a number very close to 1 is raised to the 1e6 power. Or that MATLAB’s randn() method is biased if you push it to this level, leading my empirical results to be misleading.

Thanks for any ideas.