Is a subset of a normal distribution normal?

#1
An easy one for some of you guys here I suspect. Thanks to anyone who can help me with it.

I have an application which adds a normally distributed random error to data values in a single data stream as they are generated. The error generator is based on the Box-Muller transform. The data comes from a number of different devices, with the data from each being mixed randomly within the data stream. After the error is applied, the altered values are routed and dealt with according to the device which generated them.

Is it true to say that, as the overall data stream error is of Normal distribution, the subset of the data for each individual device will also have Normal distribution errors? Or will I have to set up functionality to deal with each device's data separately?
 
#2
okay, so let's try to understand exactly what is going on.

Box-Muller transform:
http://en.wikipedia.org/wiki/Box-Muller_transform

you are using this to generate the error term that is added to the output of all the different devices, and this error term is normally distributed?

so let's say there are five different devices. How are the devices generating their output? Is each one generating its output from a different distribution (normal, gamma, Weibull)?

I think that what you have got here is the sum of two independent random variables, so to find the distribution of the sum you need to use the convolution formula:

f(x) -- pdf for first distribution
g(x) -- pdf for second distribution

$$(f * g)(z) = \int_{-\infty}^{\infty} f(x)\, g(z - x)\, dx$$

although the sum of two independent normally distributed variables is also a normally distributed variable

http://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables
http://en.wikipedia.org/wiki/Normal_distribution

it is more complicated when one of the other distributions is something else.
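
The normal-plus-normal case can be checked numerically. The following is only a rough sketch: the means, standard deviations, sample size and seed are made-up values for illustration, and it uses Python's built-in `random.gauss` rather than a Box-Muller routine.

```python
import random
import statistics

def sample_sum(n, mu1, sigma1, mu2, sigma2, seed=0):
    """Draw n values of X + Y with X ~ N(mu1, sigma1^2), Y ~ N(mu2, sigma2^2), independent."""
    rng = random.Random(seed)
    return [rng.gauss(mu1, sigma1) + rng.gauss(mu2, sigma2) for _ in range(n)]

# Illustrative parameters only (not taken from the thread).
sums = sample_sum(200_000, mu1=1.0, sigma1=2.0, mu2=-3.0, sigma2=1.5)

mean = statistics.fmean(sums)
var = statistics.pvariance(sums)

# Theory for independent normals: means add, variances add.
# Expected mean = 1.0 + (-3.0) = -2.0, expected variance = 2.0**2 + 1.5**2 = 6.25
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The sample mean and variance should land close to -2.0 and 6.25. This only checks the first two moments, so it is a sanity check rather than a proof of normality.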

feel free to talk about the setup and the application for it in more detail. This can help to understand what is going on better.


David
 
#3
Thanks for the response. I should have made the situation clearer in the first posting.

I am working on seismic vessel simulator software. It has a large number of navigation and positioning devices each generating their own data stream of readings but these base readings have no real bearing on the question. Some are GPS devices, others may be echo sounders, range bearing devices, acoustic devices, depth gauges etc. As their data is simulated from just a user determined task, path and system geometry, it is without random error and consequently unnatural. This data is put out at various times in just a serial fashion, one device after the other, and dealt with in turn.

I need to be able to specify significant devices in the setup and add noise to each of the data values that each of those chosen devices puts out. The remit is that the noise for each device must be normally distributed. The Box-Muller functionality generates a single stream of errors, independently of which device each error is for, by simply keeping track of an integral value from step to step. It is also able to scale them to match each device's needs without affecting the distribution. I then simply wait for each data value to be generated, generate the next error, add it onto the data value and pass it back to the relevant device for further processing. Voila, real world(ish) data from sterile generated data.
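
For anyone following along, the setup described above might look roughly like the sketch below. It is not the actual simulator code: the class name `BoxMullerNoise`, the device names, the sigma values and the seed are all hypothetical, and Python's `random.Random` stands in for whatever seeded uniform generator the real application uses.

```python
import math
import random

class BoxMullerNoise:
    """Single shared source of normally distributed errors (basic Box-Muller form)."""

    def __init__(self, seed=12345):
        self._rng = random.Random(seed)  # assumption: stand-in uniform generator
        self._spare = None               # second value of each Box-Muller pair

    def next_standard(self):
        """Return one N(0, 1) value."""
        if self._spare is not None:
            z, self._spare = self._spare, None
            return z
        u1 = 1.0 - self._rng.random()    # in (0, 1], so log(u1) is always defined
        u2 = self._rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        self._spare = r * math.sin(theta)
        return r * math.cos(theta)

    def next_error(self, sigma):
        """Scale a standard normal value to the device's standard deviation."""
        return sigma * self.next_standard()

# Hypothetical per-device noise levels (units depend on the device).
DEVICE_SIGMA = {"gps": 2.5, "echo_sounder": 0.3}

noise = BoxMullerNoise(seed=42)
reading = 1500.0                                         # clean simulated value
noisy = reading + noise.next_error(DEVICE_SIGMA["gps"])  # value handed back to the device
print(noisy)
```

Scaling by `sigma` is a linear operation, so the scaled error is still normal with standard deviation `sigma`.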

I have two choices here. I could set up a noise generator for each device with the complexity of managing them all, keeping track of which are active and their current integer seeds etc, but that could eventually end up in hundreds. Or I could use a single generator to create the error values for every device as it needs them.

I simply need to know whether the second option, a single error generator, would still give each of them a normally distributed set of errors when it is accessed for values by devices at random. There is no complication of adding a number of data streams together, only of randomly dividing a single one into a number of smaller discrete data streams. The single overall data generated by the Box Muller function is normally distributed, every value it produces is used for a device somewhere, but do the individual data streams pulled out of that at random conform to normal distribution also?
 
#4
I think that you are probably okay with the second approach. However, it may be necessary to scale the random numbers, in which case I think you are alright if the transformation is linear: a*(random_number)+b. This follows from the theorem that if X is N(mu, sigma^2) then a*X+b is N(a*mu+b, a^2*sigma^2).

Consider the following situation. Suppose you go to the social security administration and get 10,000 random social security numbers. Then you divide these 10,000 social security numbers into 10 different sets of 1,000 numbers each. For each social security number you find the person and measure their height (heights are approximately normally distributed within the American population; of course this could be time consuming to do in reality :)). So the first set has records 1-1,000, the second set has records 1,001-2,000, etc. Now let's say you do a histogram of the first group (1-1,000). This will be a nice bell shaped histogram. The second group (1,001-2,000) will also be a nice bell shaped histogram.

If you do the entire group (1-10,000) you will also get the bell shaped curve.

Certainly you would also get the same result if instead of taking the first 1-1,000 you considered records (1-9,999) and took every record for which the last digit is a 1 (this will be one thousand records), and made a histogram of it. Or if you took every record for which the last digit is a 2 and made a histogram.

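So all of these subsets are still numbers from a normal distribution. The important point is that the way the records are split up has nothing to do with the heights themselves. If you picked the subset by looking at the values (say, only the people over six feet), it would no longer be normal.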
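
Here is a rough simulation of the same idea applied to a single error stream shared between devices. Treat it as a sketch under assumptions: the device names, the seeds and the sample size are made up, `random.gauss` stands in for the Box-Muller generator, and the routing is done by a separate random generator so that it is independent of the error values.

```python
import random
import statistics

rng_noise = random.Random(1)   # the single shared error stream
rng_route = random.Random(2)   # decides which device asks next, independent of the errors

devices = ["gps", "echo_sounder", "depth_gauge", "acoustic"]  # hypothetical names
per_device = {name: [] for name in devices}

for _ in range(400_000):
    error = rng_noise.gauss(0.0, 1.0)
    per_device[rng_route.choice(devices)].append(error)

def summarize(xs):
    """Mean, standard deviation, and fraction of values within one standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    within_1sd = sum(abs(x - mean) <= std for x in xs) / len(xs)
    return mean, std, within_1sd

summary = {name: summarize(xs) for name, xs in per_device.items()}
for name, (mean, std, frac) in summary.items():
    print(f"{name:13s} mean={mean:+.3f} std={std:.3f} within 1 sd={frac:.3f}")
```

For a normal distribution about 68.3% of values fall within one standard deviation of the mean, so each device's subset should show a mean near 0, a standard deviation near 1 and a fraction near 0.683.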

You will also get the bell shaped curve if instead of plotting the heights you plotted some linear function of them, such as (1/2)*height + 10. This comes from our theorem that if X is N(mu, sigma^2) then a*X+b is N(a*mu+b, a^2*sigma^2).

But if the scaling is not linear then you may not have a normal distribution anymore. Certainly it is not the case that if X is N(mu, sigma^2) then any function f(X) is also normally distributed. For example the function may be f(x) = 0 * x, which is always 0 and so definitely not normally distributed. One could also imagine a stepwise function, which is not normally distributed either.

Also, if f(x) = x^2 then f(X) is not normally distributed, since it can never be negative.
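
The difference between linear and non-linear scaling shows up in a quick numerical check. This is only a sketch: the seed, the sample size and the choice of a standard normal X are assumptions for illustration, and sample skewness is used as a crude symmetry test (it is 0 for any normal distribution).

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: average cubed z-score. Close to 0 for normal data."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(7)
x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]  # X ~ N(0, 1), illustrative

linear = [0.5 * v + 10.0 for v in x]   # a*X + b with a = 0.5, b = 10
squared = [v * v for v in x]           # f(X) = X^2, not a linear transform

# Linear: mean should be a*mu + b = 10, variance a^2 * sigma^2 = 0.25, skewness about 0.
print(statistics.fmean(linear), statistics.pvariance(linear), skewness(linear))

# Squared: never negative and strongly skewed to the right, so not normal.
print(min(squared), skewness(squared))
```

The linear version keeps a symmetric bell shape with the shifted mean and scaled variance, while the squared version is bounded below by 0 and heavily right-skewed.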

David
 
#5
Thank you for that very clear and concise explanation. That is good news for me. What you have described in your example seems to be much on the same lines as what I have happening, so my conclusion is to go ahead with a single generator which can have its result distributed amongst the devices across my system. This is much easier than having a series of generators, one for each device, created and destroyed as I require them, each with their own separate random seed to track.

Thanks again, I really appreciate the time you have spent on helping me with this, it has made my task a lot easier.