How to create a subset with a normal distribution

#1
Hello,

I have a presumably easy question for you, but I am a real newbie in statistics, so please be patient. I have a data set containing N values that more or less follows a normal distribution. I need to select N/10 values from it, creating a subset with the same distribution. How can I do that? Is there any software you would recommend? I usually use PSPP.

Thank you in advance,
Elisa
 

hlsmith

Not a robit
#2
You can easily partition data into subsets. Do you want to sample with replacement or without (with replacement, an observation could be selected more than once)? Do you want the subset to have the same distribution moments, or just to be a random subset?

You can do this in almost any program. If a program does not have a direct function for it, a typical workaround is to simulate a random continuous variable with the same number of values as your data set. Next, merge it into your data set. Finally, sort the file on the new variable and select the first N/10 observations after the sort. You will want to use a seed value when simulating to ensure the process is repeatable.
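In case it helps, here is a minimal sketch of that workaround in R (the same steps should carry over to other programs). The data, the seed value, and the names `your_data`, `key`, `merged`, and `subsample` are made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the draw can be repeated

# Hypothetical stand-in for the real data: 200 roughly normal values
your_data <- rnorm(200, mean = 50, sd = 10)
N <- length(your_data)

# Step 1: simulate one random continuous value per observation
key <- runif(N)

# Step 2: merge it with the data
merged <- data.frame(value = your_data, key = key)

# Step 3: sort on the new variable and keep the first N/10 rows
sorted <- merged[order(merged$key), ]
subsample <- sorted$value[seq_len(floor(N / 10))]

length(subsample)  # 20 values, each taken once from your_data
```

Because every observation gets its own random key, sorting on the key shuffles the rows, so the first N/10 rows are a random sample without replacement.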
 

j58

Active Member
#3
A random sample of the values will have the same theoretical distribution as the full set. So, generate a random sample (without replacement) of size (N/10) from the integers 1,2,...,N, and then pick the values in your data that have those positions in your data set. For example if 7 appears in your random set of integers, add the 7th value to your subset. In R, this is two lines of code:

idx <- sample(N, N/10)
subset <- your_data[idx]
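A slightly expanded sketch of the same idea, assuming you want the draw to be repeatable and N may not be an exact multiple of 10. The simulated data, the seed, and the names `k` and `sub` are hypothetical.

```r
set.seed(1)  # arbitrary seed for repeatability

# Hypothetical data set with N = 57 values
your_data <- rnorm(57, mean = 0, sd = 1)
N <- length(your_data)

# Round N/10 down so the sample size is a whole number
k <- floor(N / 10)  # 5 for N = 57

# sample() draws without replacement by default
idx <- sample(N, k)
sub <- your_data[idx]

# Compare the subset mean with the full-data mean
c(mean_full = mean(your_data), mean_sub = mean(sub))
```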
 
#4
This sounds pretty easy, but in my newbie mind the quality of this sample should depend on the size of the data set, shouldn't it? I mean, if the original data set is "small" (say 50 values), is a 5-value subset still normally distributed?
 

hlsmith

Not a robit
#5
It will be one possible realization of a subsample from your data. So yes, its moments (mean, variance, and so on) can differ quite a bit from those of the full data set. But this is the general nature of small samples.
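To get a feel for how much a 5-value subsample can vary, one option is to repeat the draw many times and look at the spread. This is only a rough sketch in R with simulated data; the seed, the data, and the names `full`, `sub_means`, and `sub_sds` are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed for repeatability

# Hypothetical "small" data set of 50 roughly normal values
full <- rnorm(50, mean = 100, sd = 15)

# Draw 2000 subsamples of size 5 (without replacement) and
# record the mean and standard deviation of each one
sub_means <- replicate(2000, mean(sample(full, 5)))
sub_sds   <- replicate(2000, sd(sample(full, 5)))

mean(full)        # mean of the full data set
range(sub_means)  # how far individual subsample means can stray
range(sub_sds)    # subsample standard deviations vary a lot too
```

On average the subsample means land close to the full-data mean, but any single subsample can be well off.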
 

j58

Active Member
#6
A finite sample can never have a normal distribution. A normal distribution is a theoretical abstraction. As hlsmith implies, if you draw a sample from a normal distribution, the larger the sample, the more closely it will resemble a normal distribution. But theoretically, every random draw from a normal distribution can be thought of as a random variable with a normal distribution. Whether or not that satisfies what you're trying to do is unclear. You didn't provide much in the way of background information.
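One way to see the "larger sample, closer resemblance" point is to measure the largest gap between a sample's empirical CDF and the normal CDF (the Kolmogorov-Smirnov statistic) for a few sample sizes. This is a rough sketch in R, not a formal test. The sample sizes, the seed, and the helper `ks_distance` are my own choices for illustration.

```r
set.seed(2024)  # arbitrary seed for repeatability

# Hypothetical helper: draw n standard normal values and return the
# largest gap between their empirical CDF and the normal CDF
ks_distance <- function(n) {
  x <- rnorm(n)
  unname(ks.test(x, "pnorm")$statistic)
}

sizes <- c(5, 50, 500)

# Average the gap over 500 repeated draws for each sample size
avg_dist <- sapply(sizes, function(n) mean(replicate(500, ks_distance(n))))
names(avg_dist) <- sizes

avg_dist  # the average gap shrinks as the sample size grows
```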