Sample Categorization?

#1
Hello,

I was wondering if any of you knew the best method for dividing a large sample (approx. 30,000) into 'categories'. I need to feed the data into a model that is very slow (about one sample an hour), so cutting the number of samples down and taking a few representatives of each group would be much quicker, and perhaps a more sensible thing to do too.

Any ideas?

Thank you.
John
 

trinker

ggplot2orBust
#2
Do you have a particular program/language you're using? I could recommend an approach in R, but it may be useless if you don't use R or aren't willing to use it.
 
#3
Hi Trinker,

I use SPSS and have never used R, but if it's something that I can easily and quickly implement in R then I can install it and see what I get. The file is an xlsx; will I need to reformat the data or do anything else before importing it into R?

thanks
John
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Can you tell us more about the dataset? Is there a variable you are looking to use to split up the set? What would be the purpose of breaking it up and then running a model later? Are you thinking about cross-validation? You can always draw random samples or split based on another variable, but I am unsure what the end objective is.
 
#5
Hi hlsmith,

The sample is a data set containing information about buildings (i.e. area, height, and other variables); however, I am only looking to 'categorize' the data set based on the area. What I am trying to achieve is to see how many 'groups' I can create where the buildings in each group have similar areas. For example, there might be 1000 buildings with an area between 100 and 102. I would class those buildings as one group because they are very similar, and I could then take one building in that group to represent all 1000. Does this make sense?

Thanks
John
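
The grouping described above (buildings whose areas fall in the same narrow range form one group, with one building standing in for the rest) could be sketched roughly as follows. This is only an illustration, not a method anyone in the thread has confirmed: the function names, the bin width of 2, the choice of the middle building by area as the representative, and the synthetic areas are all assumptions.

```python
import random
from collections import defaultdict


def bin_by_area(areas, width=2.0):
    """Group building indices into fixed-width area bins.

    The bin width is a hypothetical choice; bins start at multiples of width.
    """
    groups = defaultdict(list)
    for idx, area in enumerate(areas):
        groups[int(area // width)].append(idx)
    return groups


def pick_representatives(areas, groups):
    """For each bin, pick the middle building when members are sorted by area."""
    reps = {}
    for key, members in groups.items():
        ordered = sorted(members, key=lambda i: areas[i])
        reps[key] = ordered[len(ordered) // 2]
    return reps


# Synthetic stand-in data, made up for illustration only.
rng = random.Random(0)
areas = [max(1.0, rng.gauss(73, 30)) for _ in range(30000)]

groups = bin_by_area(areas, width=2.0)
reps = pick_representatives(areas, groups)
print(len(groups), "groups, one representative building each")
```

The number of groups, and so the number of slow model runs, is driven almost entirely by the bin width, so that value would need to be chosen with the model's sensitivity to area in mind.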
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
Just for starters, it may be interesting to plot these data with a histogram to see if there are multiple modes.
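
For a quick look without any plotting package, a rough text histogram can be built with the Python standard library. This is a minimal sketch only: `text_histogram`, the bin width of 10, and the bar scale are hypothetical choices, and the sample values are synthetic rather than the actual building areas.

```python
import random
from collections import Counter


def text_histogram(values, width=10.0, scale=50):
    """Return one text line per bin, with a bar proportional to the bin count."""
    counts = Counter(int(v // width) for v in values)
    peak = max(counts.values())
    lines = []
    for b in range(min(counts), max(counts) + 1):
        bar = "#" * round(scale * counts.get(b, 0) / peak)
        lines.append(f"{b * width:7.1f} to {(b + 1) * width:7.1f} | {bar}")
    return lines


# Synthetic stand-in data, made up for illustration only.
rng = random.Random(1)
sample = [max(1.0, rng.gauss(73, 30)) for _ in range(5000)]
print("\n".join(text_histogram(sample)))
```

Several separate peaks in the bars would suggest natural groups; a single hump would suggest that any cut points are somewhat arbitrary.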
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
I am not personally sure of the right approach, since your data seem close to normal with a positive skew. Perhaps look into deciles or other groupings. Hopefully others can speak up with strategies.

Once again, what is the overall purpose? Why not just pull random samples out?
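
Both ideas mentioned here, deciles and random samples, can be combined by drawing a few random buildings from each decile of area. The sketch below is one possible reading of that suggestion, not a recommendation from the thread: the function names, the two buildings per decile, and the synthetic areas are all assumptions.

```python
import bisect
import random
import statistics


def decile_groups(values):
    """Split indices into 10 groups using the 9 decile cut points of the values."""
    cuts = statistics.quantiles(values, n=10)
    groups = [[] for _ in range(10)]
    for idx, v in enumerate(values):
        groups[bisect.bisect_right(cuts, v)].append(idx)
    return groups


def stratified_sample(groups, per_group, rng):
    """Draw up to per_group random indices from every group."""
    chosen = []
    for members in groups:
        chosen.extend(rng.sample(members, min(per_group, len(members))))
    return chosen


# Synthetic stand-in data, made up for illustration only.
rng = random.Random(2)
areas = [max(1.0, rng.gauss(73, 30)) for _ in range(30000)]

groups = decile_groups(areas)
picked = stratified_sample(groups, per_group=2, rng=rng)
print(len(picked), "buildings selected for the slow model")
```

Unlike fixed-width bins, deciles put roughly the same number of buildings in every group, so the groups in the tails cover a much wider range of areas than the groups near the middle.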
 
#10
Thanks hlsmith. The purpose is to cut the sample down in order to reduce the time needed for further processing of the data. If I take the mean from each of the groups where the data are similar, then I can process just one value for each group and the result will still be representative. I could just take the mean of the whole data set, but that wouldn't really be representative (mean 73, st. dev. 30).

thanks
John
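
The per-group mean idea can be sketched as below. The helper names and the three example areas are hypothetical, and weighting each group's result by its number of buildings is an added assumption about how the results would be combined, not something stated in the thread.

```python
from statistics import fmean


def summarize_groups(areas, groups):
    """Return (mean area, number of buildings) for every non-empty group."""
    return [(fmean([areas[i] for i in g]), len(g)) for g in groups if g]


def weighted_average(results, counts):
    """Combine one result per group, weighting by the group's size."""
    return sum(r * c for r, c in zip(results, counts)) / sum(counts)


# Tiny made-up example: two similar buildings and one much larger one.
areas = [100.0, 102.0, 200.0]
groups = [[0, 1], [2]]

summary = summarize_groups(areas, groups)
means = [m for m, _ in summary]
counts = [c for _, c in summary]

# Weighting the group means by group size recovers the overall mean area.
print(weighted_average(means, counts), fmean(areas))
```

If the slow model is not linear in area, running it on a group's mean area will not in general equal the average of its outputs over the whole group, so narrower groups should keep that gap smaller.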