Discussing a model design

#1
I am a software developer and like stats, but I am no professional, so I would like your opinion on how to build a concrete statistical model.
The goal is to understand which foods are eaten most in Austria.
I would like to know what you think of my approach: I go to a supermarket, observe the checkout for an hour, and record every food item that is bought.
This is my sample.
The population is the overall consumption per food item, e.g. 23000 liters of Coca-Cola, 600000 apples.
Is this a wise design? Do I need to record more?
In the end I want to know the 30-percentile of the most eaten foods.
How would you do it?
 

Dason

Ambassador to the humans
#2
That design actually only gives you a sample size of 1 - which really restricts your ability to make inferences. There is no randomization, and if you're hoping to generalize to all of Austria then visiting a single supermarket isn't going to cut it.

I'm also a little confused by what you mean by the "30-percentile of the most eaten foods".
 
#3
I think that a supermarket is already statistically optimized: shelf space costs money, so the store has to decide which products to stock. Would it be better to sample the shelves too, because they already reflect the truth about Austrian taste?
If I sample more than one supermarket, how many would I need to get the accuracy required for my mission?
My mission restated: by "30-percentile of the most eaten foods" I mean the 30 percent most eaten food items in Austria. I think it's called the 70th percentile.

thank you!
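
To make the target quantity concrete, here is a minimal sketch of what "the 30 percent most eaten food items" could mean in code. The function name `top_items` and all the counts are hypothetical, invented purely for illustration, and it assumes every item is counted in the same unit (pieces sold), which glosses over the liters-versus-apples problem.

```python
import math

def top_items(consumption, percent=30):
    """Return the names of the top `percent` percent of items, ranked by consumption."""
    # Sort item names from most to least consumed.
    ranked = sorted(consumption, key=consumption.get, reverse=True)
    # Number of items that make up the requested share of the item list.
    k = math.ceil(len(ranked) * percent / 100)
    return ranked[:k]

# Hypothetical unit counts, not real data.
counts = {
    "apples": 600000, "bread": 450000, "milk": 400000, "coca-cola": 23000,
    "rice": 90000, "butter": 120000, "eggs": 300000, "pasta": 150000,
    "cheese": 200000, "yogurt": 180000,
}

print(top_items(counts))  # ['apples', 'bread', 'milk']
```

Note that this ranks items that appear in the data at all; anything never observed cannot show up in the list.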
 

Dason

Ambassador to the humans
#4
You most definitely need more than one supermarket. Otherwise all you're getting is a small sample from a single supermarket on a single day. Does that supermarket even carry all of the food items in Austria? Probably not. Do the people that visit that one particular supermarket really reflect a random sample of everybody in Austria? My guess is no - there is probably some community effect.

Also - if you only sample supermarkets then you're really only sampling the foods that people who shop at supermarkets get.

When you're talking about the 70th percentile of most eaten foods - I still say that isn't really a well defined quantity. I think I understand what you're trying to say but I don't understand why you want to know that.
 
#5
It is about the quality of the database. I want to be sure that the most eaten foods are in the database. The percentile is arbitrary but should do the job. I expected you to say: given the sample size, the population size of Austria, and your 70th percentile, apply this formula to calculate the number of samples needed. Isn't this the statistician's way? Thank you.
 

fed1

TS Contributor
#6
One way would be to select (at random), say, 5 supermarkets from all those in Austria and repeat your experiment at each.
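
A minimal sketch of that random selection, assuming you can get hold of a complete list of stores (the sampling frame). The store IDs, the count of 200 stores, and the function name `pick_stores` are all hypothetical placeholders.

```python
import random

def pick_stores(store_ids, n_stores=5, seed=None):
    """Draw a simple random sample of stores, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(store_ids), n_stores)

# Hypothetical sampling frame: one ID per supermarket.
all_stores = [f"store_{i:03d}" for i in range(1, 201)]

chosen = pick_stores(all_stores, n_stores=5, seed=42)
print(chosen)
```

Every store has the same chance of being picked, which is what lets you generalize beyond the stores you actually visit.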

I wonder if supermarkets would make a list of sold inventory available to you? Standing by the register might creep people out!!!
 
#7
But why 5? What must I consider when choosing the sample size? The inventory thing is not quite right, I need the cash register's records, but yes, I thought about this too as an alternative.
 

noetsi

No cake for spunky
#8
What you have to consider when choosing a sample size is how much certainty you want and how much error you will accept. For a given confidence level (commonly 95%, i.e. alpha = .05), a certain sample size will yield a specific margin of error - how much error there is on either side of your results. It also matters how many total stores there are in your population (as well as effect size, but since you will almost never know this, it's normally set at the most conservative assumption).
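
As a rough illustration of how these factors interact, here is a sketch of the textbook sample-size formula for estimating a proportion from a simple random sample, with a finite population correction. The function name and defaults are my own choices, and the 50% proportion is an assumption standing in for the "most conservative" setting, since p(1 - p) is largest at p = 0.5.

```python
import math
from statistics import NormalDist

def sample_size(population=None, margin=0.05, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within +/- `margin`.

    population: total number of units, or None for an effectively infinite one.
    p: assumed proportion; 0.5 is the most conservative choice.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z**2 * p * (1 - p) / margin**2  # size for an infinite population
    if population is None:
        return math.ceil(n0)
    # Finite population correction: small populations need fewer samples.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size())                  # 385
print(sample_size(population=20000))  # 377
```

This applies to estimating a single proportion; it is not a formula for the "top 30 percent of foods" question as such, which would first need a precise definition.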

Here is one tool I use at work to calculate sample size. It suggests some of the factors involved. Note that it is common to oversample if there is a particular stratum that does not respond (you then statistically adjust for the oversampling). The key here is that the number given is the number of actual responses, not the number you intend to interview. If you need 200 and you believe only 20% will respond, you send out (or visit) 1,000.

http://www.raosoft.com/samplesize.html
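
The response-rate arithmetic above can be sketched as follows. The function name `invitations_needed` is hypothetical, and the response rate is whatever you guess in advance, not something the calculator tells you.

```python
import math

def invitations_needed(responses_needed, response_percent):
    """Number of units to contact so that the expected responses reach the target."""
    if not 0 < response_percent <= 100:
        raise ValueError("response_percent must be in (0, 100]")
    # Integer percent keeps the arithmetic exact before rounding up.
    return math.ceil(responses_needed * 100 / response_percent)

print(invitations_needed(200, 20))  # 1000
```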