Hi,
I intend to estimate household-level gas consumption, through a survey, for a city with a population of 1 million and 200,000 houses. The city comprises 10 small towns, each of around 20,000 houses. Of these 200,000 houses, only 10,000 (scattered across the 10 towns) have gas meters installed and are billed on the basis of actual consumption. The remaining 190,000 are billed on a presumptive basis, so no actual consumption figures are available for them.
As I want to determine actual consumption for each household, I have the following questions:
1. Should I pick my entire sample from the 10,000 houses (with meters), and would it be considered representative for the whole city?
2. If I randomly pick the towns and then households within towns, does my sample suffer from selection bias?
3. What could be an appropriate formula for sample size determination?
Thanks for your help.

1. Whether only the 10,000 houses with gas meters should be measured depends on how reliable you believe the estimated gas consumption for the other 190,000 households to be (if it were my house, I would hope I'm being charged fairly!). As for whether those 10,000 households are representative of the whole city, you would need to ask: is there any reason that houses with a gas meter might have different gas consumption patterns from houses without one? If so, then those houses may not be wholly representative.

2. If you randomly pick the towns, and then randomly choose houses within the towns, then you wouldn't have to worry about selection bias. However, be sure that the estimator you use for average gas consumption is the one intended for two-stage cluster sampling (the sampling scheme you just described) rather than the one for a simple random sample.

3. Are you talking about the number of towns, the number of houses within towns, or both? Determining these would require estimates of the variance in gas consumption, including the within-town variance, and would also need to factor in survey costs.
...
I was about to post a formula, but it occurred to me that this sounds suspiciously like a homework question. If so, be sure to label it as such, as I would not want to be complicit in acts of academic dishonesty.
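
On point 2, here is a minimal sketch of how the two-stage estimator differs from the simple-random-sample one, assuming towns are drawn by simple random sampling and houses are drawn by simple random sampling within each chosen town. The function names, town sizes, and readings are all made up for illustration; the estimator itself is the standard unbiased one (scale each town's sample mean by its size, then scale up from sampled towns to all towns).

```python
from statistics import mean

def two_stage_mean(sampled_towns, n_towns_total, n_houses_total):
    """Estimate mean consumption per house under two-stage cluster sampling.

    sampled_towns: list of (houses_in_town, [sampled readings]), one per sampled town.
    """
    n_sampled = len(sampled_towns)
    # Estimated total consumption for each sampled town: town size times its sample mean.
    town_totals = [m_i * mean(readings) for m_i, readings in sampled_towns]
    # Scale up from the sampled towns to all towns, then divide by all houses.
    total_hat = n_towns_total / n_sampled * sum(town_totals)
    return total_hat / n_houses_total

def pooled_mean(sampled_towns):
    """Naive mean that treats every reading as if it came from one simple random sample."""
    return mean(r for _, readings in sampled_towns for r in readings)

# Hypothetical data: 3 of 10 towns sampled, readings in arbitrary units.
sample = [
    (25_000, [100, 110, 120]),
    (15_000, [80, 90]),
    (20_000, [95, 105, 100, 100]),
]
print(two_stage_mean(sample, n_towns_total=10, n_houses_total=200_000))  # about 100.42
print(pooled_mean(sample))  # 100
```

The two numbers differ because the pooled mean ignores both the town sizes and the unequal number of houses sampled per town. The variance formulas differ as well, which is not shown here.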

Thanks for your reply.
First, to clarify: this is not a homework question. It is something I am trying to resolve as part of an economics research project in which we plan to determine the own-price elasticity of demand. I just simplified the numbers for ease of explanation.
1. Looking at their geographical spread, I would say that the 10,000 households are representative of the entire city. I would try to confirm this on the basis of spatial location and other attributes such as house size, household size, income, etc. Any idea what test I can use for this? I suppose such tests would be enough for confirmation. Right?
2. The pointer to two-stage cluster sampling is really helpful, and I will follow it when choosing the estimators.
3. Yes, I am talking about both the number of towns and the number of houses within towns. Please elaborate a little more on how this affects the factoring in of survey costs. Also, I am not sure how to determine the share of each town in the random sample.
4. If you can post a formula it would be great, as I am an Econ person and never got formal training in surveys and sampling.
When using any formula for sample size, I am not sure whether my population would be 200,000 or just 10,000. Also, do I have to calculate the sample size for the whole population and then allocate surveys to houses in each town according to the proportion of houses in each town, or do I have to determine the sample size for each randomly picked town and then sum them up to get the total sample size?
Thanks again for your time and knowledge sharing.

1. That would be an indirect way of assessing their representativeness, but if these other factors are well associated with gas consumption, then it does sound reasonable. You could use standard tests to compare the distributions of these attributes between the metered and unmetered households.

2. Is there any reason for only choosing a sample of towns as opposed to all of them? If choosing all 10 towns is not a big deal, then you could effectively consider it a stratified sampling scheme, in which case you would have much more precision in your estimates, since you eliminate the extra variability due to the random selection of towns.

3. Sample size determination in surveys is typically a trade-off between cost (e.g., time and expenses required for a measurement) and precision. If the measurement costs are higher for a particular town, for instance, then that would have to be balanced against the loss in precision from sampling less. If the costs are the same for each household, however, then that simplifies things.
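
To make the cost trade-off in point 3 concrete, here is a minimal sketch, assuming each town is treated as a stratum and using the textbook optimal-allocation rule: sample each town in proportion to N_h * S_h / sqrt(c_h), where N_h is the number of houses, S_h a guessed standard deviation of consumption, and c_h the cost per measurement in that town. The function name, town names, and every number below are hypothetical.

```python
from math import sqrt

def optimal_allocation(n_total, strata):
    """Split n_total across strata in proportion to N_h * S_h / sqrt(c_h).

    strata: dict mapping name -> (N_h houses, S_h guessed std dev, c_h cost per house).
    Returns unrounded sample sizes; rounding is left to the user.
    """
    weights = {name: N * S / sqrt(c) for name, (N, S, c) in strata.items()}
    scale = n_total / sum(weights.values())
    return {name: w * scale for name, w in weights.items()}

# Hypothetical towns: same size and variability, but town_b costs 4x as much to survey.
towns = {
    "town_a": (20_000, 10.0, 1.0),
    "town_b": (20_000, 10.0, 4.0),
}
allocation = optimal_allocation(300, towns)
print(allocation)  # town_a gets twice as many houses as town_b
```

With equal costs and equal variability the rule reduces to proportional allocation, which is the simpler case mentioned above.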

As I mentioned before, if the number of towns is small, and there is interest in town-specific averages, then a stratified sampling scheme would make sense, with each town as a stratum. If you're starting out with a predetermined total sample size (let's call this number $n$), and would like the proportion of houses from each town in the sample to reflect that of the population, then you can simply take a random sample from the entire pool of households, and each stratum would be self-weighting.
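
A small sketch of what "self-weighting" means here, assuming the sample is allocated to towns exactly in proportion to their size. The town names, sizes, and readings are made up, and the function names are just for illustration.

```python
from statistics import mean

def proportional_allocation(n_total, houses_by_town):
    """Give each town a share of n_total equal to its share of all houses."""
    total_houses = sum(houses_by_town.values())
    return {town: n_total * count / total_houses
            for town, count in houses_by_town.items()}

def stratified_mean(houses_by_town, readings_by_town):
    """Weight each town's sample mean by its share of all houses."""
    total_houses = sum(houses_by_town.values())
    return sum(houses_by_town[town] / total_houses * mean(readings)
               for town, readings in readings_by_town.items())

# Hypothetical two-town example with a total sample of 8 houses.
houses = {"a": 30_000, "b": 10_000}
shares = proportional_allocation(8, houses)      # {"a": 6.0, "b": 2.0}
readings = {"a": [10, 12, 14, 10, 12, 14], "b": [20, 24]}

weighted = stratified_mean(houses, readings)
unweighted = mean(r for town in readings.values() for r in town)
print(shares, weighted, unweighted)  # both means come out to 14.5
```

Because the sample shares match the population shares, the weighted stratified mean and the plain sample mean coincide, so no extra weights are needed at the analysis stage.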

If you need to figure out what the total sample size $n$ should be, then you'll need to decide on $e$, your desired margin of error. A formula you could use is

$$n = \frac{z_{\alpha/2}^2 \, v}{e^2}$$

where $z_{\alpha/2}$ is usually 1.96 by convention and $v$ is related to the variance of the overall estimated average. Unfortunately it's usually not possible to calculate $v$ directly, so it has to be an educated guess of what you think the variance of the estimated mean gas consumption might be ($v/n$). This would have to be based on prior research or perhaps a pilot study.
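
As a rough worked example of the margin-of-error calculation n = z^2 * v / e^2, here is a minimal sketch. It assumes v is a guess such that v / n approximates the variance of the estimated mean (under simple random sampling with the finite population correction ignored, that is roughly the population variance of household consumption). The function name and the numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size(v, margin, confidence=0.95):
    """Smallest n with z^2 * v / n <= margin^2, i.e. n = z^2 * v / margin^2 rounded up."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * v / margin ** 2)

# Hypothetical guess: standard deviation of 20 units (v = 400), margin of error of 2 units.
n_needed = sample_size(v=400, margin=2)
print(n_needed)  # 385
```

Halving the margin of error roughly quadruples the required sample, which is why the guess for v and the choice of margin matter more than the population size in this calculation.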

If you're interested, a useful resource is Sampling: Design and Analysis by Sharon Lohr.