Correction: assume I sample without replacement. That is, I don't replace observations between samples, so each pseudo-group of observations contains a unique subsample.
I have a slightly different twist on the usual application of bootstrapping and regression. Here’s the problem: I have a population of people within a particular group (e.g. all the people living within a census tract). I want to build a model to predict a continuous outcome at the group level (e.g. the % of people who will become infected with a disease). For independent variables, I want to use, for example, % male, average age, and average income. The problem is that I have an n of 1; that is, I have only one census tract. Yet I want the model output to be expressed at the group level of the hierarchy, so that I can say things like: as the percentage of males in the census tract increases by 1 unit, the % infected increases or decreases by the coefficient x. Imagine we have only 10 census tracts in total, and I want a custom model for each one. With only 10 tracts, I can't simply fit a model across them, because n=10 isn't really any better than n=1.
My thought was to bootstrap to create pseudo-samples. That is, create lots of artificial, smaller census tracts by randomly sampling maybe 10% of the census tract's population, then compute group-level statistics for each pseudo-tract: “% infected with disease” as the dependent variable, and % male, average age, and average income as the independent variables. Do this maybe 1000 times. Then build a regression model at this group (hierarchy) level to get the coefficients and predictions. I would sample with replacement; my samples would therefore form a sampling distribution centered on the real means.
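A minimal sketch of the procedure described above, assuming individual-level records are available as NumPy arrays. The function names (`make_pseudo_tracts`, `fit_ols`), the parameter names, and the simulated population are all hypothetical and only illustrate the mechanics; they say nothing about whether the approach is statistically valid. The `replace` flag switches between drawing each pseudo-tract with or without replacement.

```python
import numpy as np

def make_pseudo_tracts(male, age, income, infected,
                       n_groups=1000, frac=0.10, replace=False, seed=0):
    """Draw n_groups pseudo-tracts from one tract's individual records
    and return group-level predictors X and outcome y."""
    rng = np.random.default_rng(seed)
    n = len(male)
    k = max(1, int(round(frac * n)))      # people per pseudo-tract
    X = np.empty((n_groups, 3))
    y = np.empty(n_groups)
    for g in range(n_groups):
        # Indices of the individuals that form this pseudo-tract
        idx = rng.choice(n, size=k, replace=replace)
        X[g] = [male[idx].mean() * 100,   # % male
                age[idx].mean(),          # average age
                income[idx].mean()]       # average income
        y[g] = infected[idx].mean() * 100  # % infected
    return X, y

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, b2, b3]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Simulated individual-level data for ONE tract (illustration only, not real data)
rng = np.random.default_rng(42)
n_people = 5000
male = rng.integers(0, 2, n_people).astype(float)     # 1 = male, 0 = not male
age = rng.normal(40, 12, n_people)                    # years
income = rng.normal(50, 15, n_people)                 # thousands, assumed unit
infected = (rng.random(n_people) < 0.10).astype(float)

X, y = make_pseudo_tracts(male, age, income, infected)
coef = fit_ols(X, y)
print("intercept, % male, avg age, avg income:", coef)
```

Each row of `X` is one pseudo-tract, so the fitted coefficients are in group-level units (e.g. change in % infected per one-point change in % male across the pseudo-tracts).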
Ignoring the spatial aspects of my example, is this a valid approach? Is there a better sampling idea? A better modeling idea? If the pseudo-sampling idea is OK, do I need to consider something like a fixed effects model?
I’m struggling with how best to generate multiple samples in this scenario. We have only 10 census tracts, so I can’t reliably build a regression model using n=10 either. And since I want a separate model for each tract, we’re back to n=1.
Any help you can provide would be greatly appreciated.