How is simulation used to check differences between regression techniques?

gmoj

New Member
#1
I was reading a paper, Bootstrapping with Models for Count Data (page 1170):

www.researchgate.net/publication/51738951_Bootstrapping_with_Models_for_Count_Data or www.ncbi.nlm.nih.gov/pubmed/22023684

In the paper, the author uses conventional MLE to fit Poisson models, but also fits the same model with a bootstrap technique (re-sampling the observations 5000 times).
The bootstrap technique gives larger standard errors for the estimates than the conventional method. The author then goes on to state:

"A simulation analysis was conducted to check the differences in the results from the study. In total 1000 sets of data similar to the observed data were simulated..........."

The author then says that a model is fit to the simulated data, and the results show that the standard errors from the conventional analysis were 3% too low.
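For context, here is roughly how I understand the bootstrap part, sketched in SAS with placeholder dataset and variable names (mydata, y, x1-x4 are mine, not from the paper):

%let nBoot = 5000; /* number of bootstrap resamples, as in the paper */

/* Draw 5000 resamples of the observations, with replacement */
proc surveyselect data=mydata out=boot seed=12345
     method=urs samprate=1 outhits reps=&nBoot;
run;

/* Refit the Poisson model on every resample */
proc genmod data=boot;
   by replicate;
   model y = x1-x4 / dist=poisson link=log;
   ods output ParameterEstimates=bootEst;
run;

/* The bootstrap SE of each coefficient is the standard deviation
   of its estimates across the 5000 resamples */
proc means data=bootEst std;
   class parameter;
   var estimate;
run;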

My questions:
1. Could it mean that the data resulting from the simulation are the real/actual data, so that we can use them to compare the performance of the original observed data with that from the bootstrap?

2. The observed data have 287 observations; how can we obtain "1000 sets of data" by simulation? I thought that would be bootstrapping again.

3. What method could have been used to run the simulated "1000 sets of data"?
 

spunky

King of all Drama
#3
1. Could it mean that the data resulting from the simulation are the real/actual data, so that we can use them to compare the performance of the original observed data with that from the bootstrap?
Well, simulated data are never the real or actual data themselves (hence the name "simulated"). But a good simulation study needs the computer to generate data that look reasonably close to data you'd find in your area of expertise, or to some idealized cases where you have extra control over the characteristics of the data (non-normality, missing data, etc.), so that you can see what effect those characteristics have on the statistical methods.

In the case of your article, there's this line that reads:

...data similar to the observed data were simulated with the expected values of counts given by Eq. (4) and with the Poisson error inflated by the factor 2.75 using the zero inflated count model where an observed count is either the value zero with probability p or a random value from a Poisson distribution with probability 1 − p.

This implies they are using parameters estimated from the real data as the population parameters in their simulation study.
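To make that concrete, here is a minimal sketch of how one such dataset could be generated in SAS. The values of p and lambda below are made up; in the paper, lambda would come from the expected values in Eq. (4) (and the error would be further inflated by the factor 2.75, which I'm ignoring here):

data zipSim;
   call streaminit(2023);
   p = 0.30;                 /* assumed zero-inflation probability       */
   lambda = 1.8;             /* assumed Poisson mean from a fitted model */
   do i = 1 to 287;          /* same size as the observed data           */
      if rand("Uniform") < p then y = 0;    /* zero with probability p        */
      else y = rand("Poisson", lambda);     /* Poisson draw with prob. 1 - p  */
      output;
   end;
   keep y;
run;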


2. The observed data have 287 observations; how can we obtain "1000 sets of data" by simulation? I thought that would be bootstrapping again.
Because each data set is created by the computer with the population and distributional characteristics described above. Once the distribution from which the data will be sampled is defined in the computer, you can ask it to give you any number of random draws that will become the datasets to analyze. So each dataset is "new" in the sense that it's being sampled from the distribution defined by the authors.
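As a rough sketch (again with a made-up generating model), once the distribution is coded up you just wrap it in an outer loop, and the loop index identifies the dataset:

data manySims;
   call streaminit(54321);
   do rep = 1 to 1000;                  /* 1000 simulated datasets            */
      do i = 1 to 287;                  /* each the size of the observed data */
         y = rand("Poisson", 1.8);      /* assumed generating distribution    */
         output;
      end;
   end;
run;

Each of the 1000 datasets can then be analyzed with BY processing (e.g. proc genmod ...; by rep; ...), and the spread of the 1000 sets of estimates tells you how the method behaves.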

3. What method could have been used to run the simulated "1000 sets of data"?
I'm not sure I follow what you're asking here. Do you mean which method was used to analyze each dataset, or which method was used to generate each dataset?
 

gmoj

New Member
#4

Your response is very insightful, thanks, but I want to ask how I should then simulate the independent variables.
Suppose I have a model and I use its coefficients to simulate the outcome; how should I simulate the independent variables when I don't know how they are distributed?

I was writing the same thing in SAS to simulate it; this is what I am talking about (you don't have to follow the code closely, just check the comments):

%let N = 287;       /* number of observations, matching the observed data */
%let nCont = 4;     /* number of predictors */

data SimReg1(keep= Y lambda x:);
   call streaminit(54321);
   array x[&nCont];

   /* Coefficients (intercept first) taken from some fitted model */
   array beta[0:&nCont] _temporary_ (1.6362 -0.6134 -0.4914 -0.1328 0.0324);

   do i = 1 to &N;
      /* How should I simulate this part? What is included below is just
         something I was trying, but I don't know how these variables
         are actually distributed. */
      x[1] = rand("Bernoulli", 0.5);
      x[2] = rand("Bernoulli", 0.5);
      x[3] = ceil(3 * rand("Uniform"));
      x[4] = rand("Bernoulli", 0.5);

      /* Linear predictor and Poisson mean */
      eta = beta[0];
      do j = 1 to &nCont;
         eta = eta + beta[j] * x[j];
      end;
      lambda = exp(eta);

      /* This simulates the dependent variable (counts) */
      Y = rand("Poisson", lambda);
      output;
   end;
run;
 

spunky

King of all Drama
#5
Suppose I have a model and I use its coefficients to simulate the outcome; how should I simulate the independent variables when I don't know how they are distributed?
From a quick glance, I don't see how you could get the necessary information to reproduce their simulation solely from what's published in the article. You'd need the actual dataset to try to do that, or you'd have to make assumptions that may or may not turn out to be true.
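If you do go the assumption route, the usual shape of it is something like the sketch below. Every probability here is invented for illustration; ideally you'd match them to whatever summary statistics the paper reports. (And if you had the actual dataset, you could skip this entirely: keep the observed covariates fixed and simulate only Y from the fitted model.)

data simX;
   call streaminit(99);
   do i = 1 to 287;
      x1 = rand("Bernoulli", 0.40);        /* assumed proportion for x1          */
      x2 = rand("Bernoulli", 0.55);        /* assumed proportion for x2          */
      x3 = rand("Table", 0.3, 0.3, 0.4);   /* assumed 3-level categorical for x3 */
      x4 = rand("Bernoulli", 0.25);        /* assumed proportion for x4          */
      output;
   end;
run;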