Generating info about a population from a sample

#21
Rightcoast, midcoast here. Welcome to the forum!

I would add the following to @Dason's code. Setting the seed ensures you get the same output on every future run, and hist() draws a histogram of the bootstrap sums.

Code:
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)
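N <- 9316                      # number of draws per resample
set.seed(42)                   # fix the RNG so the results are reproducible
# Resample N costs with replacement 10,000 times and total each resample
sums <- replicate(10000, sum(sample(dat, N, replace = TRUE)))
summary(sums)
quantile(sums, c(.025, .975))  # 95% percentile interval for the total
hist(sums)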

I also wrote the SAS code for you. Depending on your machine, it may take a while to run.

Code:
data dat;
input Costs;
datalines;
10277
33615
23442
11220
41321
40801
20896
44753
28659
19753
28760
24537
20536
20959
5693
8290
28715
41550
18459
9197
28955
46149
25273
45867
24716
43519
27884
37714
8001
42151
43197
27245
31736
9503
14946
;
proc surveyselect data=dat
method=urs
sampsize=9316
rep=10000
seed=42
out=boot_dat
outhits;
id costs;
run;

proc means data = boot_dat noprint;
var costs;
class Replicate;
output out= wanted_sums
sum(costs)  = sum_costs;
run;

proc univariate data=wanted_sums noprint;
where _TYPE_ NE 0;
var sum_costs;
output out=Pctl pctlpre =CI95_
pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
pctlname=Lower Upper;
run;

proc print data=Pctl noobs; run;
proc sgplot data=wanted_sums;
where _TYPE_ NE 0;
label sum_costs= ;
histogram sum_costs;
run;

Thanks so much hlsmith, this is really helpful! It's really interesting to see the SAS steps necessary to make this happen. I have run both the R and SAS versions, and doing so has raised a couple of questions that I'm hoping you can help with:

First off, the two programs don't produce the same results:
In SAS we've got this:
Mean 249 757 864
Lower 247 350 043.5
Upper 252 107 973.5

and R gives us:
Mean 260 400 000
Lower 258 039 402
Upper 262 791 834

Is this because the two programs handle the "seed" differently, or is there something in the way the two programs handle the task that would account for the difference?

Second question:

In the SAS output wanted_sums, I don't understand what's happening with the first record (replicate = .). It looks almost like a summary record giving the overall sum, but it's confusing and it seems to be throwing off the numbers. Any thoughts on what that could be?

Once again, your help is greatly appreciated. I'm just trying to understand what's happening in the code.

Thanks so much

Rightcoast
 

Dason

Ambassador to the humans
#22
I think there was a small mistake in the data import in SAS. If the input is different, then we wouldn't expect the same results.
 
#23
Nice catch, Dason. I found the "missing 4", which brought things much closer together. They're still a little bit different, but definitely much closer than before. It turns out that using the right numbers makes a big difference ;-)

Thanks!
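
For anyone following along, here is a rough sanity check in R of why one mistyped value moves the answer so much. It is only a sketch: the vector `dat_typo` and the helper `expected_total` are made-up names for illustration, and it relies on the fact that a bootstrap total of N draws is centred on N times the sample mean. Entering 9197 instead of 49197 lowers the sample mean by 40000/35, which shifts the expected total by roughly 10.6 million.

```r
# Costs as entered in the R code
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)
N <- 9316

# Same data with the SAS keystroke error: 9197 where 49197 should be
# (dat_typo is a hypothetical name)
dat_typo <- replace(dat, dat == 49197, 9197)

# Expected bootstrap total = draws per resample * sample mean
# (expected_total is a hypothetical helper, not part of the original code)
expected_total <- function(x, n_draws) n_draws * mean(x)

total_r   <- expected_total(dat, N)
total_sas <- expected_total(dat_typo, N)

print(c(R = total_r, SAS_typo = total_sas, gap = total_r - total_sas))
```

The gap of about 10.6 million lines up with the difference between the two reported means (260 400 000 vs 249 757 864), so the typo alone appears to account for nearly all of the mismatch.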
 

hlsmith

Not a robit
#24
Yes, I wrote that pretty quickly and didn't double-check my work. Honestly, I wondered in the back of my mind if I could have had a data entry error, since I went fast. I corrected the value. Let me know if there were any other keystroke errors.

Also, I would imagine that if you increased the number of samples toward infinity, the results would get pretty close to converging. In a course once, I remember asking a professor if we needed to use a certain seed for a simulation study, and he said, well, if you all run 1M, all the answers should be close enough. I love the seasoned-veteran approaches sometimes, such as when they just use 2 instead of 1.96 when getting 95% CIs. Yeah, it's in the ballpark.
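
A small R sketch of both points, under stated assumptions: it uses 2,000 replicates per run (fewer than the 10,000 in the thread, just to keep it quick), the seeds 1 and 2 are arbitrary, and `boot_sums` is a made-up helper name. The last part is a normal approximation (mean plus or minus a multiplier times the standard deviation of the bootstrap sums), not the percentile interval used in the code earlier in the thread; it is only there to show how little changes between 1.96 and 2.

```r
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)
N <- 9316

# Hypothetical helper: bootstrap totals for a given seed
boot_sums <- function(seed, reps = 2000) {
  set.seed(seed)
  replicate(reps, sum(sample(dat, N, replace = TRUE)))
}

sums_a <- boot_sums(1)
sums_b <- boot_sums(2)

# Percentile intervals from two different seeds: close, but not identical
ci_a <- quantile(sums_a, c(.025, .975))
ci_b <- quantile(sums_b, c(.025, .975))
print(rbind(seed_1 = ci_a, seed_2 = ci_b))

# Normal-approximation intervals with the exact multiplier and with 2
z <- qnorm(0.975)  # about 1.96
ci_z <- mean(sums_a) + c(-1, 1) * z * sd(sums_a)
ci_2 <- mean(sums_a) + c(-1, 1) * 2 * sd(sums_a)
print(rbind(z_exact = ci_z, z_two = ci_2))
```

With more replicates, the two percentile intervals should drift closer together, which is the professor's point about not worrying too much over the seed.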
 
#25
Lol, no stats are correct, but some of them are useful. Thanks again for your help!
 

hlsmith

Not a robit
#26
Yeah, I don't know what that overall sum in the data set is all about; I just dropped it in the code using
"where _TYPE_ NE 0;"

I would imagine using a proc sql command would have been better, but I am not that quick using that proc.
 
#27
I'll see if I can figure it out, and will post back here if I come up with anything.