Generating info about a population from a sample

#1
Hi All,

I hope this is the right place for this question, but let me know if it should go somewhere else. Here's what I'm trying to figure out.

I have a dataset that looks like this:

Unit #  2018 Repair Cost
     1         10,277.00
     2         33,615.00
     3         23,442.00
     4         11,220.00
     5         41,321.00
     6         40,801.00
     7         20,896.00
     8         44,753.00
     9         28,659.00
    10         19,753.00
    11         28,760.00
    12         24,537.00
    13         20,536.00
    14         20,959.00
    15          5,693.00
    16          8,290.00
    17         28,715.00
    18         41,550.00
    19         18,459.00
    20         49,197.00
    21         28,955.00
    22         46,149.00
    23         25,273.00
    24         45,867.00
    25         24,716.00
    26         43,519.00
    27         27,884.00
    28         37,714.00
    29          8,001.00
    30         42,151.00
    31         43,197.00
    32         27,245.00
    33         31,736.00
    34          9,503.00
    35         14,946.00

The data represents a simple random sample of 35 from a larger population of 9316 units. Based on this information, I am trying to figure out how to get an estimated total repair cost for all the units with 95% confidence intervals for the year. My initial thought was to use this data to calculate a mean repair cost for one unit in 2018 with associated confidence intervals. That I can do pretty easily. My question is, can I then just multiply that mean & upper and lower confidence limits by 9316 to get a total expected repair cost with 95% CI? Intuitively it feels like it should work, but my intuition has gotten me in trouble before in Stats. Any thoughts would be much appreciated.

Thanks so much

Mike
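
For anyone following along, the "CI for the mean, then multiply by 9316" idea in the question can be sketched in R roughly as below. This is only a sketch, assuming the 35 units really are a simple random sample. The variable names (`se_mean`, `ci_mean`, `ci_total`, and so on) are made up for illustration, and the finite population correction at the end is a standard survey-sampling refinement that the question itself does not mention.

```r
# 2018 repair costs for the 35 sampled units (from the question)
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)

N <- 9316          # population size
n <- length(dat)   # sample size

# t-based 95% CI for the mean repair cost of a single unit
se_mean <- sd(dat) / sqrt(n)
t_crit  <- qt(0.975, df = n - 1)
ci_mean <- mean(dat) + c(-1, 1) * t_crit * se_mean

# Multiply the point estimate and both limits by N to get the total
total_hat <- N * mean(dat)
ci_total  <- N * ci_mean

# Optional (assumption: sampling without replacement from a finite
# population): shrink the standard error by sqrt(1 - n/N)
fpc          <- sqrt(1 - n / N)
ci_total_fpc <- total_hat + c(-1, 1) * t_crit * N * se_mean * fpc

total_hat
ci_total
ci_total_fpc
```

With n = 35 out of N = 9316 the correction factor is very close to 1, so the two intervals are nearly identical.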
 

Miner

TS Contributor
#2
It may be a little more complicated than that. A histogram of the data shows some gaps, and a probability (Q-Q) plot shows some dog-leg bends that may indicate a mixture of three possible groups. See the attached graphs for this. If this is true, you will need to identify these groups and apply weighting. [Attachment: repair cost.jpg]
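
The attached graphs are not reproduced here. A rough way to draw similar ones in base R is sketched below. The bin count, titles and axis labels are arbitrary choices for illustration, not the settings used for the attachment.

```r
# 2018 repair costs for the 35 sampled units (from the first post)
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)

# Histogram: gaps between clusters of bars can hint at separate groups
h <- hist(dat, breaks = 10,
          main = "2018 repair cost per unit", xlab = "Repair cost")

# Normal Q-Q plot: bends away from the reference line can hint at
# a mixture of distributions rather than a single one
qq <- qqnorm(dat, main = "Normal Q-Q plot of repair costs")
qqline(dat)
```

With only 35 points, apparent gaps and bends can also arise by chance, so plots like these suggest groups rather than prove them.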
 
#4
Can you tell us why you can't just use all the data?
Hi hlsmith, the "units" in this case are apartments. For the purposes of this problem, the idea is that a real estate management company has 9316 units (apartments) that they are responsible for. Because the company only has one inspector, and the apartments are spread out, they only have the resources to physically check 35 apartments a year (the inspector is pretty slow, apparently). They are hoping to use this sample to estimate how much $$ they should budget for maintenance over all 9316 apartments, with a 95% confidence interval. Hope that clears things up!

Looking forward to hearing what you think

Mike
 
#5
Thanks so much, Miner. It certainly does look like there are groups within the sample. Does the explanation I gave to hlsmith help to clarify things at all?
 

hlsmith

Omega Contributor
#6
I wonder if an acceptable approach might be to bootstrap the sample of 35 with replacement to create a new sample of 9316 and sum the costs. I would do this, say, 10,000 times, then sort the sums in ascending order; the 250th and 9750th values would then give a 95% percentile confidence interval.
 
#7
That's an interesting idea. Let me take a crack at it in SAS and see what pops out.
 

Miner

TS Contributor
#8
Are the apartments of differing sizes (e.g., 1, 2 or 3 bedrooms), or quality? Or are there different levels of repair (e.g., painting, plumbing, HVAC)? Any of these might explain the grouping.
 
#9
They are. There is a variety of different kinds of apartments in different states of repair. The sample tried to randomly take units from across the whole spectrum.
 

Dason

Ambassador to the humans
#10
By that do you mean that it was purely a random sample, or did they somehow put the units into groups beforehand and try to make sure to have some from each group?
 

Dason

Ambassador to the humans
#16
The code...
Code:
# Your data
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)

# How many total apartments to sample for
N <- 9316

# My code as a one-liner (which is how I wrote it but...)
sums <- replicate(10000, sum(sample(dat, N, replace = TRUE)))

# Breaking it up: first we can generate a sample
# of size N, with replacement, using
sample(dat, N, replace = TRUE)

# What we are really interested in is the sum of these samples
# which is the total repair cost
# so we'll just wrap it all in sum
sum(sample(dat, N, replace = TRUE)

# That gets us a single sample from the "total cost" distribution
# we can use 'replicate' to easily run that code for us multiple times
# (in our case we ask it to run 10000 times)
# and we'll store the result in 'sums'
sums <- replicate(10000, sum(sample(dat, N, replace = TRUE)))

# summarize that distribution
summary(sums)
# if we want an interval that gets us
# the middle 95% of this distribution...
quantile(sums, c(.025, .975))
It's much longer than it needs to be but I added some comments and explanations for your sake.
 
#17
The code...
Code:
# This is pretty close to what we would get if
# we just appealed to the central limit theorem
# and assumed our sample mean and standard deviation
# are the 'truth'
N*mean(dat) + qt(c(.025, .975), length(dat)-1)*sd(dat)*sqrt(N)
It's much longer than it needs to be but I added some comments and explanations for your sake.
This is fantastic. Thank you so much. I just ran it and it looks like it worked perfectly, with one exception. I got this error:

+ # That gets us a single sample from the "total cost" distribution
+ # we can use 'replicate' to easily run that code for us multiple times
+ # (in our case we ask it to run 10000 times)
+ # and we'll store the result in 'sums'
+ sums <- replicate(10000, sum(sample(dat, N, replace = TRUE)))
Error: unexpected symbol in:
" # and we'll store the result in 'sums'

I'm still pretty new at R syntax and can't seem to find the "unexpected symbol". Either way, this was incredibly helpful, and one of my first times using R so a terrific intro. Thank you.

Mike
 

Dason

Ambassador to the humans
#18
I see. Yeah I missed a closing parenthesis in the line

Code:
sum(sample(dat, N, replace = TRUE)
so it was treating everything after that as part of that call. Add the closing paren and it should be fine. One hint: when R is still waiting for more input before it evaluates anything (which is nice, because it means our code doesn't always have to fit on one line), you'll notice that the line starts with "+" instead of being blank or ">".

I didn't actually rerun anything after the one-liner; I only wrote it out to add comments explaining what was going on, since you haven't used R (much) before. Guess I should have checked that it actually ran, haha.
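
A small way to see the same thing without touching the console is to hand R a broken snippet as text and ask it to parse it. This is just an illustrative sketch: the two-line strings `broken` and `fixed` are made up here, but they contain the same kind of mistake (a missing closing paren followed by another line of code).

```r
# Hypothetical two-line snippet; the first line lacks its closing paren
broken <- "sum(sample(1:10, 5, replace = TRUE)\ntotals <- 1"

# parse() only checks syntax, it does not run the code.
# Capture the error message instead of stopping.
msg <- tryCatch(
  {
    parse(text = broken)
    "parsed fine"
  },
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")

# Same snippet with the paren added: now it parses
fixed <- "sum(sample(1:10, 5, replace = TRUE))\ntotals <- 1"
length(parse(text = fixed))  # number of complete expressions found
```

Because the open paren is never closed, R keeps reading the next line as part of the same call, and the first name it meets there is reported as the "unexpected symbol".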
 
#19
Perfect. Then when we summarize, the mean is the mean of all 10,000 sums and the quantiles are my confidence limits. In this case:

Mean total repair cost = 260,400,000
Lower CL = 258,033,216
Upper CL = 262,853,966

which works pretty well with my "calculate for 1 then multiply by 9316" calculation that gave me

Mean total repair cost = 260,392,580.66
Lower CL = 221,214,943
Upper CL = 299,570,217

So the bootstrap CLs are much tighter than the ones I had before.

Does that look right to you?

Thanks again that was a huge help.

Mike
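
One way to look at the gap between the two sets of limits, offered as a sketch rather than a verdict: each bootstrap replicate draws N = 9316 values, so the spread of its sums behaves like sd * sqrt(N), whereas scaling the CI for the mean gives a spread of N * sd / sqrt(n). The ratio of those two is sqrt(N / n). The check below uses only the numbers posted above; the names `boot_ci`, `scaled_ci` and `half_width` are made up for illustration.

```r
N <- 9316  # population size (from the thread)
n <- 35    # sample size (from the thread)

# Limits as posted above
boot_ci   <- c(258033216, 262853966)  # bootstrap, N draws per replicate
scaled_ci <- c(221214943, 299570217)  # CI for the mean, multiplied by N

# Half the distance between the lower and upper limit
half_width <- function(ci) diff(ci) / 2

ratio_observed <- half_width(scaled_ci) / half_width(boot_ci)
ratio_expected <- sqrt(N / n)

ratio_observed
ratio_expected
```

The two ratios come out close to each other, which suggests the intervals answer different questions. The bootstrap as coded treats the 35 values as if they were the true cost distribution (the same assumption noted in the central limit theorem comment in the code), while the scaled interval reflects the uncertainty that comes from having seen only 35 units.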
 

hlsmith

Omega Contributor
#20
Rightcoast, midcoast here. Welcome to the forum!

I would add the following to @Dason's code. The seed ensures you get the same output each time you rerun it, and hist() produces a histogram of the bootstrap sums.

Code:
dat <- c(10277, 33615, 23442, 11220, 41321, 40801, 20896, 44753, 28659,
         19753, 28760, 24537, 20536, 20959, 5693, 8290, 28715, 41550,
         18459, 49197, 28955, 46149, 25273, 45867, 24716, 43519, 27884,
         37714, 8001, 42151, 43197, 27245, 31736, 9503, 14946)
N <- 9316
set.seed(42)
sums <- replicate(10000, sum(sample(dat, N, replace = TRUE)))
summary(sums)
quantile(sums, c(.025, .975))
hist(sums)
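
As a quick illustration of what the seed does, here is a toy sketch. The helper `draw_total` and the range 1:100 are made up for this example and have nothing to do with the repair cost data.

```r
# Hypothetical helper: draw five values with replacement and sum them
draw_total <- function() sum(sample(1:100, 5, replace = TRUE))

set.seed(42)
first <- draw_total()

set.seed(42)            # resetting the seed restarts the random stream
second <- draw_total()

third <- draw_total()   # no reset here, so the stream just continues

identical(first, second)  # same seed, same draw
c(first, second, third)
```

The same idea applies to the bootstrap: rerunning the whole script with the same seed should reproduce the same sums and the same quantiles.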
Also, I wrote the SAS code for you as well. Depending on your machine, this takes a while to run.

Code:
data dat;
input Costs;
datalines;
10277
33615
23442
11220
41321
40801
20896
44753
28659
19753
28760
24537
20536
20959
5693
8290
28715
41550
18459
49197
28955
46149
25273
45867
24716
43519
27884
37714
8001
42151
43197
27245
31736
9503
14946
;
/* Draw 10,000 bootstrap replicates of 9316 units each, with replacement */
proc surveyselect data=dat
    method=urs
    sampsize=9316
    rep=10000
    seed=42
    out=boot_dat
    outhits;
  id costs;
run;

/* Sum the costs within each replicate */
proc means data = boot_dat noprint;
  var costs;
  class Replicate;
  output out= wanted_sums
    sum(costs)  = sum_costs;
run;

/* Take the 2.5th and 97.5th percentiles of the replicate sums */
proc univariate data=wanted_sums noprint;
  where _TYPE_ NE 0;   /* drop the overall (all replicates) row */
  var sum_costs;
  output out=Pctl pctlpre =CI95_
    pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
    pctlname=Lower Upper;
run;

proc print data=Pctl noobs; run;

/* Histogram of the replicate sums */
proc sgplot data=wanted_sums;
  where _TYPE_ NE 0;
  label sum_costs= ;
  histogram sum_costs;
run;
 