# A pointer to the correct approach, please?

#### GaiusScotius

##### New Member
Please excuse what may sound like a dumb question to people who understand statistics, but I'm stumped. I'm also a lawyer stretching my high school math, so please be gentle on that front ...

I suspect this question is an extension to the earlier thread entitled "Extrapolation - gearing info of a population from a sample".

I am analysing how many man-hours particular types of legal matters have taken to complete with a view to making better estimates in the future. I have a sample of about 200 that we've dealt with over the past few years, ranging from 5 to 13.5 man-hours (rounded to the nearest 15 minutes). The mean is about 9.5 man-hours.

Plotting a histogram, the distribution looks log-normal('ish). That may be just what I'm expecting to see, but I suspect it's right on the basis that there must be some minimum time greater than zero required to perform any piece of work and, as there are so many more ways for things to go wrong than right, there seems no reason to think there's some upper limit to how many man-hours the work could take (other than the client pulling the plug on the transaction).

I can see, however, that a log-normal distribution admits that any number of man-hours greater than zero is possible, but in reality there must be a hard lower limit greater than zero to how long it takes to perform any task. For example, if my flat-out, error-free typing speed is 100 words per minute, a 1,000 word letter will take me 10 minutes to type; depending on the number of typos I make it could take longer, but it can't be shorter.
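To put a number on that concern, here is a minimal Python sketch. The parameters are made up for illustration (a median of 9.5 man-hours, a log-scale standard deviation of 0.2, and a 5 man-hour floor); they are not fitted to the actual sample. It only shows that a plain log-normal always leaves some probability below whatever hard minimum is assumed.

```python
from math import log
from statistics import NormalDist


def lognormal_cdf(x, mu, sigma):
    """P(X <= x) when log(X) is normal with mean mu and std dev sigma."""
    if x <= 0:
        return 0.0
    return NormalDist(mu, sigma).cdf(log(x))


# Hypothetical parameters, chosen only for illustration (not fitted values).
mu = log(9.5)        # puts the median at 9.5 man-hours
sigma = 0.2          # assumed spread on the log scale
floor_hours = 5.0    # assumed hard minimum for this kind of work

p_below_floor = lognormal_cdf(floor_hours, mu, sigma)
print(f"P(X < {floor_hours} man-hours) = {p_below_floor:.6f}")
```

With these assumed values the probability is small but not zero, which is exactly the mismatch described above.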

Q1: Is there another distribution that more accurately describes this type of behaviour than the log-normal?

Next, I'd like to be able to make statements along the lines of "to date it has taken us on average 9.5 man-hours to perform work like this, the longest we've seen has been 13.5 man-hours and I'm XX% certain that similar work won't exceed YY man-hours".

I thought about fitting a log-normal curve to the sample data (thank you R) and calculating the value for some arbitrarily high confidence limit (say 99%), but then realised that approach would only be correct if the sample distribution was a very close fit to the population distribution. That may be true for this particular dataset as it's reasonably large, but there are other datasets where we only have a handful (10s, not hundreds) of samples. In general the assumption won't hold.
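The worry about small samples can be sketched with a short simulation. This is Python rather than R, it uses synthetic data only, and the "true" population (log-normal, median 9.5 man-hours, log-scale standard deviation 0.2) is an assumption made purely for illustration. The function `fitted_lognormal_quantile` is a hypothetical helper that fits the log-normal from the mean and sample standard deviation of the logged values.

```python
import random
from math import exp, log
from statistics import NormalDist, mean, stdev


def fitted_lognormal_quantile(hours, p):
    """Fit a log-normal from the mean and sample std dev of log(hours),
    then return the p-quantile of the fitted curve."""
    logs = [log(h) for h in hours]
    mu_hat, sigma_hat = mean(logs), stdev(logs)
    return exp(mu_hat + NormalDist().inv_cdf(p) * sigma_hat)


# Synthetic "population": assumed parameters, not estimates from real matters.
rng = random.Random(42)
TRUE_MU, TRUE_SIGMA = log(9.5), 0.2
true_q99 = exp(TRUE_MU + NormalDist().inv_cdf(0.99) * TRUE_SIGMA)


def spread_of_estimates(n, trials=2000):
    """(min, max) of the fitted 99% quantile over repeated samples of size n."""
    estimates = []
    for _ in range(trials):
        sample = [rng.lognormvariate(TRUE_MU, TRUE_SIGMA) for _ in range(n)]
        estimates.append(fitted_lognormal_quantile(sample, 0.99))
    return min(estimates), max(estimates)


for n in (10, 200):
    lo, hi = spread_of_estimates(n)
    print(f"n={n:3d}: fitted 99% point ranges from {lo:.1f} to {hi:.1f} "
          f"(true value {true_q99:.1f})")
```

Under these assumptions, the fitted 99% point from samples of 10 wanders far more than the one from samples of 200, so quoting it as if it were the population value overstates the certainty.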

If I was dealing only with the mean number of man-hours, the standard error of the mean would work, and I can build a confidence interval from it using the t-statistic:

m* ± t * s / √(n)
[m* = sample mean, s = sample standard deviation, n = number of samples, t = t-statistic for n-1 degrees of freedom]
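As a minimal sketch of that formula in Python: the ten values below are invented for illustration, and the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is taken from a standard t-table rather than computed, since the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(hours, t_crit):
    """Return (lower, upper) = m* -/+ t * s / sqrt(n)."""
    n = len(hours)
    m = mean(hours)           # sample mean, m*
    s = stdev(hours)          # sample standard deviation (n - 1 denominator)
    half_width = t_crit * s / sqrt(n)
    return m - half_width, m + half_width


# Hypothetical sample of 10 matters, in man-hours (not real data).
sample = [8.0, 9.25, 10.5, 7.75, 9.0, 11.25, 9.5, 10.0, 8.5, 12.0]

# Two-sided 95% critical value of t with n - 1 = 9 degrees of freedom,
# looked up in a t-table.
T_95_DF9 = 2.262

lower, upper = mean_confidence_interval(sample, T_95_DF9)
print(f"95% interval for the mean: {lower:.2f} to {upper:.2f} man-hours")
```

Note that this interval describes where the population mean is likely to sit, not where an individual piece of work is likely to land.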

I could then make statements like "I am 99.7% certain that the mean (of the population) will not exceed the (sample) mean plus three times the standard error of the (sample) mean".

However, when it comes to calculating an equivalent to the standard error of the mean for an arbitrary confidence level, and beyond appreciating that it must in some manner incorporate both error in the (sample) mean and error in the (sample) variance, I'm completely stuck. I'm also concerned that I may be chasing a wild goose...

Q2: Is there a statistic I can use?

Thanks.