System Response Time Alerts - Choosing a Statistical Method

It's been a long time since I had to use statistics in anger, so please bear with me.

I have a test running that measures the response time of an IT System.

I would like to come up with some alerting if the response times indicate that there is an issue.

I think the most useful single descriptive statistic to base an alert on would be the response time we'd expect to meet, say, 99% of the time.

Is that called a confidence interval?

How can I choose a probability distribution that represents the sample of data I have (the test results)? I think a Poisson distribution is called for, but I don't know why I think that.

Unfortunately (maybe), the data is pretty "rough". I can only measure the response time to 1 second resolution, and I will only have about 40 test results in each grouping I want to report on.

Is the best thing to post some example data here?

Hi, as far as I understand, you have 40 measurements of how long (in seconds) the system takes to respond when everything is working well. Based on these data you want to derive a time threshold, so you can say there is probably something wrong when the system takes too long to respond. Am I right? In that case I think it is difficult to assume an a priori distribution. The first step would be to look at the data (e.g. via a histogram). You can then decide which distribution fits best; the normal distribution and indeed the Poisson distribution are both reasonable candidates. Then you fit that distribution and calculate the corresponding quantile.
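As a rough sketch of that fit-then-quantile idea (in Python rather than R; the gamma-generated sample here is made up purely for illustration, not your data):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 40 response times rounded to whole seconds.
rng = np.random.default_rng(42)
data = np.round(rng.gamma(shape=4.0, scale=1.5, size=40))

# Step 1: look at the data (a text histogram stands in for a plot here).
values, counts = np.unique(data, return_counts=True)
for v, c in zip(values, counts):
    print(f"{int(v):3d}s | {'#' * c}")

# Step 2: fit candidate distributions.
mu, sigma = stats.norm.fit(data)   # normal: estimated mean and sd
lam = data.mean()                  # Poisson: the MLE of the rate is the sample mean

# Step 3: compute the 99% quantile under each fitted model.
q_norm = stats.norm.ppf(0.99, loc=mu, scale=sigma)
q_pois = stats.poisson.ppf(0.99, mu=lam)
print(f"99% quantile (normal fit):  {q_norm:.1f}s")
print(f"99% quantile (Poisson fit): {q_pois:.0f}s")
```

Whichever model looks most plausible against the histogram gives you the alert threshold; with only 40 points, the two fits can disagree noticeably.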

Another possibility would be to calculate the 99% quantile directly from your data (in R this would be "quantile(data, probs = 0.99)"), which does not require assuming a particular distribution because it uses the empirical data directly.
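The same empirical approach in Python might look like this (the 40 values below are a fabricated example, not real measurements):

```python
import numpy as np

# Hypothetical sample of 40 response times in whole seconds.
data = np.array([3, 4, 2, 5, 3, 4, 6, 3, 2, 4,
                 5, 3, 4, 3, 7, 4, 3, 5, 4, 3,
                 2, 4, 6, 3, 4, 5, 3, 4, 2, 5,
                 4, 3, 6, 4, 3, 5, 4, 8, 3, 4])

# Empirical 99% quantile, analogous to R's quantile(data, probs = 0.99).
threshold = np.quantile(data, 0.99)
print(f"Alert threshold: {threshold:.2f}s")
```

One caveat: with only 40 observations, the empirical 99% quantile is determined almost entirely by the one or two largest values in the sample, so it will be quite noisy from batch to batch.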