Hello folks. I look after some servers at work, one of which is intermittently failing. In the 38 days since I set up monitoring, there have been ten failures, and five of those failures have occurred between 4am and 5am.
That leads me to assume that something is happening around that time to make it more likely for the server to fail. I'm curious about how one would arrive at that conclusion statistically (or indeed whether my intuition actually turns out to be wrong - that it's entirely within the bounds of probability that 5/10 failures could fall within the same hour of the day).
I'm more interested in the reasoning than in the conclusions (if I can arrive at some numbers to impress my colleagues with, it's a bonus, but my job isn't to publish statistics, it's to get the server to work). I've forgotten most of what I learned in my basic college stats courses, but I don't think I ever really learned very much - I learned how to plug numbers into formulae, but I never really learned how to choose the right statistical tools for a given real-world problem.
The first thing I'd think of would be to compare the results to rolling a 24-sided die ten times and the number 4 coming up five times. If that happened, one would suspect that the die was rigged to favour that number. So my first question is: is my reasoning ok on this - there are things that don't work in the analogy, it's not like the server decides to fail on a given day and then picks a number between 1 and 24 to decide what hour to fail in. And my second question is, do the numbers show that the "die" is "rigged" in this case?
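As a rough sketch of the die analogy, the chance of a fair 24-sided die showing one pre-chosen face at least five times in ten rolls is a binomial tail. The names below (`binomial_tail`, `tail_one_hour`, `tail_any_hour`) are made up for illustration. Because the 4am hour was only singled out after looking at the data, the sketch also multiplies by 24 as a conservative upper bound on "some hour, whichever it is, collects five or more failures".

```python
from math import comb

def binomial_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

n_failures = 10   # ten observed failures, treated as ten rolls
p_hour = 1 / 24   # a fair die: every hour of the day equally likely

# Chance that one hour chosen in advance gets five or more of the ten failures.
tail_one_hour = binomial_tail(n_failures, p_hour, 5)

# Conservative (union) bound for "any of the 24 hours gets five or more".
tail_any_hour = min(1.0, 24 * tail_one_hour)

print(f"one pre-chosen hour: {tail_one_hour:.2e}")
print(f"any hour (upper bound): {tail_any_hour:.2e}")
```

Both numbers come out very small, which is the sense in which the "die" looks rigged, always under the assumption that each failure picks its hour independently.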
The second thing I'd think of would be to somehow work in a piece of information I've dropped with the dice analogy - the fact that I've been monitoring the server for 38 days. That's 24*38=912 hours, and there have been ten failures, so there's a probability of 10/912=0.01096 of the server failing in a given hour (this is probably complicated by the fact that there's nothing stopping the server failing more than once in the same hour, though this hasn't happened yet). It's been the hour of 4-5am 38 times, and each of those times there's been a 1% chance (rounded off) of a failure. I can remember enough to know that if failures aren't dependent on time of day, there should be a 0.99^38=68% chance of no failures in that hour during the monitoring period. How do I work out the chance of five or more failures, or am I on the right track at all?
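The second approach can be sketched the same way, this time treating each of the 38 occurrences of the 4-5am hour as a trial with failure probability 10/912. This assumes at most one failure per hour and independence between days, both of which are simplifications; the variable names are illustrative only.

```python
from math import comb

days = 38
p_fail_per_hour = 10 / (24 * days)  # 10 failures over 912 monitored hours

def binomial_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance of no failure at all in the 4-5am slot across the 38 days.
p_zero = binomial_pmf(0, days, p_fail_per_hour)

# Chance of five or more failures in that slot: one minus P(0..4).
p_five_or_more = 1 - sum(binomial_pmf(k, days, p_fail_per_hour) for k in range(5))

print(f"P(no failures in the slot)   = {p_zero:.3f}")
print(f"P(five or more in the slot) = {p_five_or_more:.2e}")
```

Using the unrounded 10/912 instead of 1% gives a slightly lower "no failures" figure than 0.99^38, but the same overall picture.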
I'm sure there's a clearer way of reasoning about the problem than the two approaches I've proposed. This must be a very simple problem if you've got the right tools and know how to apply them.
Thanks!
Wikipedia / Google the "Poisson Distribution". I suspect it might get you in the right direction.
Thanks, definitely on the right track with the Poisson distribution. This is going to be very useful to me.
The examples I've seen are about reasoning forward from a set of assumptions about homogeneity and statelessness to a set of conclusions. If I assume that server failures are no more likely to happen at one time than another (which is the thing I'm not convinced about yet) and that one failure doesn't cause other failures (which I think is a reasonable assumption in this particular case), then I can plug numbers into the distribution to predict the probability of multiple server failures in the same hour.
I'm still a bit confused about two things. One is how to reason backwards - if I get certain results, how can I come to a reasonable conclusion that the events aren't Poisson distributed, in this case that failures aren't homogeneous - that something is happening at 4-5 am every day to make failures more likely there than elsewhere.
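One way to reason backwards without committing to a formula is to simulate the "nothing special about any hour" assumption and see how often it produces a cluster as extreme as the observed one. The sketch below is only an illustration: the function names, the trial count and the seed are arbitrary choices, and it assumes each failure lands in a uniformly random hour independently of the others.

```python
import random

def max_hour_count(n_failures, rng):
    """Drop n_failures into 24 equally likely hours; return the fullest hour's count."""
    counts = [0] * 24
    for _ in range(n_failures):
        counts[rng.randrange(24)] += 1
    return max(counts)

def simulate(n_failures=10, threshold=5, trials=200_000, seed=0):
    """Estimate P(some hour collects >= threshold failures) under homogeneity."""
    rng = random.Random(seed)
    hits = sum(max_hour_count(n_failures, rng) >= threshold for _ in range(trials))
    return hits / trials

p_value_estimate = simulate()
print(f"estimated chance of a cluster this large: {p_value_estimate:.5f}")
```

If that estimated chance is tiny, the homogeneous model is a poor explanation of the data, which is the usual logic of a significance test.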
The second thing I'm unsure about is how I divide up the hours. I know now how to predict server failures in a given hour on a given day - I've had ten failures in 38 days, which means a 0.01096 probability of a failure in a given hour on a given day, and plugged into a Poisson distribution that gives a 98.9% chance of no failures in an hour, a 1.1% chance of one failure, and the other numbers not really big enough to bother about.
But I'm grouping by hour of _any_ day - five of my failures have been between 4-5 am on different days of last month. Stop me if I'm doing anything wrong here: a 10/24=0.42 probability of a server failure in a given hour-of-the-day in my 38 day monitoring period. Plugged into Poisson, this gives 0:65.9%, 1:27.5%, 2:5.7%, etc. Cumulatively, there's a 99.993% chance of four or fewer failures in a given hour-of-the-day. So if I get five failures, is it reasonable to conclude that there's almost no chance that this could have happened if failures are homogeneous?
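The hour-of-the-day figures can be reproduced with a few lines. This is a sketch assuming a Poisson model with an expected 10/24 failures per hour-of-the-day slot (strictly an expected count rather than a probability); the helper names are invented for the example. The last line applies a factor of 24 as a cautious bound, since the 4-5am slot was picked out only after seeing where the failures fell.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X == k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 10 / 24  # expected failures per hour-of-the-day slot over the whole period

for k in range(3):
    print(f"P({k} failures) = {poisson_pmf(k, lam):.1%}")

p_at_most_4 = poisson_cdf(4, lam)
p_at_least_5 = 1 - p_at_most_4

# Cautious bound for "some slot out of 24 shows five or more failures".
p_any_slot_bound = min(1.0, 24 * p_at_least_5)

print(f"P(four or fewer) = {p_at_most_4:.3%}")
print(f"P(five or more)  = {p_at_least_5:.2e}")
print(f"any-slot bound   = {p_any_slot_bound:.2e}")
```

Even after the factor of 24, the bound stays well under one percent, so under these assumptions five failures in one slot would be hard to put down to chance.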
Thanks again.