What kind of distribution am I dealing with (and how can I identify outliers)?

slim

New Member
#1
I'm looking at a bunch of data involving how long it takes for an item to get from point A to point B. I have roughly 39,000 items and the associated time for each.

The problem (or maybe it's not a problem?) is that this data is heavily right-skewed. For example, about 1000 of the items got to point B on the same day they left point A, about 7500 took one day, and roughly 9000 took 2 days (over 50% took 5 days or less)... but then I have some items that took 100, 200, 300, even 600+ days to make it from point A to point B.

What kind of distribution am I dealing with here, and how can I identify outliers in such a population?
 

CowboyBear

Super Moderator
#2
Hi Slim, welcome.

Unfortunately your post was automatically picked up by our spam filter for some reason; I've released it now. Sorry about that.

To answer your question:
There are distributions that describe continuous variables bounded to be non-negative. However, with a sample of 39,000, conventional parametric tests are robust enough to non-normal errors that this would likely be a non-issue. What are you intending to do with the data specifically, though?

Re. outliers: You could still look at cases more than some threshold number of SDs above the mean, but personally I don't recommend subjectively deleting outliers unless it's clear that a case represents a genuine measurement or recording mistake.
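
As a rough illustration of the SD-threshold idea, here is a minimal Python sketch. The function name `flag_high_values`, the cutoff `k=3.0`, and the small `transit_days` list are all made up for the example (they are not the real 39,000 values), so treat it as a starting point rather than a recommended rule.

```python
from statistics import mean, stdev

def flag_high_values(values, k=3.0):
    """Return (threshold, flagged), where flagged holds values above mean + k * SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    threshold = m + k * s
    flagged = [v for v in values if v > threshold]
    return threshold, flagged

# Hypothetical transit times in days: mostly short, with a long right tail.
transit_days = (
    [0] * 10 + [1] * 75 + [2] * 90 + [3] * 60 + [5] * 40 + [10] * 20
    + [100, 200, 300, 600]
)

threshold, flagged = flag_high_values(transit_days, k=3.0)
print(f"threshold: {threshold:.1f} days")
print(f"flagged for review: {flagged}")
```

Note that the extreme values themselves pull the mean and SD upward, so in heavily skewed data this kind of cutoff can miss cases you might still want to look at (the 100-day item is not flagged here). The flagged cases are candidates to check for recording mistakes, not values to delete automatically.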