# Thread: when to use the median and when the mean ?

1. ## Re: when to use the median and when the mean ?

If the population distribution is symmetric and the population mean exists, then the mean equals the population median, and both the sample mean and the sample median are consistent estimators.

In that situation, one advantage of the sample mean over the sample median is that it is often the more efficient estimator, for example when the population is normal. Of course, the sample median is more robust, as mentioned by many people above.

2. ## Re: when to use the median and when the mean ?

Yes - one property that the mean has is that it gives the smallest sum of squared deviations. The proof isn't difficult.
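A sketch of that proof, in notation that is not from the post above ($x_1, \dots, x_n$ for the sample, $\bar{x}$ for its mean, and $c$ for any candidate center):

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - c)^2
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - c)\sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - c)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - c)^2 ,
\end{aligned}
```

because $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. The first term does not depend on $c$ and the second is never negative, so the sum is smallest when $c = \bar{x}$.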

3. ## Re: when to use the median and when the mean ?

When you say the sample mean is a more efficient estimator, what exactly do you mean?

Originally Posted by BGM
If the population distribution is symmetric and the population mean exists, then the mean equals the population median, and both the sample mean and the sample median are consistent estimators.

In that situation, one advantage of the sample mean over the sample median is that it is often the more efficient estimator, for example when the population is normal. Of course, the sample median is more robust, as mentioned by many people above.

4. ## Re: when to use the median and when the mean ?

http://en.wikipedia.org/wiki/Efficie...tic_efficiency

In short, the estimator with the smaller asymptotic variance is the better, more efficient one.
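A small simulation can make the idea concrete. The sketch below assumes a standard normal population; the function name, sample size, repetition count, and seed are illustrative choices, not anything from the thread. For normal data the variance of the sample median should come out roughly π/2 ≈ 1.57 times that of the sample mean in large samples.

```python
import random
import statistics


def estimator_variances(n=101, reps=2000, seed=0):
    """Simulate the sampling variance of the mean and of the median
    for standard normal samples of size n (assumed population)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # Variance of each estimator across the repeated samples.
    return statistics.variance(means), statistics.variance(medians)


var_mean, var_median = estimator_variances()
print(f"variance of sample mean:   {var_mean:.5f}")
print(f"variance of sample median: {var_median:.5f}")
print(f"ratio (median / mean):     {var_median / var_mean:.2f}")
```

The ordering is specific to the assumed population: for heavy-tailed distributions such as the Laplace, the sample median can be the more efficient of the two.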

5. ## Re: when to use the median and when the mean ?

Ah. See most people talk about the median being more robust to outliers because we don't typically care if the stuff in the middle is shifted slightly. But consider the following case:

Dataset 1: 1, 2, 3, 4, 5
Dataset 2: 1, 2, 3, 4, 10000000

The median in both cases is 3. The mean in the first case is 3 and the mean in the second case is... 2000002. Quite a difference. That one outlier significantly changes the mean but the median isn't changed in this case.
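The numbers above can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable names are made up for illustration.

```python
import statistics

dataset_1 = [1, 2, 3, 4, 5]
dataset_2 = [1, 2, 3, 4, 10000000]

for name, data in (("Dataset 1", dataset_1), ("Dataset 2", dataset_2)):
    # The median only looks at the middle value; the mean uses every value.
    print(name, "median =", statistics.median(data),
          "mean =", statistics.mean(data))
```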

6. ## Re: when to use the median and when the mean ?

I think it is because, unless the population distribution is perfectly symmetrical or its size is reasonably large, the median cannot indicate the center point. Consider the small asymmetries in a small population as micro-outliers that keep the median from working perfectly, unless the number of micro-outliers with positive and negative effects on the location of the median gets very high (so that they neutralize each other) or drops to zero. However, those micro-outliers don't distort the mean, because in calculating the mean all of the values are actually summed up, whereas in calculating the median we only check which value is in the middle.

--------

Originally Posted by Dason
You don't define what you mean by micro-outliers, but your whole post just feels wrong. The median, as tokai points out, is better when there are outliers because the outliers don't affect it as much.

Thanks Dason (and other guys).

I was wrong about the center point in the first place! I thought it was where the apex of the population stands (closer to the mean), but apparently it is the point that splits the ordered values in half, regardless of the distribution shape (exactly where the median stands!). Based on what I learned here (that the center is important regardless of distribution shape), I think neither can replace the other and both should always be reported (plus other quantiles, such as the quartiles, maybe) to show both the center and the apex of the population.

By micro-outliers, I meant small deviations from the normal distribution that don't change the distribution enough for a KS test to flag the population as non-normal. I think the mean is better because it can convey the effect of these small changes. For example, between the samples [1, 2.5, 3, 5.5, 7] and [1, 2, 3.5, 5, 7], the change in the median (3 to 3.5) depends only on which number is in the middle (so when the changes are small, it does not depend on the other values). However, the change in the mean (3.8 to 3.7) reflects "all" of those small changes and therefore might be more relevant to changes in the population's properties.
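The shifts quoted for those two samples can be reproduced as follows. This is a sketch using the standard `statistics` module; the names `sample_a` and `sample_b` are just labels for the two samples.

```python
import statistics

sample_a = [1, 2.5, 3, 5.5, 7]
sample_b = [1, 2, 3.5, 5, 7]

# Shift in each summary when going from the first sample to the second.
median_shift = statistics.median(sample_b) - statistics.median(sample_a)
mean_shift = statistics.mean(sample_b) - statistics.mean(sample_a)

print("median shift:", round(median_shift, 3))
print("mean shift:  ", round(mean_shift, 3))
```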

------------

Thanks. OK, I got the strength of the median against outliers from your example.

However, consider this one:

1, 2, 3, 4, 1000000, 2000000000, 300000000000000

In the above sample, the median (= 4) would tell us nothing useful about the population (I think these huge values are not considered outliers here because they apparently follow a regular pattern of increase on a logarithmic scale [also the small values are not outliers here]).

------------------------

and if we compare the above population with this one:
1, 2, 3, 4, 200000000000, 5000000000000000000, 90000000000000000000000000

The median (still = 4) is unable to show the huge (and important) changes from population #1 to population #2, but the mean can show the changes properly.
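A quick check of the two populations, again with the standard `statistics` module (the variable names are only for illustration):

```python
import statistics

population_1 = [1, 2, 3, 4, 1000000, 2000000000, 300000000000000]
population_2 = [1, 2, 3, 4, 200000000000, 5000000000000000000,
                90000000000000000000000000]

for name, data in (("population #1", population_1),
                   ("population #2", population_2)):
    # The median stays at the 4th ordered value; the mean follows the
    # largest values.
    print(f"{name}: median = {statistics.median(data)}, "
          f"mean = {statistics.mean(data):.3e}")
```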

----------------------

About the importance of slight shifts of the stuff in the middle, I think they might actually be important, at least in many fields of medicine (drugs etc.). Many clinical studies that need participants have rather small sample sizes. In such a sample, I would care about every slight change in the average value to find something valuable (is that a source of bias?). This is especially true when a toxin or drug is being studied, which forces the researcher to look for every small change.

7. ## Re: when to use the median and when the mean ?

But I was more or less just giving an example of why we refer to median as more robust against outliers and why it tends to be preferred when there are outliers. One could argue that 4 is still a better estimate of 'centrality' for both of those datasets you provide. It's just that centrality might not be the only interesting thing about a dataset.

On the topic of means and medians - there are nice properties for both estimators - but really they're estimating slightly different things. Which one we care about depends on the situation. If you want, you could use any one of the other estimators of 'centrality parameters'. A few have been mentioned: trimmed mean, winsorized mean, trimean. You could also estimate the mode of the distribution by some method (if it's continuous, create the kernel density estimate and then use the peak to estimate the mode, if it ends up unimodal).
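A minimal sketch of three of those estimators in plain Python follows. The function names and the parameter `k` (how many values to trim or clip at each end) are hypothetical, and the quartile convention used for the trimean (`method="inclusive"`) is one assumption among several in common use.

```python
import statistics


def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(data, k):
    """Mean after pulling the k smallest and k largest values in to the
    nearest remaining value instead of dropping them."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    low, high = ordered[k], ordered[-k - 1]
    clipped = [min(max(x, low), high) for x in ordered]
    return statistics.mean(clipped)


def trimean(data):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


dataset_2 = [1, 2, 3, 4, 10000000]
print("trimmed mean (k=1):   ", trimmed_mean(dataset_2, 1))
print("winsorized mean (k=1):", winsorized_mean(dataset_2, 1))
print("trimean:              ", trimean(dataset_2))
```

On the outlier dataset from earlier in the thread, all three land near the median rather than near the mean.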

Really though for these small datasets it's hard to claim that there is a 'right' measurement of centrality. If you want more discussion I know there are quite a few threads at the stats stack exchange site.

8. ## The Following User Says Thank You to Dason For This Useful Post:

victorxstc (01-08-2012)