Yes - one property of the mean is that it minimizes the sum of squared deviations: no other value c gives a smaller sum of (x_i - c)^2. The proof isn't difficult.
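For anyone who wants to check this numerically, here is a minimal Python sketch. The data values and the helper name `sum_sq` are made up for illustration. The underlying reason is the identity sum (x_i - c)^2 = sum (x_i - mean)^2 + n (mean - c)^2, so moving c away from the mean can only add to the sum.

```python
from statistics import mean

def sum_sq(data, c):
    """Sum of squared deviations of the data from a candidate centre c."""
    return sum((x - c) ** 2 for x in data)

data = [1, 2, 3, 4, 10]            # arbitrary illustrative values
m = mean(data)                     # 4 for this data

# Try a few candidate centres around the mean and keep the best one.
candidates = [m - 1, m - 0.1, m, m + 0.1, m + 1]
best = min(candidates, key=lambda c: sum_sq(data, c))

print("mean:", m, "best candidate:", best)
print("sum of squares at the mean:", sum_sq(data, m))
```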
If the population distribution is symmetric and the population mean exists, then the mean equals the population median, and both the sample mean and the sample median are consistent estimators.
In such a situation, one advantage of the sample mean over the sample median is that it is often a more efficient estimator (for normally distributed data, for example). Of course, the sample median is more robust, as many people mentioned above.
http://en.wikipedia.org/wiki/Efficie...tic_efficiency
In short, the estimator with the smaller asymptotic variance is the better, more efficient one.
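A rough way to see this is to simulate it. The sketch below assumes standard normal data, and the sample size, number of repetitions, and seed are arbitrary choices of mine. For normal data the variance of the sample median is, asymptotically, about pi/2 (roughly 1.57) times that of the sample mean, so the ratio printed at the end should land somewhere near that value.

```python
import random
from statistics import fmean, median, pvariance

random.seed(0)                     # fixed seed so the run is repeatable

n, reps = 101, 2000                # sample size and number of simulated samples
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    means.append(fmean(sample))
    medians.append(median(sample))

# Spread of each estimator across the simulated samples.
var_mean = pvariance(means)
var_median = pvariance(medians)
ratio = var_median / var_mean

print("variance of sample mean:  ", var_mean)
print("variance of sample median:", var_median)
print("ratio (median / mean):    ", ratio)
```

For heavier-tailed distributions the comparison can go the other way, which is the robustness point made elsewhere in this thread.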
Ah. See, most people talk about the median being more robust to outliers because we typically don't care if the stuff in the middle is shifted slightly. But consider the following case:
Dataset 1: 1, 2, 3, 4, 5
Dataset 2: 1, 2, 3, 4, 10000000
The median in both cases is 3. The mean in the first case is 3 and the mean in the second case is... 2000002. Quite a difference. That one outlier significantly changes the mean but the median isn't changed in this case.
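The numbers above are easy to reproduce with Python's standard `statistics` module. This is only a quick check; the variable names are mine.

```python
from statistics import mean, median

dataset1 = [1, 2, 3, 4, 5]
dataset2 = [1, 2, 3, 4, 10000000]

for name, data in (("Dataset 1", dataset1), ("Dataset 2", dataset2)):
    print(name, "mean =", mean(data), "median =", median(data))
```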
I think it is because, unless the population distribution is perfectly symmetrical or its size is reasonably large, the median cannot indicate the center point. Think of the small asymmetries in a small population as micro-outliers that keep the median from working perfectly, unless the micro-outliers pushing the median up and those pushing it down are numerous enough to neutralize each other, or there are none at all. Those micro-outliers don't throw off the mean, though, because calculating the mean actually sums all of the values, whereas calculating the median only checks which value is in the middle.
--------
Thanks Dason (and other guys).
I was wrong about the center point in the first place! I thought it was where the apex of the population stands (closer to the mean), but apparently it is somewhere in the middle between the maximum and minimum regardless of the distribution shape (exactly where the median stands!). Based on what I learned here (that the center is important regardless of distribution shape), I think neither can replace the other and both should always be reported (plus other quartiles, maybe) to show both the center and the apex of the population.
By micro-outliers, I meant small deviations from the normal distribution that don't change the distribution enough for a KS test to flag the population as non-normal. I think the mean is better because it can convey the effect of these small changes. For example, in the samples [1, 2.5, 3, 5.5, 7] and [1, 2, 3.5, 5, 7], the change in the median (3 to 3.5) depends only on which number is in the middle (so when the changes are small, it does not depend on the other values). The change in the mean (3.8 to 3.7), however, reflects "all" of those small changes and therefore might be more relevant to changes in the population's properties.
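To make the arithmetic concrete, here is a small Python check of the two samples. The third sample `c` is a hypothetical one I added: it changes only the non-middle values of the first sample, so the median stays put while the mean moves.

```python
from statistics import mean, median

a = [1, 2.5, 3, 5.5, 7]
b = [1, 2, 3.5, 5, 7]
c = [1, 2, 3, 5, 7]                # hypothetical: same middle value as a, others shifted

for name, data in (("a", a), ("b", b), ("c", c)):
    print(name, "mean =", round(mean(data), 2), "median =", median(data))
```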
------------
Thanks. OK, I see the strength of the median against outliers from your example.
However, consider this one:
1, 2, 3, 4, 1000000, 2000000000, 300000000000000
In the above sample, the median (= 4) would tell us nothing useful about the population. (I think these huge values are not outliers here because they apparently follow a regular logarithmic-like pattern of increase; the small values are not outliers either.)
------------------------
And if we compare the above population with this one:
1, 2, 3, 4, 200000000000, 5000000000000000000, 90000000000000000000000000
The median (still = 4) is unable to show the huge (and important) changes from population #1 to population #2, but the mean shows them properly.
----------------------
About the importance of slight shifts of the stuff in the middle: I think they might actually matter, at least in many fields of medicine (drugs etc.). Many clinical studies that need participants have rather small sample sizes. In such a sample, I would care about every slight change in the average value in order to find something valuable (is it a bias factor?). This is especially so when a toxin or drug is being studied, which forces the researcher to look for every small change.
Last edited by victorxstc; 01-08-2012 at 01:27 PM.
But I was more or less just giving an example of why we refer to the median as more robust against outliers and why it tends to be preferred when there are outliers. One could argue that 4 is still a better estimate of 'centrality' for both of the datasets you provided. It's just that centrality might not be the only interesting thing about a dataset.
On the topic of means and medians - there are nice properties for both estimators - but really they're estimating slightly different things. Which one we care about depends on the situation. If you want, you could use any of the other estimators of 'centrality parameters'. A few have been mentioned: trimmed mean, winsorized mean, trimean. You could also estimate the mode of the distribution by some method (if it's continuous, create the kernel density estimate and then use the peak to estimate the mode - if it ends up unimodal).
Really though, for these small datasets it's hard to claim that there is a 'right' measurement of centrality. If you want more discussion, I know there are quite a few threads at the Stats Stack Exchange site.
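As a rough illustration of the first three estimators mentioned, here is a minimal Python sketch. The function names and the parameter `k` (how many values to trim or clamp at each end) are my own choices, and quartile conventions differ between textbooks and packages, so the trimean below (using the "inclusive" method of `statistics.quantiles`) is one reasonable definition, not the only one.

```python
from statistics import mean, quantiles

def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    s = sorted(data)
    if 2 * k >= len(s):
        raise ValueError("k is too large for this sample")
    return mean(s[k:len(s) - k])

def winsorized_mean(data, k):
    """Mean after clamping the k smallest/largest values to the nearest kept value."""
    s = sorted(data)
    if 2 * k >= len(s):
        raise ValueError("k is too large for this sample")
    lo, hi = s[k], s[len(s) - k - 1]
    return mean([min(max(x, lo), hi) for x in s])

def trimean(data):
    """Trimean: (Q1 + 2 * median + Q3) / 4, with quartiles from the inclusive method."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

data = [1, 2, 3, 4, 10000000]      # Dataset 2 from earlier in the thread
print("mean:           ", mean(data))
print("trimmed mean:   ", trimmed_mean(data, 1))
print("winsorized mean:", winsorized_mean(data, 1))
print("trimean:        ", trimean(data))
```

On this dataset all three give 3, the same as the median, while the plain mean is pulled up to 2000002 by the single large value.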