Statistical equivalence and the definition of range


I have a practical problem in the handling of an empirical dataset where normal distribution cannot be assumed.

Let's say my dataset of values, each with a standard deviation, has a distribution that fails a test of equivalence (such as a chi-square test). I wish to express the values as a range. It is common practice to express the range from minimum to maximum by choosing the lowest and highest data points, ignoring the error on each datum (e.g. 1000±100 to 2000±50 expressed as the range 1000-2000). This seems inappropriate to me, but so does a range that incorporates the standard deviations (e.g. 1000±100 to 2000±50 expressed as the range 900-2050). Shouldn't I instead consider multiple data when determining the minimum and maximum of the dataset, and use the subsets that yield the lowest and highest mean values while still passing my test of equivalence?
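The two conventional ranges described above are easy to compute directly. A minimal Python sketch using the example values, treating the quoted errors as one-sigma uncertainties (an assumption on my part):

```python
# Two data points from the example above: (value, assumed 1-sigma uncertainty)
data = [(1000, 100), (2000, 50)]

# Range from the bare data points, ignoring the errors
naive_range = (min(v for v, _ in data), max(v for v, _ in data))      # (1000, 2000)

# Range expanded by one standard deviation at each end
expanded = (min(v - s for v, s in data), max(v + s for v, s in data))  # (900, 2050)
```

Neither of these uses the errors to ask whether the extreme points are statistically distinguishable from their neighbours, which is the gap the question is pointing at.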

Apologies if the problem is not expressed clearly - I can provide a worked example in Excel if required. Any advice would be greatly appreciated.

Thanks to all!


Not sure I understood the question fully but, when it comes to descriptive statistics (mean, quantiles, variability measures), you do not owe anybody anything. Descriptives are just for you, to look and develop your intuition about the data set. You can use whatever you like to come up with research conjectures (as long as there is no selection bias) but then you will have to test your ideas formally, via models and/or statistical tests.
Hi Staassis,

thanks for taking the time to comment. It is right on point - I am looking for a way to talk about data ranges that does not ignore tests of statistical equivalence. In my field, it is common for people in the literature to take the maximum or minimum value of a dataset and ascribe to it a boundary property ('most' or 'least'). However, if that datum can also be modelled with other data as part of a normal distribution, then it seems to me inappropriate to use the datum alone for such a purpose.

To demonstrate my problem, I generated a synthetic dataset of random values with errors between 1000 and 2000. I have attached the result as a jpeg plot and a pdf file with the plot and data. From this dataset, I have shown different ways of expressing a range with minimum and maximum values, including my preferred method: quoting the means of the largest subsets of data that still pass a test of equivalence at p > 0.05. Do you think this is an appropriate way to deal with such data?
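One way that preferred method might be formalised (a sketch under assumptions, not necessarily what the attached spreadsheet does): treat each quoted error as a 1-sigma uncertainty, pool values with an inverse-variance weighted mean, test homogeneity with a chi-square statistic on the weighted residuals, and grow the subset of highest values until the test fails at alpha = 0.05. The data at the bottom are illustrative only, not the attached synthetic set.

```python
import math

def chi2_sf(x2, df):
    """Upper-tail probability of the chi-square distribution, built from the
    recurrence Q(a+1, y) = Q(a, y) + y^a e^{-y} / Gamma(a+1)."""
    y = x2 / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)          # Q(1, y) = e^{-y}
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y) = erfc(sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def weighted_mean(data):
    """Inverse-variance weighted mean and its standard error."""
    w = [1.0 / s ** 2 for _, s in data]
    m = sum(wi * v for wi, (v, _) in zip(w, data)) / sum(w)
    return m, math.sqrt(1.0 / sum(w))

def homogeneity_p(data):
    """p-value of the chi-square test that all data share one mean."""
    m, _ = weighted_mean(data)
    chi2 = sum(((v - m) / s) ** 2 for v, s in data)
    return chi2_sf(chi2, len(data) - 1)

def max_endpoint(data, alpha=0.05):
    """Grow the subset of highest values while it still passes the test;
    quote the weighted mean of the largest passing subset as the 'maximum'."""
    ranked = sorted(data, key=lambda d: d[0], reverse=True)
    best = ranked[:1]
    for k in range(2, len(ranked) + 1):
        if homogeneity_p(ranked[:k]) > alpha:
            best = ranked[:k]
        else:
            break
    return weighted_mean(best)

# Illustrative data: (value, 1-sigma uncertainty)
data = [(1000, 100), (1300, 80), (1550, 60), (1900, 70), (1950, 60), (2000, 50)]
m, se = max_endpoint(data)  # the three highest values pool; m is well below 2000
```

The same procedure run on the sorted-ascending data gives the minimum endpoint. The greedy "grow until failure" rule is one reasonable choice among several; an exhaustive search over subsets would be another.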

The problem has come up in my work several times, but I haven't found this kind of treatment in my (admittedly limited) reading. In particular, I have recently come up against one of my 'esteemed colleagues' who would argue for special meaning in an arbitrary selection of 'maximum' data points. In my synthetic example, this would be equivalent to taking, say, the four highest values (1983, 1986, 1994, 1998) and using them to claim that the maximum value of the range is >1983. I would argue that, since these four data can belong to a normal population of 1925±19 (n=22, 2sd), and therefore cannot be resolved from younger data at the level of precision of the measurements, the maximum value of the range is significantly lower.
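One way to make the "cannot be resolved" argument concrete, using the 1925±19 (2 s.e.) pooled mean quoted above: compare a datum's offset from the pooled mean against twice the combined uncertainty of the two. The 1-sigma uncertainty of 40 assigned to the datum below is hypothetical, chosen only for illustration, since the individual errors are not listed here.

```python
import math

# Pooled population from the synthetic example: weighted mean 1925, 2 s.e. = 19 (n = 22)
pooled_mean, pooled_2se = 1925.0, 19.0

# One of the 'maximum' data points; its 1-sigma uncertainty of 40 is HYPOTHETICAL
datum, sigma = 1998.0, 40.0

diff = abs(datum - pooled_mean)
# Two-sigma criterion on the combined uncertainty of datum and pooled mean
limit = 2.0 * math.sqrt((pooled_2se / 2.0) ** 2 + sigma ** 2)
resolved = diff > limit  # False here: the datum is indistinguishable from the pooled mean
```

If `resolved` is false, quoting that datum alone as the maximum of the range assigns it a significance the measurement precision does not support.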


[Attachment: range test.jpg]