comparing the DEGREE of difference between data sets

Hi all,

I'm wondering if there is a way to assess the degree of difference between data sets. For instance, say I have 4 groups of proportions (or 4 data sets that these proportions were generated from), A, B, C, and D. Is there a way to determine which of A, B, and C is most similar to D, which is least similar to D, and so on?

I hope this isn't too vague!
My initial approach was to attempt to use a multinomial goodness of fit test, but the data sets (A,B, and C) that I want to compare to my reference data set (D) are ALL significantly different from D. As a result, I want to at least be able to distinguish A,B, and C in HOW dissimilar they are to D.

I then tried using non-metric multi-dimensional scaling, and while this gives me a graphical representation of the kind of results I'm looking for, I'm wondering if there is a way to generate a quantitative measure of these differences.

Thanks again


Super Moderator
Prior to generating an NMDS plot, the program would have had to construct a similarity matrix based on Bray-Curtis, Jaccard's, or another similarity measure. You can use these values to interpret your samples' relationships to one another.
Awesome bugman, thanks for the help. In terms of reporting results in my similarity matrix (my software actually produces a distance matrix), if, for instance, I have an entry of 0.49 in comparing group A to D, could I report this by saying that 49 percent of the time, group A is similar to group D?


Super Moderator
In terms of reporting, I use dissimilarity/similarity interchangeably depending on the context.

I'd need to know a little more about which similarity measure you use and your data (what are your groups / dependent variables etc...), because not all of them are directly interpretable in that sense.
Thanks for getting back, bugman.

I'm analyzing some forestry data I've collected, and what I'm comparing is proportions of tree species 3 different companies have cut down (groups A, B, and C being each of these 3 companies). Group D is data representing what proportions of species are actually there on the landscape I'm looking at. So I'm comparing what each company has cut to what is actually out there, to see if their logging is "representative".

for each company, my data is in the form of proportions for 8 different tree species, summing to 1.

So if between company "A" and the landscape data, I have a distance entry of .49, would this mean that in 51% of cases, taking a randomly logged area from company A would have the same species as taking a random area from the landscape as a whole?



Super Moderator
You are on the right track, but I still need to know which distance measure you are using. Some are not as straightforward to interpret as others.


Super Moderator
Because Euclidean distance is simply the geometric distance in the multidimensional space, it cannot be interpreted as % difference or similarity among sites.
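To make the point concrete, here is a minimal Python sketch of Euclidean distance between two species profiles. The proportions are made up for illustration (they are not the poster's data), and the names `company_a` and `landscape_d` are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 8-species proportions, each summing to 1 (illustrative only).
company_a = [0.40, 0.20, 0.10, 0.10, 0.08, 0.06, 0.04, 0.02]
landscape_d = [0.15, 0.15, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10]

d = euclidean(company_a, landscape_d)

# Two profiles with no species in common sit sqrt(2) apart, not 1,
# which is one reason the raw value does not read as a percentage.
max_d = euclidean([1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0])
print(round(d, 3), round(max_d, 3))
```

The value is still usable for ranking groups by closeness to D; it just should not be reported as "x percent similar".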

It should also be noted that because you are dealing with a distance matrix, you are no longer comparing differences or likenesses in species, you are comparing differences between sites.

Euclidean distance tends to emphasise how many species are different rather than how many are shared and emphasises total abundances (which I don't think you are trying to do).

Another option for you would be the chi-squared distance, which is essentially a Euclidean distance computed on proportions (as in your data), with each species weighted by the inverse of its overall relative frequency. It emphasises changes in composition rather than abundances.
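As a rough sketch, one common definition of the chi-squared distance (the one used in correspondence analysis) can be written in a few lines of Python. The function name `chi_square_distance` is hypothetical, and note an assumption: if you feed it rows of proportions rather than raw areas, every group gets equal weight when the species masses are computed.

```python
import math

def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of a groups-by-species table."""
    grand_total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Column masses: each species' share of the grand total.
    col_mass = [sum(row[j] for row in table) / grand_total for j in range(n_cols)]
    # Row profiles: each row rescaled so that it sums to 1.
    prof_i = [v / sum(table[i]) for v in table[i]]
    prof_k = [v / sum(table[k]) for v in table[k]]
    # Squared profile differences, weighted by 1 / column mass.
    return math.sqrt(sum((a - b) ** 2 / m
                         for a, b, m in zip(prof_i, prof_k, col_mass) if m > 0))

# Tiny made-up table: two groups, two species.
toy = [[0.5, 0.5], [1.0, 0.0]]
print(chi_square_distance(toy, 0, 1))
```

Because the species masses come from the whole table, the distance between two groups can change when other groups are added or removed, so it is worth computing all distances from the same table.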

Another option for you could be Jaccard's (for presence/absence data). However, if all the tree species are present at all of your sites, this makes no sense. The nice thing about Jaccard's, though, is that it is directly interpretable as the percentage of unshared species between sites, so logically its complement (1 minus the distance) is a measure of shared species (as a percentage), and as it turns out this is also a measure of beta diversity. Cool.
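A small Python sketch of the Jaccard distance on presence/absence shows why it is uninformative when every species occurs everywhere. The function name and the example vectors are hypothetical; any value above zero is treated as "present".

```python
def jaccard_distance(p, q):
    """Share of species found at only one of the two sites (0 = same species list)."""
    present_p = {i for i, v in enumerate(p) if v > 0}
    present_q = {i for i, v in enumerate(q) if v > 0}
    union = present_p | present_q
    if not union:
        return 0.0  # no species at either site; treated as identical here
    return 1 - len(present_p & present_q) / len(union)

# All 8 species present at both sites: the distance collapses to 0,
# no matter how different the proportions are.
all_present = jaccard_distance([0.4, 0.2, 0.1, 0.1, 0.08, 0.06, 0.04, 0.02],
                               [0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1])

# One shared species out of three seen overall: two thirds unshared.
partial = jaccard_distance([1, 1, 0, 0], [0, 1, 1, 0])
print(all_present, round(partial, 3))
```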

Bray-Curtis is also commonly used in ecology for abundance data, and is another measure that is directly interpretable as a percentage.
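For profiles that each sum to 1, Bray-Curtis reduces to half the sum of absolute differences, and 1 minus that value is the share of the composition the two profiles have in common. A minimal Python sketch, using made-up proportions and hypothetical names:

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: 0 = identical profiles, 1 = nothing shared."""
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))

# Hypothetical 8-species proportions, each summing to 1 (illustrative only).
company_a = [0.40, 0.20, 0.10, 0.10, 0.08, 0.06, 0.04, 0.02]
landscape_d = [0.15, 0.15, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10]

bc = bray_curtis(company_a, landscape_d)
# For proportions, the shared part equals the sum of the species-wise minima.
overlap = sum(min(a, b) for a, b in zip(company_a, landscape_d))
print(round(bc, 2), round(overlap, 2))
```

Read this way, a Bray-Curtis value would describe how much of the species mix is not shared between a company's cut and the landscape, rather than how often two randomly chosen areas match.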

So what do you do?

You need to decide what information within your data is most important:

1) is abundance or composition more important?
2) do joint absences matter? (if a species is absent at two sites, would this make the two sites more similar or should this be ignored?).

From what I see, Euclidean is not the best idea and I would tend toward Chi squared distance since it is based on proportions and counts.

I hope I haven't made things worse. If I have, sorry, but repost and I'll try and help some more.



TS Contributor

I have quietly followed this interesting discussion, since I am interested (generally speaking, and due to my research interests) in issues related to comparisons between and among assemblages.

I agree with bugman, and I agree on the use of chi-square distance...
I was thinking about chi-square distance because we are dealing here with "profiles" (proportions of categories among assemblages) and profiles recall (at least to my mind) the logic of Correspondence Analysis, which is based on the very concept of profiles and chi-square distances.



Super Moderator
Thanks for the follow up GM.


and now to finally answer the question you asked in the first place:

If you decide on Euclidean or chi-squared distance, the answer is no, you cannot express the distances as percentage similarity with another site. If you use Bray-Curtis or Jaccard's (both of which have upper limits, with a range of 0-1 or 0-100), then yes, you can.

The best thing to do for chi-squared (CS) distance, for example (remembering that 0 = identical and larger values indicate increasing dissimilarity), is to say something like this:

Chi-squared distances varied between the test sites and the reference site (range x-y). The smallest chi-squared distance occurred between site D and site A (X2 = 0.15, for example), indicating that the community managed by company A is following the regulations more closely than that of company B, etc. (Figure x).

Finally, if you are comparing CS distances between sites, then a percentage can be used to say something like: test sites A and B were approximately 35% different from site C, for example.
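The reporting pattern above (a range plus the closest and farthest group) is easy to produce once the distances to the reference are in hand. In this Python sketch the 0.15 echoes the example value above; the other two distances are invented placeholders.

```python
# Hypothetical chi-squared distances from each company to reference D.
distance_to_d = {"A": 0.15, "B": 0.32, "C": 0.41}

# Smallest distance first, i.e. most similar to D first.
ranking = sorted(distance_to_d, key=distance_to_d.get)
lo, hi = min(distance_to_d.values()), max(distance_to_d.values())

print(f"range {lo:.2f}-{hi:.2f}; most similar to D: {ranking[0]}; "
      f"least similar: {ranking[-1]}")
```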

Thanks for all the input guys, I really appreciate it.

Your comments have made me wonder whether the type of data I have would allow me to calculate these different distance measures, or even my original euclidean distances properly.

The data for my groups A, B, and C (logging companies) is in the form of a proportion for each of 8 tree species. For each group, I generated the proportion for tree species X by summing the sizes (in hectares) of all cut blocks whose leading species was species X, and dividing this by the total number of hectares logged by that company.

So my raw data for groups A, B, and C is in the form of size and leading species for hundreds of cutblocks.
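The calculation described above can be sketched in Python as follows. The record layout (hectares, leading species), the function name, and the sample blocks are all assumptions for illustration, not the actual data.

```python
from collections import defaultdict

def species_proportions(cutblocks):
    """Area-weighted share of each leading species across a company's cutblocks."""
    area_by_species = defaultdict(float)
    for hectares, leading_species in cutblocks:
        area_by_species[leading_species] += hectares
    total = sum(area_by_species.values())
    return {sp: area / total for sp, area in area_by_species.items()}

# Made-up cutblocks: (size in hectares, leading species).
blocks = [(12.5, "spruce"), (30.0, "pine"), (7.5, "spruce"), (50.0, "fir")]
props = species_proportions(blocks)
print(props)
```

Applying the same area-over-total-area division to the landscape figures yields proportions on the same footing, which is why the two sources can be compared even though one starts from individual blocks and the other from totals.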

However, for group D, what I have is the size (in hectares) of my entire study area, and the total area within this that species X is the leading species. I have divided these to get my overall proportions for each species.

So when I generated my euclidean distance matrix, my input data for each group was the 8 tree species proportions, but was generated differently for group D than for groups A,B, and C

Sorry if this is convoluted, this is the best way I can think of to describe what I've done.

So I guess my question would be, given the type of data I have, are all these distance measures still on the table?

Thanks again!


Super Moderator
I only sort of follow, but since you are using proportions you should be fine.

If in doubt, you might want to standardise your data before analysis to account for different areas.

I'm curious now to know what software you are using.
Hey bugman,

I'm using R for my analysis.

So all I'm really interested in being able to say is that one group was more similar to group D than another group. It appears the consensus is that chi-squared distance will best do this, so I will try that. Although it seems R does not have a function to calculate this.
Thanks for all the help bugman, I appreciate the time and effort.

I can't seem to find the SIMIL package for R, and so think I will settle on euclidean distances.



Ambassador to the humans
I can't seem to find it either. I can't say I looked too hard but I couldn't find it on the CRAN list of contributed packages either.