Thread: comparing the DEGREE of difference between data sets

  1. #1

    Hi all,

    I'm wondering if there is a way to assess the degree of difference between data sets. For instance, say I have 4 groups of proportions (or the 4 data sets that these proportions were generated from): A, B, C, and D. Is there a way to determine which of A, B, and C is most similar to D, which is least similar to D, and so on?

    I hope this isn't too vague!
    thanks

  2. #2

    My initial approach was to use a multinomial goodness-of-fit test, but the data sets I want to compare to my reference data set D (that is, A, B, and C) are ALL significantly different from D. So I want to at least be able to distinguish A, B, and C by HOW dissimilar they are to D.

    I then tried using non-metric multi-dimensional scaling, and while this gives me a graphical representation of the kind of results I'm looking for, I'm wondering if there is a way to generate a quantitative measure of these differences.

    Thanks again

  3. #3
    Super Moderator
    bugman's Avatar

    Prior to generating an NMDS plot, the program would have had to construct a similarity (or dissimilarity) matrix, based on Bray-Curtis, Jaccard, or another such measure. You can use the entries of that matrix to interpret your samples' relationships to one another.
    The earth is round: P<0.05

  4. #4

    Awesome bugman, thanks for the help. In terms of reporting results in my similarity matrix (my software actually produces a distance matrix), if, for instance, I have an entry of 0.49 in comparing group A to D, could I report this by saying that 49 percent of the time, group A is similar to group D?

  5. #5

    Or sorry, since it is a distance matrix, that would be 51 percent of the time.

  6. #6
    Super Moderator
    bugman's Avatar

    In terms of reporting, I use dissimilarity/similarity interchangeably depending on the context.

    I'd need to know a little more about which similarity measure you use and your data (what are your groups / dependent variables etc...), because not all of them are directly interpretable in that sense.

  7. #7

    Thanks for getting back, bugman.

    I'm analyzing some forestry data I've collected, and what I'm comparing is proportions of tree species 3 different companies have cut down (groups A, B, and C being each of these 3 companies). Group D is data representing what proportions of species are actually there on the landscape I'm looking at. So I'm comparing what each company has cut to what is actually out there, to see if their logging is "representative".

    For each company, my data is in the form of proportions for 8 different tree species, summing to 1.

    So if I have a distance entry of 0.49 between company "A" and the landscape data, would this mean that in 51% of cases, a randomly chosen logged area from company A would have the same species as a randomly chosen area from the landscape as a whole?

    Thanks

  8. #8
    Super Moderator
    bugman's Avatar

    You are on the right track, but I still need to know which distance measure you are using. Some are not as straightforward to interpret as others.

  9. #9

    Whoops, I forgot to include that. My software uses Euclidean distance by default, so that is what I have been using.

    cheers

  10. #10
    Super Moderator
    bugman's Avatar

    Because Euclidean distance is simply the geometric distance in the multidimensional space, it cannot be interpreted as % difference or similarity among sites.
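
To make the point about Euclidean distance concrete, here is a minimal Python sketch. The proportions are made-up values for illustration, not the poster's data, and `euclidean` is just an illustrative helper name.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical species proportions (8 species, each vector sums to 1).
company_a = [0.40, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
landscape = [0.25, 0.25, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]

d = euclidean(company_a, landscape)

# Two profiles with no species in common are as different as they can be,
# yet their distance is sqrt(2), not 1, so the scale is not a percentage.
worst_case = euclidean([1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0])

print(round(d, 3), round(worst_case, 3))
```

Because the largest possible value for proportion vectors is sqrt(2) rather than 1, an entry such as 0.49 cannot be read as "49% different".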

    It should also be noted that because you are dealing with a distance matrix, you are no longer comparing differences or likenesses in species, you are comparing differences between sites.

    Euclidean distance tends to emphasise how many species are different rather than how many are shared and emphasises total abundances (which I don't think you are trying to do).

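    One other option for you would be the chi-squared distance, which is essentially a Euclidean distance computed on proportions (as in your data), with each species' squared difference weighted by the inverse of that species' overall share. It emphasises changes in composition rather than abundances.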

    Another option could be Jaccard's (for presence/absence data). However, if all the tree species are present at all of your sites, this makes no sense. The nice thing about Jaccard's, though, is that it is directly interpretable as the % of unshared species between sites, and logically its complement (1 minus the distance) is a measure of shared species (as a percentage). As it turns out, this is also a measure of beta diversity. Cool.
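
As a rough illustration of the Jaccard idea, the Python sketch below treats any proportion above zero as "present". The two sites and the helper name `jaccard_distance` are hypothetical, chosen only to show the arithmetic.

```python
def jaccard_distance(p, q):
    """Share of species found at either site that is NOT found at both."""
    present_p = {i for i, x in enumerate(p) if x > 0}
    present_q = {i for i, x in enumerate(q) if x > 0}
    union = present_p | present_q
    if not union:
        return 0.0  # neither site has any species; treat as identical
    shared = present_p & present_q
    return 1 - len(shared) / len(union)

# Hypothetical sites: 4 species occur overall, 2 of them at both sites.
site_1 = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
site_2 = [0.4, 0.0, 0.3, 0.3, 0.0, 0.0, 0.0, 0.0]
dist = jaccard_distance(site_1, site_2)

# If every species is present at both sites, the distance collapses to 0.
all_present = jaccard_distance([0.125] * 8, [0.3] + [0.1] * 7)

print(dist, all_present)
```

The second call shows why the measure is uninformative when all 8 species occur everywhere: every pair of sites then gets a distance of 0.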

    Bray-Curtis is also commonly used in ecology for abundance data, and is another measure that can be directly interpreted as a percentage.
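
A small Python sketch of Bray-Curtis dissimilarity follows, again with made-up proportions rather than the poster's data. When both vectors sum to 1, the value equals 1 minus the overlap (the sum of the smaller proportion for each species), which is what makes it readable as a percentage.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: sum of |differences| over sum of totals."""
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(a + b for a, b in zip(p, q))
    return numerator / denominator

# Hypothetical species proportions (8 species, each vector sums to 1).
company_a = [0.40, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
landscape = [0.25, 0.25, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]

bc = bray_curtis(company_a, landscape)

# Overlap: the proportion of the two compositions that coincides.
shared = sum(min(a, b) for a, b in zip(company_a, landscape))

print(round(bc, 3), round(1 - shared, 3))
```

With these illustrative numbers the dissimilarity comes out at roughly 0.15, so one could say the two compositions overlap by about 85%.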

    So what do you do?

    You need to decide what information within your data is most important:

    1) is abundance or composition more important?
    2) do joint absences matter? (if a species is absent at two sites, would this make the two sites more similar or should this be ignored?).

    From what I see, Euclidean is not the best choice, and I would lean toward the chi-squared distance, since it is designed for proportions and counts.
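
For reference, here is one way the chi-squared distance to a reference profile could be sketched in Python. This is an illustration under stated assumptions, not a definitive implementation: the four profiles are invented, the helper name is hypothetical, and the column weights are taken as the plain average of the four profiles because only proportions are available (correspondence analysis would normally derive them from the raw counts or areas).

```python
import math

def chi_square_distances_to(reference, profiles):
    """Chi-squared distance from every other profile to `reference`.

    profiles maps a group name to a list of proportions summing to 1.
    Assumption: every group gets equal weight when averaging.
    """
    rows = list(profiles.values())
    n_species = len(rows[0])
    # Average share of each species across all groups (the column mass).
    col_mass = [sum(r[j] for r in rows) / len(rows) for j in range(n_species)]
    ref = profiles[reference]
    distances = {}
    for name, row in profiles.items():
        if name == reference:
            continue
        distances[name] = math.sqrt(
            sum((a - b) ** 2 / m for a, b, m in zip(row, ref, col_mass) if m > 0)
        )
    return distances

# Hypothetical proportions for 8 species; D plays the role of the landscape.
profiles = {
    "A": [0.40, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05],
    "B": [0.30, 0.25, 0.15, 0.10, 0.05, 0.05, 0.05, 0.05],
    "C": [0.60, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    "D": [0.25, 0.25, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05],
}

distances = chi_square_distances_to("D", profiles)
ranking = sorted(distances, key=distances.get)  # closest to D first
print(ranking, {k: round(v, 3) for k, v in distances.items()})
```

Dividing by the column mass means a given difference in a rare species counts for more than the same difference in a common one, which is the emphasis on composition described above.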

    I hope I haven't made things worse. If I have, sorry, but repost and I'll try and help some more.

    P

  11. #11
    TS Contributor
    gianmarco's Avatar

    Hi!

    I have quietly followed this interesting discussion, since I am interested (generally speaking, and due to my research interests) in issues related to comparisons between and among assemblages.

    I agree with bugman, and I agree on the use of chi-square distance.
    I was thinking about chi-square distance because we are dealing here with "profiles" (proportions of categories among assemblages), and profiles recall (at least to my mind) the logic of Correspondence Analysis, which is based on the very concept of profiles and chi-square distances.

    Regards,
    Gm

  12. #12
    Super Moderator
    bugman's Avatar

    Thanks for the follow up GM.

    psylocat,

    and now to finally answer the question you asked in the first place:

    If you decide on Euclidean or chi-square distance, the answer is no: you cannot express the distances as a percentage similarity with another site. If you use Bray-Curtis or Jaccard's, which both have upper limits (range: 0-1 or 0-100), then yes, you can.

    The best thing to do for chi-square (CS) distance, for example (remembering that 0 = identical and larger values indicate increasing dissimilarity), is to say something like this:

    Chi-squared distances varied between the test sites and the reference site (range x-y). The smallest chi-squared distance occurred between site D and site A (χ² = 0.15, for example), indicating that the community managed by company A is following the regulations more closely than that of company B, etc. (Figure x).

    Finally, if you are comparing CS distances between sites, then a percentage can be used to say something like: test sites A and B were approximately 35% different from site C, for example.

  13. #13

    Thanks for all the input guys, I really appreciate it.

    Your comments have made me wonder whether the type of data I have would allow me to calculate these different distance measures properly, or even my original Euclidean distances.

    The data for my groups A, B, and C (logging companies) is in the form of a proportion for each of 8 tree species. For each group, I generated the proportion for tree species X by summing the sizes (in hectares) of all cutblocks whose leading species was species X, and dividing this by the total number of hectares logged by that company.

    So my raw data for groups A, B, and C is in the form of size and leading species for hundreds of cutblocks.
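
The calculation described above might look something like the following Python sketch. The cutblock records, species names, and the function name `species_proportions` are all invented for illustration.

```python
from collections import defaultdict

def species_proportions(cutblocks):
    """Area-weighted share of each leading species.

    cutblocks is an iterable of (leading_species, hectares) pairs.
    """
    area_by_species = defaultdict(float)
    for species, hectares in cutblocks:
        area_by_species[species] += hectares
    total = sum(area_by_species.values())
    return {s: area / total for s, area in area_by_species.items()}

# Hypothetical cutblocks for one company: (leading species, size in ha).
blocks = [("pine", 12.0), ("spruce", 8.0), ("pine", 20.0), ("fir", 10.0)]
props = species_proportions(blocks)
print(props)
```

Dividing the summed area per species by the total area is the same operation whether the input is a list of cutblocks or a single landscape-wide area per species, so both routes yield proportions on the same 0 to 1 scale.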

    However, for group D, what I have is the size (in hectares) of my entire study area, and the total area within it where species X is the leading species. I have divided these to get my overall proportions for each species.

    So when I generated my Euclidean distance matrix, my input data for each group was the 8 tree species proportions, but these were generated differently for group D than for groups A, B, and C.

    Sorry if this is convoluted, this is the best way I can think of to describe what I've done.

    So I guess my question would be, given the type of data I have, are all these distance measures still on the table?

    Thanks again!

  14. #14
    Super Moderator
    bugman's Avatar

    I only sort of follow, but since you are using proportions you should be fine.

    If in doubt, you might want to standardise your data before analysis to account for different areas.

    I'm curious now to know what software you are using.

  15. #15

    Hey bugman,

    I'm using R for my analysis.

    So all I'm really interested in being able to say is that one group was more similar to group D than another group. It appears the consensus is that chi-squared distance will best do this, so I will try that, although it seems R does not have a built-in function to calculate this.
