
Thread: Discriminant analysis with mostly incomplete cases

  1. #1

    Discriminant analysis with mostly incomplete cases




    Hello. I have a bit of a biostats problem.

    I've got data on 5 groups and 15 species. There are 9 variables and 1000 cases in total. The problem, however, is that fewer than 20% of the cases have full complements of data. In fact, 61% of measurements are missing because they were unobtainable (I'm dealing with fragile fossils).

    As it is, because I didn't personally collect most of the data, I'm sure that many cases are actually the result of combination, where the recorder took 2 measurements from one specimen and 2 from another of the same species and collection, and threw them together as 1 case.

    My goal is to perform a discriminant function analysis demonstrating that the 5 groups are indeed separate, but that most species in each group are not distinguishable from other species in the group.

    If I only use the few complete cases, then this is pretty much impossible. Not to mention that the complete cases are biased as they do not represent the distribution for each species or group. Also, univariate analysis is a dead end for distinguishing the groups.

    It's been suggested to me to combine cases from the same species in order to create as many complete cases as possible. As I mentioned, I'm sure the data recorders have already done this for many of the cases. It's also been suggested that, after normalizing the data, I use the distributions, means, and standard deviations of the variables in each species to generate many cases, and run the discriminant analysis on those.

    Any thoughts on what to do and how to proceed?
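
    To make the second suggestion concrete, here is a minimal Python sketch of generating cases from per-species means and standard deviations. The function name simulate_cases and the use of None for a missing measurement are my own assumptions for illustration. It draws each variable independently from a normal distribution, which is itself an assumption about the data.

```python
import random
import statistics

def simulate_cases(cases, n_new, seed=0):
    """Draw n_new synthetic cases for ONE species.

    `cases` is a list of rows; a missing measurement is None (assumed encoding).
    Each variable is drawn independently from a normal distribution whose
    mean and standard deviation come from that variable's observed values.
    """
    rng = random.Random(seed)
    n_vars = len(cases[0])
    params = []
    for j in range(n_vars):
        observed = [c[j] for c in cases if c[j] is not None]
        if len(observed) < 2:
            raise ValueError(f"variable {j} has fewer than 2 observed values")
        params.append((statistics.mean(observed), statistics.stdev(observed)))
    # One independent normal draw per variable, per synthetic case.
    return [[rng.gauss(mu, sd) for mu, sd in params] for _ in range(n_new)]

# Toy species with two variables and some gaps.
species_cases = [[10.0, None], [12.0, 5.0], [None, 7.0], [11.0, 6.0]]
synthetic = simulate_cases(species_cases, n_new=5, seed=1)
```

    Note that independent draws throw away the correlations between variables, and discriminant analysis depends on exactly that covariance structure, so treat this only as a sketch of the suggestion as stated.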

  2. #2
    If the variance-covariance matrices are not the same across groups, you will have to try quadratic discriminant analysis instead of linear. I have used it and it gave better results.
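
    For reference, the two rules differ only in which covariance matrix they use. In standard textbook notation (the symbols \mu_g, \Sigma_g, \Sigma and \pi_g for the group mean, group covariance, pooled covariance and prior probability are mine, not from this thread), a case x is assigned to the group g with the largest score:

```latex
% Linear discriminant score: one pooled covariance matrix \Sigma for all groups
\delta_g^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_g
    - \tfrac{1}{2}\,\mu_g^{\top}\Sigma^{-1}\mu_g + \log \pi_g

% Quadratic discriminant score: a separate covariance matrix \Sigma_g per group
\delta_g^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_g\rvert
    - \tfrac{1}{2}\,(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g) + \log \pi_g
```

    With 9 variables, each \Sigma_g has 9*10/2 = 45 free parameters, so the quadratic version needs enough usable cases in every group to estimate them.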

  3. #3

    Just The ANSWER to your question

    Hi guys.

    Let me give a method that preserves the conditional joint distribution of X, assuming that
    the underlying missing-data mechanism is ignorable.
    This means that the probability p_j that the value of the jth variable is missing doesn't depend on that value.
    In this case the following algorithm seems relevant to your problem.
    For example, suppose you have
    case1=(y_1,...,y_k, y_{k+1}=missing, ... , y_n=missing)
    and you want to fill in variable k+1.

    Look at all the other cases that have variables 1,...,k filled
    and also variable k+1 filled. Denote this set of cases by M. Find the observations in M nearest to case1 by Euclidean distance, dividing each difference y_j-y'_j by \sigma_j to account for the different ranges of the variables.

    Denote the set of nearest observations by M(case1,k+1).
    Choose a point x=(x_1,...,x_{k+1},...) from M(case1,k+1) at random and fill the (k+1)th variable of case1 with its value x_{k+1}.

    Do this for each group separately, and for all missing values and cases.
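
    The steps above can be sketched in Python roughly as follows. This is only an illustration under my own assumptions: missing values are encoded as None, the names column_sd, impute_one and impute_group are hypothetical, the size of the "nearest" set (n_nearest) is a free choice the description does not fix, and the distance uses whichever variables the case has observed rather than strictly the first k.

```python
import math
import random

def column_sd(rows, j):
    """Sample standard deviation sigma_j of variable j over its observed values."""
    vals = [r[j] for r in rows if r[j] is not None]
    if len(vals) < 2:
        return 1.0  # fallback: too few values to estimate a spread
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
    return math.sqrt(var) or 1.0  # avoid dividing by zero for a constant column

def impute_one(case, target, pool, sds, n_nearest, rng):
    """Pick a donor value for case[target] from the nearest cases in M."""
    observed = [j for j, v in enumerate(case) if v is not None]
    # M: cases with the target filled and every variable observed in `case` filled.
    M = [d for d in pool
         if d[target] is not None and all(d[j] is not None for j in observed)]
    if not M:
        return None  # no usable donor; the value stays missing

    def dist(d):
        # Euclidean distance over the observed variables, each scaled by sigma_j.
        return math.sqrt(sum(((case[j] - d[j]) / sds[j]) ** 2 for j in observed))

    nearest = sorted(M, key=dist)[:n_nearest]  # this is M(case, target)
    return rng.choice(nearest)[target]

def impute_group(rows, n_nearest=3, seed=0):
    """Fill every missing value in one group; donors are original rows only."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    sds = [column_sd(rows, j) for j in range(n_vars)]
    filled = [list(r) for r in rows]
    for i, case in enumerate(rows):
        for target in range(n_vars):
            if case[target] is None:
                filled[i][target] = impute_one(case, target, rows, sds, n_nearest, rng)
    return filled

# Toy group: the last case is missing its third variable.
group = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.3], [5.0, 6.0, 9.0], [1.05, 2.05, None]]
completed = impute_group(group, n_nearest=2, seed=1)
```

    Call impute_group once per group so that donors never cross group boundaries. With 61% of the measurements missing, M will often be empty, in which case this sketch leaves the value as None instead of guessing.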
    Last edited by kobylkinks; 03-02-2009 at 12:49 PM.

  4. #4

    Refinements

    I meant: find the nearest cases in M, where the Euclidean distance is defined in R^k, the space of the first k variables.
    M(case1,k+1) is a subset of M.
    Last edited by kobylkinks; 03-02-2009 at 11:48 AM.

  5. #5
    What is sigma_j?

    When choosing a substitute value for a missing value, does it matter whether it comes from the same group/species? I would think so, or maybe not. If it does make a difference, then 1000 cases divided into 5 groups and then into 15 species, with 9 variables, makes me think there is little substitute data to use.

  6. #6
    sigma_j is the standard deviation, i.e. the square root of the variance, of the jth variable.

  7. #7

    Doh, for some reason I was thinking of epsilon. Whoops!
