
Thread: This is a nightmare data set

  1. #1
    bugman (Super Moderator)

    This is a nightmare data set



    I'll make this brief so there is not too much confusion. I am looking for guidance and advice, and generally just to bounce ideas off you guys.

    We have a job to analyse a pretty large dataset consisting of a number of predictors and two responses.

    Here it is:

    There are two pathogens, p1 and p2. For each of p1 and p2, 100 cells are stained in a solution using a batch of a particular stain. The percentage of cells that actually take up the stain is then calculated for p1 and for p2.

    The question is whether water quality, nutrients, batch or institution (i.e. their methodology) influences the percentage of cells that are successfully stained.

    So what we want to do is determine which of a number of independent variables may influence this percentage.

    However the problems I am grappling with are these:

    There is a mixture of continuous, censored and categorical variables. Here are a few examples:

    Censored: nutrient concentrations, often reported as <0.02 mg/L in the solution
    Categorical: batch number, institution
    Continuous: pH, salinity, temperature....
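
    As a rough illustration of keeping censored and missing entries apart before any modelling, here is a minimal Python sketch. The raw values, the "NA" convention and the function name parse_measurement are made up for the example and are not taken from the actual data set.

```python
from typing import Optional, Tuple


def parse_measurement(raw: str) -> Tuple[Optional[float], bool]:
    """Return (value, is_left_censored) for one raw lab entry.

    "<0.02" -> (0.02, True): only the reporting limit is known.
    "0.15"  -> (0.15, False): an ordinary measured value.
    ""/"NA" -> (None, False): missing, which is not the same as censored.
    """
    text = raw.strip()
    if text == "" or text.upper() == "NA":
        return None, False
    if text.startswith("<"):
        return float(text[1:]), True
    return float(text), False


# Hypothetical nutrient column in mg/L (made-up values).
raw_column = ["<0.02", "0.15", "", "<0.02", "0.40", "NA"]
parsed = [parse_measurement(v) for v in raw_column]

n_censored = sum(1 for _, censored in parsed if censored)
n_missing = sum(1 for value, _ in parsed if value is None)
print(parsed)
print("censored:", n_censored, "missing:", n_missing)
```

    Storing the limit together with a flag, rather than substituting a number straight away, leaves the choice of how to treat the censored values open for later.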

    Also, there are a large number of missing values, especially in the water quality variables, and the response contains a lot of zeros.

    Now, a lot of this is outside my experience because of the nature of the response (i.e. proportions) versus multiple IVs with different distributions.

    1) how best to tackle a lot of missing values (in some cases 80%)?
    2) how best to analyse proportion responses?

    Am I best to analyse p1 and p2 together given they come from the same solution?

    I was reading last week about boosted regression trees and was thinking that these might be appropriate here. What do you think? Or am I only thinking this because of my recent introduction to the subject?

    Or am I reading too much into this, and maybe a GLM with a negative binomial or zero-inflated term for the high number of zeros might do?

    More detail can be supplied, but I'd love to hear your thoughts. This is an intriguing data set, albeit a bit overwhelming at this early stage!
    The earth is round: P<0.05

  2. #2
    Dason (RotParaTon)

    Re: This is a nightmare data set

    If your response is the number of cells that were successfully stained out of 100 then I would think a logistic or probit regression might work?
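
    To make the suggestion concrete, here is a minimal sketch of a logistic regression on grouped counts (stained cells out of 100) with a single predictor, fitted by Newton-Raphson in plain Python. The salinity values and counts are invented for illustration, and the function name fit_binomial_logit is hypothetical; a real analysis would use a statistics package and more predictors.

```python
import math


def fit_binomial_logit(x, successes, trials, n_iter=25):
    """Fit logit(p) = b0 + b1*x to grouped binomial counts by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0          # score: gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the Fisher information matrix
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)
            g0 += yi - ni * p
            g1 += (yi - ni * p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) applied to the score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: salinity and number of stained cells out of 100.
salinity = [5, 10, 15, 20, 25, 30]
stained = [2, 5, 12, 20, 37, 55]
trials = [100] * len(stained)

b0, b1 = fit_binomial_logit(salinity, stained, trials)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * s))) for s in salinity]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print([round(p, 3) for p in fitted])
```

    A probit fit would only swap the logistic function for the normal CDF; the counts-out-of-100 structure of the response stays the same.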
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  3. #3
    bugman (Super Moderator)

    Re: This is a nightmare data set

    Thanks Dason,

    Would this be OK to fit with continuous and categorical IVs? And briefly, which would you use: logistic or probit? I have never used probit, and the distinction between the two has always been a bit of a grey area for me. Could you perhaps briefly explain?

    Thanks Mate

  4. #4
    gsgs

    Re: This is a nightmare data set

    First try to estimate the missing values.
    Then do the analysis with the completed (artificial) dataset, which is easier than working with missing values.
    Later, when the program is ready, you can always easily rerun
    the analysis with different starting estimates.

  5. The Following User Says Thank You to gsgs For This Useful Post:

    bugman (04-04-2012)

  6. #5
    GretaGarbo (Banned)

    Re: This is a nightmare data set

    I take your pathogens to be bacteria. (I don’t really understand how you can separate out 100 bacteria of the p1 type and another 100 of p2 and stain each one of them, but never mind that.) Anyway, I understand that there could be, for example, 23 stained cells of the p1 type and 36 stained cells of the p2 type.

    (At first I thought it would be natural to use logit or probit. But I am unsure whether each bacterium is a Bernoulli experiment, each with a constant probability of being stained, so that all 100 together would be a binomial experiment (and therefore lead to logit or probit).)

    If the data range from 0 to 100 (or 0 to 1) then a beta-distribution model might be appropriate.
    But from your text it seems that most of the data are close to zero. Then a zero-inflated Poisson or zero-inflated negative binomial model might be enough.
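
    One quick way to look at the doubt about a constant staining probability is a Pearson dispersion statistic for an intercept-only binomial model: values far above 1 suggest the counts vary more than a plain binomial allows. This is only a sketch; the counts below are made up (with many zeros) and the function name pearson_dispersion is hypothetical.

```python
def pearson_dispersion(successes, trials):
    """Pearson chi-square divided by its degrees of freedom
    for a binomial model with one common probability."""
    p_hat = sum(successes) / sum(trials)
    chi_sq = sum(
        (y - n * p_hat) ** 2 / (n * p_hat * (1.0 - p_hat))
        for y, n in zip(successes, trials)
    )
    return chi_sq / (len(successes) - 1)


# Hypothetical stained-cell counts out of 100 (made up, with many zeros).
stained = [0, 0, 0, 0, 3, 0, 12, 0, 45, 0, 0, 7]
trials = [100] * len(stained)

dispersion = pearson_dispersion(stained, trials)
print(f"dispersion = {dispersion:.1f}")
```

    A value near 1 would be in line with the simple binomial picture; a much larger value points towards overdispersed or zero-inflated alternatives like the ones mentioned above.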

    Is it possible for you and your colleagues to take a step back and do a designed experiment? I would guess that your biological and chemical colleagues (and you) would feel much more confident in the results if they were partly based on experiments. I was thinking of a factorial experiment for nutrient concentration, salinity, temperature etc. Some of these factors might be noise factors, as in robust design (the kind of experiment that was popularised by Taguchi). The result might be more robust at certain levels, such as high salinity and low temperature.
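
    For a sense of scale, a two-level full factorial in three factors needs only eight runs. The sketch below lists them and draws a randomised run order; the factor names, levels and the seed are invented for illustration and would have to come from the people running the lab work.

```python
import random
from itertools import product

# Hypothetical factors with a low and a high level each.
factors = {
    "nutrient_mg_per_L": [0.02, 0.20],
    "salinity_ppt": [5, 30],
    "temperature_C": [10, 25],
}

# Full factorial: every combination of levels, 2 * 2 * 2 = 8 runs.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

# Randomise the run order (fixed seed only so the example is repeatable).
run_order = random.Random(2012).sample(design, k=len(design))

for run_number, run in enumerate(run_order, start=1):
    print(run_number, run)
```

    Batch and institution could be added as blocking factors in the same way, at the cost of more runs.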

    Such data would not be a “nightmare data set”, would it?

    (Regression trees sound interesting. But whether they are more successful than linear models depends on the degree of interaction, doesn’t it?)

    Censored response variables can be handled. But I don’t know about censoring in the x-variables. (I read about it not long ago but I don’t remember where.)

    Imputation of missing values has often been done with the EM algorithm. But I don’t know whether that works here.

  7. The Following User Says Thank You to GretaGarbo For This Useful Post:

    bugman (04-04-2012)

  8. #6
    GretaGarbo (Banned)

    Re: This is a nightmare data set

    Originally posted by bugman:
    "I have never used probit and the distinction between the two has always been a bit of a grey area for me? Could you perhaps briefly explain?"

    In another post I tried to explain logit and probit, also with your question in mind. I will try to link to that post here, but I don’t know if it works. (It is the last post in the linked thread.)
    http://www.talkstats.com/showthread....point-location

    Otherwise, here is another link about probit and so on.
    http://data.princeton.edu/wws509/notes/c3s7.html



    How are you doing with the nightmare? :-)

    What about doing an experimental design?
    Last edited by GretaGarbo; 04-27-2012 at 05:29 PM.

  9. #7
    Link (Ninja say what!?!)

    Re: This is a nightmare data set

    I forget. Are you a SAS user?

    There are two procs that come to mind. The first is proc MI, which has gotten pretty good at imputing missing data. The second is proc logistic, using the r/n notation (where r is the number of cells taking up the stain, and n is 100 if you're modelling each pathogen separately and 200 if you are modelling them together).
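
    With multiple imputation the model is fitted once per imputed data set and the results are then combined, usually with Rubin's rules: the pooled estimate is the mean of the estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal Python sketch follows; the five slope estimates and their variances are made-up numbers, and pool_rubin is a hypothetical helper name.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average squared standard error
    between = variance(estimates)   # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, within, between, total


# Hypothetical slope estimates and squared standard errors from m = 5 imputations.
slopes = [0.14, 0.17, 0.15, 0.19, 0.16]
slope_vars = [0.0009, 0.0010, 0.0008, 0.0011, 0.0009]

pooled, within, between, total = pool_rubin(slopes, slope_vars)
print(f"pooled slope = {pooled:.3f}, pooled SE = {total ** 0.5:.4f}")
```

    The between-imputation term is what carries the extra uncertainty from the missing values, which matters a lot when a variable is up to 80% missing.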

    Yes, you can use continuous and categorical variables in logistic regression. Just create dummy indicators for each group.

    The difference between a logistic model and a probit model is the link function used. See this for more info: http://www.upa.pdx.edu/IOA/newsom/da2/ho_link.pdf
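
    To show what "different link function" means in practice, this small Python sketch maps the same linear predictor to a probability with the inverse logit (logistic CDF) and the inverse probit (standard normal CDF). The function names are my own, and the rescaling constant 1.702 is a commonly quoted approximation rather than anything from the linked handout.

```python
import math


def inv_logit(eta: float) -> float:
    """Inverse logit link: the logistic CDF."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta: float) -> float:
    """Inverse probit link: the standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"eta={eta:+.1f}  logit p={inv_logit(eta):.3f}  probit p={inv_probit(eta):.3f}")

# Rescaling the linear predictor by about 1.7 makes the two curves nearly coincide.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(inv_probit(e) - inv_logit(1.702 * e)) for e in grid)
print(f"largest gap after rescaling: {max_gap:.4f}")
```

    The two curves have the same S shape, so fitted probabilities are usually very close; the coefficients differ mainly by that scale factor, and the logit version has the odds-ratio interpretation.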

    HTH

  10. #8
    GretaGarbo (Banned)

    Re: This is a nightmare data set


    There is an interview thread where they are interviewing Bugman. Well, they can do that if they want to. But Garbo does not give interviews.

    Well, I am sure that Bugman is a nice and interesting person, but I am not so interested in that. I am more interested in ehhh,…, here it is getting embarrassing again, I am actually interested in statistics.

    This thread was one of the first I saw and I still think it is one of the most interesting. I wonder what happened to this project? But I don’t want to ask something that Bugman does not want to, or is not allowed to, answer.

    Instead I will give an interpretation of what might have happened just before this project started.

    Maybe the others in this project thought that it is important to investigate how these pathogens are influenced by many factors. I believe that they had some good quality data. But then I believe (this is a guess, a complete fantasy) that they thought the question was so important that they searched for more data and decided to give Bugman all they had. In doing so they polluted the original good data with extra data that contained censored variables, missing values, excessive numbers of zeros and other difficulties, in the naïve belief that the statistician could in some miraculous way sort out the good data from the bad. They did data pollution.

    Or maybe something completely different happened in this project.
