
Thread: Regression diagnostics with proc glm or proc reg

  1. #16

    Re: Regression diagnostics with proc glm or proc reg




    Any clue?

  2. #17

    Originally posted by StatsClue:
    That code for cookd and residual worked but my N >1000, so it's tough looking for influential observations (though I'm not even sure what to do with them if I do find them).

    Is there a way to get SAS to print out only the observations of concern, with proc glm?

    Thanks.

    Proc glm just writes Cook's D to the output dataset. Say the Cook's D statistic is in a dataset named analysis & it is given by the variable cook. By a common rule of thumb, the influential observations are the ones with Cook's D > 4/n.

    proc print data=analysis(where=(cook>4/1001)); *assuming your dataset has 1001 observations;
    run;
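
    The hard-coded 1001 above has to be changed by hand whenever the sample size changes. A minimal sketch of one way around that, assuming the dataset analysis and the variable cook described above (the macro variable name n is made up for illustration):

    ```sas
    /* Count the observations that have a Cook's D value          */
    /* (rows not used in the fit have a missing cook).             */
    proc sql noprint;
      select count(cook) into :n
      from analysis;
    quit;

    /* Print only the observations above the 4/n rule-of-thumb cutoff */
    proc print data=analysis(where=(cook > 4/&n));
    run;
    ```

    The cutoff 4/n is only a screening rule, so the printed rows are candidates to look at, not observations to delete automatically.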

    Once you have the high-influence observations, investigate them & apply your policy for handling outliers.

    For more discussion on detecting outliers you can see the following thread: http://www.talkstats.com/showthread....iers-using-SAS

  3. The Following User Says Thank You to jrai For This Useful Post:

    StatsClue (02-11-2012)

  4. #18

    Worked. Got about 90 or so observations. If they are so many, how are they even outliers?

    From what I've read, one rechecks the data, increases the sample size or finds an excuse to delete the outliers. I don't think any of this is possible in my case. Will my model be bad if I don't do anything about the outliers? THANKS.

  5. #19

    Try using this:

    proc standard mean=0 std=1 data=your_dataset out=temp; *your_dataset is a placeholder for your input dataset. This step replaces each variable in the VAR statement with its z-score;
    var List_of_IVs;
    run;

    Then find the observations where any standardized IV is more than 3 or less than -3. These are possible outliers. This method will give you a smaller set of possible outliers. If you think these values are really rare, are not the general case, or don't make sense, then you can delete them.

  6. The Following User Says Thank You to jrai For This Useful Post:

    StatsClue (02-11-2012)

  7. #20

    proc standard mean=0 std=1 data=StatsClue out=temp;
    var continuousvar1 continuousvar2 continuousvar3;
    run;

    IT RAN but gave no output! <scratching head>

  8. #21
    Posted by Dason

    I believe it just created a new dataset which you decided to call 'temp'. You could use proc print or open up that dataset manually to examine the output.
    I don't have emotions and sometimes that makes me very sad.

  9. The Following User Says Thank You to Dason For This Useful Post:

    StatsClue (02-11-2012)

  10. #22

    proc print data=temp;
    run;

    Printed my entire dataset (with cook and residual)! I thought it would print only the outliers!

  11. #23
    Posted by Dason

    Well all that did was standardize your data. You'd need to do a data step or something else to identify the outliers in some way.

  12. #24

    Observations with a z-score > 3 or < -3 on any variable are possible outliers & must be investigated:

    proc print data=temp(where=(continuousvar1>3 or continuousvar2>3 or continuousvar3>3 or continuousvar1<-3 or continuousvar2<-3 or continuousvar3<-3));
    run;

    Check the observations so obtained in your original dataset, i.e. statsclue, because temp holds standardized values, which won't make much sense on their own. The actual values are in statsclue. Note the observation numbers printed by the code above & check the corresponding observations in the original dataset.
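
    Looking up observation numbers by hand gets tedious with dozens of flagged rows. A minimal sketch of one alternative, assuming temp was created from statsclue by proc standard (so both datasets have the same rows in the same order); the names flagged, z1-z3 and obs_number are made up for illustration. It also uses abs(), because in SAS a missing value compares as smaller than any number, so a condition like continuousvar1<-3 would pick up missing values as well.

    ```sas
    /* One-to-one merge (no BY statement): row i of statsclue is paired  */
    /* with row i of temp, so original values and z-scores sit together. */
    data flagged;
      merge statsclue
            temp(keep=continuousvar1 continuousvar2 continuousvar3
                 rename=(continuousvar1=z1 continuousvar2=z2 continuousvar3=z3));
      obs_number = _n_;   /* row position in the original dataset */
      /* keep rows where any z-score is beyond +/-3; missing z-scores are not flagged */
      if abs(z1) > 3 or abs(z2) > 3 or abs(z3) > 3;
    run;

    proc print data=flagged;
    run;
    ```

    The printed rows show the actual values next to the z-scores that triggered the flag.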

  13. #25

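    According to the empirical rule, if your variables are approximately normally distributed then about 0.3% of observations will fall outside the ±3 range. Therefore, with 1000 observations you should expect about 3 outliers per variable, i.e. # of outliers ≈ 0.003 × 1000 × # of variables. Your count should be approximately equal to this number, or a little lower if some observations are flagged on more than one variable.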
    Last edited by jrai; 02-11-2012 at 09:18 PM.

  14. The Following User Says Thank You to jrai For This Useful Post:

    StatsClue (02-11-2012)

  15. #26
    Posted by Dason

    That is very dependent on the distribution being exactly normal. Chebyshev's inequality says that we can guarantee that the probability that an observation is more than 3 standard deviations away from the mean is at most 1/9 (approximately 11%).
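
    To put the two figures side by side, here is a small sketch that computes the normal-theory tail probability and the Chebyshev bound for a cutoff of 3 standard deviations. The sample size of 1000 is the illustrative figure used earlier in the thread; the variable names are made up.

    ```sas
    data _null_;
      k = 3;                              /* cutoff in standard deviations        */
      n = 1000;                           /* illustrative sample size             */
      p_normal = 2 * (1 - probnorm(k));   /* two-sided normal tail, about 0.0027  */
      p_cheb   = 1 / k**2;                /* Chebyshev upper bound, 1/9           */
      exp_normal = n * p_normal;          /* expected count per variable if normal */
      max_cheb   = n * p_cheb;            /* upper bound on expected count per variable, any distribution */
      put p_normal= p_cheb= exp_normal= max_cheb=;
    run;
    ```

    The values are written to the SAS log. The gap between the two counts shows how much room there is for more than a handful of |z| > 3 values when a variable is skewed or heavy-tailed.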

  16. #27

    That command printed 42 observations. My N=1200.

    .003*1200 = 3.6

    3.6 * 4 (i.e. the 4 continuous predictors. doesn't include categorical and interactions) =14.4
    Last edited by StatsClue; 02-11-2012 at 07:14 PM.
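
    When the observed count (42) is well above the normal-theory figure (14.4), it can help to see which variable contributes the flags. A rough sketch, assuming temp holds the standardized values and that the four continuous predictors are named continuousvar1, continuousvar2, age and weight; the dataset flags and the variables flag1-flag4 are made up for illustration.

    ```sas
    data flags;
      set temp;
      array z{*} continuousvar1 continuousvar2 age weight;
      array f{4} flag1-flag4;
      do i = 1 to dim(z);
        f{i} = (abs(z{i}) > 3);   /* 1 if beyond +/-3, otherwise 0 */
      end;
      drop i;
    run;

    /* The sum of each flag is the number of |z| > 3 values for that variable */
    proc means data=flags sum;
      var flag1-flag4;
    run;
    ```

    A variable with far more than 3 or 4 flags is probably skewed or heavy-tailed, in which case the 0.3% figure does not apply to it.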

  17. #28

    I had another question about the identified outliers. It struck me that I'd identified the 42 outliers based on this:
    proc standard mean=0 std=1 data= statsclue out=temp;
    var continuousvar1 continuousvar2 age weight;
    run;
    proc print data=temp(where=(continuousvar1 >3 or continuousvar2 >3 or age>3 or weight>3 or continuousvar1<-3 or continuousvar2 <-3 or age<-3 or weight<-3));
    run;

    but shouldn't outliers of concern be based on the OUTCOME, i.e. DOSE? If they are to be based on variables, the above is basing them on continuous variables only, ignoring the categorical ones. Is this the right thing to do? OR should I have included the dose i.e. outcome as well, in the var statement and forgotten about the categorical variables since 'means' won't work on them? Thanks.

  18. #29

    Including dose should be a good idea. At times, while building predictive models, I remove the bottom 1% & top 1% of DV observations.
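
    One way the 1% trimming could be done, as a sketch only: it assumes the outcome variable is named dose and lives in statsclue; the dataset names pct and trimmed are made up for illustration.

    ```sas
    /* Get the 1st and 99th percentiles of the outcome; with pctlpre=p */
    /* they are stored in the variables p1 and p99.                    */
    proc univariate data=statsclue noprint;
      var dose;
      output out=pct pctlpts=1 99 pctlpre=p;
    run;

    /* Keep only observations whose dose lies between the two percentiles */
    data trimmed;
      if _n_ = 1 then set pct;   /* p1 and p99 are retained for every row */
      set statsclue;
      if p1 <= dose <= p99;      /* rows with a missing dose are dropped too */
      drop p1 p99;
    run;
    ```

    Trimming changes the population the model describes, so it is worth comparing the fit on statsclue and on trimmed before settling on either.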

    Including categorical variables doesn't make sense because they are just categories; if you plot the DV against a categorical variable you'll see that the points all fall on straight vertical lines. Think conceptually: what would a categorical outlier be? There may be cases where a categorical variable has some levels with relatively very low frequencies, or has many levels (rule of thumb: more than 20). In that case, it is good to merge categories to reduce their number. Categories can be merged by some sort of clustering algorithm or CHAID.

  19. The Following User Says Thank You to jrai For This Useful Post:

    StatsClue (02-13-2012)
