+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 15 of 18

Thread: Removing outliers using SAS

  1. #1
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Removing outliers using SAS




    I know "proc means" will give me the mean and standard deviation of a set of data.

    Is there any code that can remove outliers from a set of data? Outliers are defined to be values that are more or less than two standard deviations away from the mean.

  2. #2
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    Use proc standard. This will give you the standard normal distribution of your variables. Remove the variables outside the range of -3 to 3.

    This only treats univariate outliers. For multivariate outliers look at Mahalanobis distances. This can be done using proc princomp using std option in the statement. Other method is to do regress any unwanted variable (say row number) on the variables of interest (IVs) & calculate INFLUENCE statistic. This can be done in proc reg.

    proc reg data=;
    model row_num= a b c/ influence;
    run;

    This will give you a variable called leverage denoted in output as Hat diag H. Mahalanobis distance= MD=(n-1) (hat_diag_h-1/n). MD has a chi-sq distribution with degree of freedom =# of variables. This will give you the cut off beyond which you can label the observations as outliers.

  3. #3
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    Will this remove outliers in the cell only or the entire row in which the outlier resides?

  4. #4
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    The code mentioned above is for identifying the outliers & not for deleting them. Do you want to remove the entire observation or just that cell?

  5. #5
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    I'm doing multivariate regression with all the data. Do you recommend I remove the whole row or just the cell?

  6. #6
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    Also some of the data in some cells is missing for some rows. do you recommend I remove those rows too?

  7. #7
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    Treatment of outliers is a subjective issue. You must investigate why these outliers are there? There is not always the case to remove them but if you've to remove them delete the entire row.

    You can use where condition to subset your dataset. For multivariate outliers, calculate MD & merge that with the observations in the same dataset & then subset the obs based on the cut off for MD.

  8. The Following User Says Thank You to jrai For This Useful Post:

    david_q (01-31-2012)

  9. #8
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    Could you provide me with the code?

    proc reg data=table;
    model row_num= a b c/ influence;
    model x = a b c;
    run;

    ?

  10. #9
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    First save the influence statistics in a table:
    ods output OutputStatistics=table1;

    Run proc reg:

    proc reg data=table;
    model row_num= a b c/ influence;
    model x = a b c;
    run;

    Close the ods output:
    ods output close;

    Calculate MD:

    data table1(keep=md);
    set table1 nobs=n; *nobs=n will store the sample size in variable n;
    md=(n-1)*((hat_diag_h-1/n); *Check the exact name of the variable hat_diag_h using proc contents before using it here. I know that the label is 'Hat Diag H' but I don't
    remember the variable name;
    run;

    Merge the two datasets:

    data table;
    set table;
    set table1;
    run;

    Note that this process will be sensitive to missing values. Say your original dataset i.e. table has 10 obs but obs 1 & 2 had missing variable values. Now these will be excluded while calculating the output statistics. Now table1 will have 8 observations & the above code will automatically match md of obs 3 with obs 1 in dataset, md of obs 4 with obs2 & so on. Therefore, before running the whole program either treat missing values or exclude the observations with missing values.

    Now you've MD for each observation in your dataset. Now calculate the cutoff of MD. You can quickly do this in excel with CHIINV function. =CHIINV(probability,df). Probability here should be 0.05 & df i.e. degrees of freedom equals the number of variables (=3 in our case because we've 3 variables a,b & c). Therefore, =CHIINV(0.05,3)=7.81.

    Remove values greater than 7.81.

    data table (drop=md);
    set table (where=(md<=7.81));
    run;

    I'd say before removing outliers study them carefully & see if intuitively they really make sense. For example, your variables are jump distance & age. You might find that for a particular obs md>cutoff value. The jump distance for that obs=6 feet, which is not abnormal. But when you see age it equals 5 years. Definitely, 6 feet distance is not possible for a 5 year old child & hence it makes sense to discard the obs as an outlier. The whole idea is to investigate & justify whether outliers really make a case to be discarded or were these actual behaviors & need to stay in the model to enable the model to predict such instances in future.
    Last edited by jrai; 01-31-2012 at 12:54 PM.

  11. The Following User Says Thank You to jrai For This Useful Post:

    david_q (02-11-2012)

  12. #10
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    Jrai,
    I know you have done a lot for me alr but I need one more help.

    Right now I have a model with 1 dependent variable and 20 independent variables.

    So my code is

    proc reg data=Regression;
    model DEPD = A B C D E F G H I J K L M N O P Q R S T;
    run;

    Regression is my file name, DEPD is the dependent variable and A B C D E F G H I J K L M N O P Q R S T are the independent variables.

    How can I modify the code so that the regression only regresses for data within + or - 3 standard deviations of the mean for every value? All other values are ignored?

    e.g. on a smaller scale

    DEPD A B C D
    4567 9 2 7 8
    7567 12 3 9 0
    9875 3 2 67 7
    5635 9 12 7 88
    2568 9 2 7 8
    3547 9 2 7 8

    So the values I want SAS to regress are

    DEPD A B C D
    4567 9 2 7 8
    7567 12 3 9 0
    2568 9 2 7 8
    3547 9 2 7 8
    (elimination done by eyeballing so may not be accurate)

    i.e. the entire row is disregarded if one value in the column is more or less than 3 standard deviation from the mean for DEPD and A to T.

    Is this possible?

  13. #11
    Points: 851, Level: 15
    Level completed: 51%, Points required for next Level: 49

    Posts
    61
    Thanks
    27
    Thanked 1 Time in 1 Post

    Re: Removing outliers using SAS

    There are no missing values in the data.

  14. #12
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    One way is to find the SDs using proc standard, merge them with DV & then do the elimination.
    proc standard mean=0 std=1 data=regression out=temp(keep=depd); *This step finds the standard normal score i.e. z-score of depd;
    var depd;
    run;

    data regression; *This step merges standard normal score with the original data set. Note: ensure that the order of the observations in the two datasets is the same;
    set regression;
    set temp(rename=(depd=standard_depd));
    run;

    proc reg data=Regression(where=(standard_depd<3 and standard_depd>-3)); *This step excludes the observations where standard deviation is more than 3 or less than -3;
    model DEPD = A B C D E F G H I J K L M N O P Q R S T;
    run;
    Last edited by jrai; 02-11-2012 at 11:31 AM.

  15. #13
    Points: 2,626, Level: 31
    Level completed: 18%, Points required for next Level: 124

    Location
    Dallas, TX
    Posts
    311
    Thanks
    12
    Thanked 94 Times in 93 Posts

    Re: Removing outliers using SAS

    Alternative way is to calculate standard deviation directly:

    proc means data=regression;
    var depd;
    output out=temp(keep=depd _stat_ where=(_stat_="STD")); *The output dataset temp will contain the standard deviation of depd;
    run;

    data _null_;
    set temp(keep=depd);
    call symput("upper_limit",3*depd); *This creates a macro variable with the upper limit;
    call symput("lower_limit",-3*depd); *This creates a macro variable with the lower limit;
    run;

    proc reg data=Regression(where=(depd<&upper_limit and depd>&lower_limit)); *This step excludes the observations where standard deviation is more than 3 or less than -3;
    model DEPD = A B C D E F G H I J K L M N O P Q R S T;
    run;

  16. #14
    Points: 3, Level: 1
    Level completed: 5%, Points required for next Level: 47

    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Removing outliers using SAS

    Hi everyone,
    I have this set of data and I am looking for a SAS macro to find the outliers.
    Can anyone help me with this please?

    I would appreciate if someone can write me the macro using this set of data. Thanks

    Data
    0.26
    0.98
    0.74
    0.33
    0.23
    0.49
    0.33
    0.68
    1.09
    1.07
    0.85
    1.22
    0.20
    0.35
    0.30
    0.38
    0.31
    0.67

  17. #15
    Test of Gnomality
    Points: 14,202, Level: 77
    Level completed: 38%, Points required for next Level: 248
    hlsmith's Avatar
    Posts
    2,675
    Thanks
    168
    Thanked 445 Times in 433 Posts

    Re: Removing outliers using SAS


    As for the original post, jrai provided some great information. However, you never really say why you wanted to eliminate the outlying values - I wonder what their purpose was and whether they realize that once removing those values > or < 3 stds, that now there are new observations in the set that are < or > 3 stds away from the mean, this may be beside the point.

    canadian, what is the purpose of finding the outliers and how are you defining outliers. Also, I don't think you really need a macro to do this. You can review the previously posted code to come up with a solution.
    Disregard the number of posts I have on this forum => I likely have no idea what I am writing about!

+ Reply to Thread
Page 1 of 2 1 2 LastLast

           




Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts






Advertise on Talk Stats