I know "proc means" will give me the mean and standard deviation of a set of data.

Is there any code that can remove outliers from a set of data? Outliers are defined to be values that are more or less than two standard deviations away from the mean.

Use proc standard. This will give you the standard normal distribution of your variables. Remove the variables outside the range of -3 to 3.

This only treats univariate outliers. For multivariate outliers look at Mahalanobis distances. This can be done using proc princomp using std option in the statement. Other method is to do regress any unwanted variable (say row number) on the variables of interest (IVs) & calculate INFLUENCE statistic. This can be done in proc reg.

proc reg data=;
model row_num= a b c/ influence;
run;

This will give you a variable called leverage denoted in output as Hat diag H. Mahalanobis distance= MD=(n-1) (hat_diag_h-1/n). MD has a chi-sq distribution with degree of freedom =# of variables. This will give you the cut off beyond which you can label the observations as outliers.

Treatment of outliers is a subjective issue. You must investigate why these outliers are there? There is not always the case to remove them but if you've to remove them delete the entire row.

You can use where condition to subset your dataset. For multivariate outliers, calculate MD & merge that with the observations in the same dataset & then subset the obs based on the cut off for MD.

First save the influence statistics in a table:
ods output OutputStatistics=table1;

Run proc reg:

proc reg data=table;
model row_num= a b c/ influence;
model x = a b c;
run;

Close the ods output:
ods output close;

Calculate MD:

data table1(keep=md);
set table1 nobs=n; *nobs=n will store the sample size in variable n;
md=(n-1)*((hat_diag_h-1/n); *Check the exact name of the variable hat_diag_h using proc contents before using it here. I know that the label is 'Hat Diag H' but I don't
remember the variable name;
run;

Merge the two datasets:

data table;
set table;
set table1;
run;

Note that this process will be sensitive to missing values. Say your original dataset i.e. table has 10 obs but obs 1 & 2 had missing variable values. Now these will be excluded while calculating the output statistics. Now table1 will have 8 observations & the above code will automatically match md of obs 3 with obs 1 in dataset, md of obs 4 with obs2 & so on. Therefore, before running the whole program either treat missing values or exclude the observations with missing values.

Now you've MD for each observation in your dataset. Now calculate the cutoff of MD. You can quickly do this in excel with CHIINV function. =CHIINV(probability,df). Probability here should be 0.05 & df i.e. degrees of freedom equals the number of variables (=3 in our case because we've 3 variables a,b & c). Therefore, =CHIINV(0.05,3)=7.81.

Remove values greater than 7.81.

data table (drop=md);
set table (where=(md<=7.81));
run;

I'd say before removing outliers study them carefully & see if intuitively they really make sense. For example, your variables are jump distance & age. You might find that for a particular obs md>cutoff value. The jump distance for that obs=6 feet, which is not abnormal. But when you see age it equals 5 years. Definitely, 6 feet distance is not possible for a 5 year old child & hence it makes sense to discard the obs as an outlier. The whole idea is to investigate & justify whether outliers really make a case to be discarded or were these actual behaviors & need to stay in the model to enable the model to predict such instances in future.

Jrai,
I know you have done a lot for me alr but I need one more help.

Right now I have a model with 1 dependent variable and 20 independent variables.

So my code is

proc reg data=Regression;
model DEPD = A B C D E F G H I J K L M N O P Q R S T;
run;

Regression is my file name, DEPD is the dependent variable and A B C D E F G H I J K L M N O P Q R S T are the independent variables.

How can I modify the code so that the regression only regresses for data within + or - 3 standard deviations of the mean for every value? All other values are ignored?

e.g. on a smaller scale

DEPD A B C D
4567 9 2 7 8
7567 12 3 9 0
9875 3 2 67 7
5635 9 12 7 88
2568 9 2 7 8
3547 9 2 7 8

So the values I want SAS to regress are

DEPD A B C D
4567 9 2 7 8
7567 12 3 9 0
2568 9 2 7 8
3547 9 2 7 8
(elimination done by eyeballing so may not be accurate)

i.e. the entire row is disregarded if one value in the column is more or less than 3 standard deviation from the mean for DEPD and A to T.

One way is to find the SDs using proc standard, merge them with DV & then do the elimination.
proc standard mean=0 std=1 data=regression out=temp(keep=depd); *This step finds the standard normal score i.e. z-score of depd;
var depd;
run;

data regression; *This step merges standard normal score with the original data set. Note: ensure that the order of the observations in the two datasets is the same;
set regression;
set temp(rename=(depd=standard_depd));
run;

proc reg data=Regression(where=(standard_depd<3 and standard_depd>-3)); *This step excludes the observations where standard deviation is more than 3 or less than -3;
model DEPD = A B C D E F G H I J K L M N O P Q R S T;
run;

Alternative way is to calculate standard deviation directly:

proc means data=regression;
var depd;
output out=temp(keep=depd _stat_ where=(_stat_="STD")); *The output dataset temp will contain the standard deviation of depd;
run;

data _null_;
set temp(keep=depd);
call symput("upper_limit",3*depd); *This creates a macro variable with the upper limit;
call symput("lower_limit",-3*depd); *This creates a macro variable with the lower limit;
run;

proc reg data=Regression(where=(depd<&upper_limit and depd>&lower_limit)); *This step excludes the observations where standard deviation is more than 3 or less than -3;
model DEPD = A B C D E F G H I J K L M N O P Q R S T;
run;

As for the original post, jrai provided some great information. However, you never really say why you wanted to eliminate the outlying values - I wonder what their purpose was and whether they realize that once removing those values > or < 3 stds, that now there are new observations in the set that are < or > 3 stds away from the mean, this may be beside the point.

canadian, what is the purpose of finding the outliers and how are you defining outliers. Also, I don't think you really need a macro to do this. You can review the previously posted code to come up with a solution.