1. Re: Multivariate normality

An interesting approach to outliers for logistic regression

First, we run a baseline model including all cases.
Second, we run a model excluding outliers (cases whose standardized residual is greater than 3.0 or less than -3.0) and influential cases (whose Cook's distance is greater than 1.0).
If the model excluding outliers and influential cases has a better classification accuracy rate than the baseline model, we will interpret the revised model. If the revised model is less than 2% more accurate, we will interpret the baseline model.
I assume that when they talk about classification accuracy they mean the Hosmer-Lemeshow goodness-of-fit test, although I am not sure.
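To make the two-step procedure concrete, here is a rough numpy sketch on simulated data. The `fit_logit` and `diagnostics` helpers and the data are my own inventions for illustration; a real analysis would lean on a stats package's built-in influence measures rather than hand-rolled ones.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (X'WX)^{-1} X'(y - p)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

def diagnostics(X, y, beta):
    """Standardized Pearson residuals and the usual one-step Cook's distance approximation."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)
    XtWX_inv = np.linalg.inv(X.T @ (X * W[:, None]))
    # leverage h_ii from the GLM hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
    h = W * np.einsum('ij,jk,ik->i', X, XtWX_inv, X)
    r = (y - p) / np.sqrt(W)            # Pearson residuals
    r_std = r / np.sqrt(1.0 - h)        # standardized
    cooks = (r_std**2 / X.shape[1]) * h / (1.0 - h)
    return p, r_std, cooks

# simulated data, purely for illustration
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))).astype(float)
X = np.column_stack([np.ones(n), x])

beta = fit_logit(X, y)
p, r_std, cooks = diagnostics(X, y, beta)
acc_base = np.mean((p >= 0.5) == y)              # baseline classification accuracy

keep = (np.abs(r_std) <= 3.0) & (cooks <= 1.0)   # the thread's cutoffs
beta2 = fit_logit(X[keep], y[keep])              # revised model without flagged cases
p2 = 1.0 / (1.0 + np.exp(-(X[keep] @ beta2)))
acc_revised = np.mean((p2 >= 0.5) == y[keep])
```

With well-behaved data the `keep` mask may exclude nothing, in which case the two accuracy rates coincide and the baseline model would be interpreted anyway.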

2. Re: Multivariate normality

I'm guessing by classification accuracy they mean that you predict the outcome to be 0 if the predicted probability is less than .5 and predict it to be 1 if the predicted probability is >= .5 (I guess one could toy with the cutoff to find the optimal accuracy). Then you can look at the proportion of cases you correctly predicted - call that your classification accuracy.
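That interpretation takes only a few lines to write down; the function name and toy numbers here are mine, just for illustration.

```python
import numpy as np

def classification_accuracy(p_hat, y, cutoff=0.5):
    """Proportion of cases whose predicted class (p_hat >= cutoff) matches the outcome."""
    return np.mean((p_hat >= cutoff).astype(int) == y)

p_hat = np.array([0.20, 0.80, 0.60, 0.40])   # predicted probabilities
y     = np.array([0,    1,    1,    1])      # observed outcomes

acc = classification_accuracy(p_hat, y)      # 3 of 4 correct at the 0.5 cutoff -> 0.75

# "toying with the cutoff": pick the cutoff with the best in-sample accuracy
cutoffs = np.linspace(0.05, 0.95, 19)
best = max(cutoffs, key=lambda c: classification_accuracy(p_hat, y, c))
```

Note that tuning the cutoff on the same data you evaluate on will overstate the accuracy; a holdout set is safer if the goal is honest prediction.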

3. Re: Multivariate normality

Probably. Do you think that is a reasonable approach?

4. Re: Multivariate normality

An interesting take on outlier analysis, similar I think to what Lazar suggested...

The Figure is a residual plot for the adjusted model. The horizontal axis shows the predicted probability of angina for each observation; the vertical axis shows the Pearson residual. The size of the plotted circle is proportional to the Cook’s distance for the observation. The higher curve is of subjects who developed angina, and the lower curve is of subjects who did not. Because the number of subjects who developed angina is smaller, their observations are generally more influential, and their circles tend to be larger. From the Figure, we can identify several possible problems. First, there are 2 observations with predicted probabilities of angina between 0.75 and 0.80. These come from 2 subjects with unusually high cholesterol values (600 and 696 mg/dL). The subject with 696 mg/dL did not develop angina, making a rather poor fit to the model and the most influential observation in these data, shown by having the largest circle. There are also subjects who developed angina despite having a very low predicted probability in the model. The low predicted probabilities for these subjects were primarily due to low cholesterol values. The mismatch between the observed angina rates and low predicted probability of angina in the regression model for these subjects creates large residuals, and these are the points in the upper left region of the Figure. A substantial number of these subjects have residual values >3 and might be considered outliers.
They looked for outliers and then tried to find what was unusual about them (or rather why they were unusual).

http://circ.ahajournals.org/content/117/18/2395.full

They then suggest a sensitivity analysis by removing points associated with the elements that make the outlier unusual [such as unusual cholesterol counts] and seeing what that does to the regression results.
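A minimal sketch of that kind of sensitivity analysis on simulated data: fit the model with and without the unusual observations and compare the coefficients. The simulated cholesterol values echo the paper's description (two implausibly high values, 600 and 696 mg/dL, the latter without angina), but everything else here is invented for illustration.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    # plain Newton-Raphson logistic fit; X includes an intercept column
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    return b

rng = np.random.default_rng(1)
n = 300
chol = rng.normal(220, 40, n)                  # hypothetical cholesterol values, mg/dL
chol[:2] = [600.0, 696.0]                      # the two unusually high values from the paper
p_true = 1.0 / (1.0 + np.exp(-(-6 + 0.025 * chol)))
y = (rng.random(n) < p_true).astype(float)
y[1] = 0.0                                     # the 696 mg/dL subject did NOT develop angina
X = np.column_stack([np.ones(n), chol / 100])  # rescale predictor for numerical stability

b_all  = fit_logit(X, y)                       # fit on all cases
b_sens = fit_logit(X[2:], y[2:])               # sensitivity fit without the unusual values
```

If the coefficients (and the substantive conclusions) barely move, the outliers are not driving the results; if they shift materially, that is worth reporting alongside the main fit.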

5. Re: Multivariate normality

From two of Dason's favorite authors, Tabachnick and Fidell. It's on page 74, which deals with data cleanup.

"Transformations are undertaken to improve the normality of the distributions and to pull univariate outliers closer to the center of the distribution, therefore reducing their impact. Transformations, if acceptable, are undertaken prior to the search for multivariate outliers because the statistics used to reveal them [multivariate outliers] (Mahalanobis distance and its variants) are also sensitive to failure of normality."

Transformations, such as logging, are therefore tied primarily to univariate analysis (which assumes, of course, that univariate normality matters; otherwise you would not transform based on it). This is pretty common in texts.

This comes from the 5th edition of "Using Multivariate Statistics." The fact that it is in its fifth edition suggests it is pretty popular with professors, who drive much of the textbook market.

They go on to say on p. 87: "With almost every data set in which we have used transformations, the results of analysis have been substantially improved. This is particularly true when some variables are skewed and others are not, or variables are skewed very differently prior to transformation."

So even when you are not required to deal with normality, they feel it improves the results to do so (and again, their transformations appear to be tied to univariate analysis, not multivariate analysis).
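Their recommended ordering -- transform the skewed variable first, then screen with Mahalanobis distance -- might be sketched like this. The data are purely illustrative, and skewness is computed by the usual standardized-third-moment formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
raw = rng.lognormal(mean=3.0, sigma=0.8, size=n)   # a strongly right-skewed variable

def skewness(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

s_before = skewness(raw)
s_after  = skewness(np.log(raw))   # logging pulls the long right tail in

# Mahalanobis distances computed AFTER transforming, per Tabachnick & Fidell's ordering
X = np.column_stack([np.log(raw), rng.normal(size=n)])  # second variable just for illustration
mu = X.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu)     # squared Mahalanobis distances
```

Cases with unusually large `d2` (often judged against a chi-square critical value with degrees of freedom equal to the number of variables) are the multivariate-outlier candidates.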

6. Re: Multivariate normality

Take a look at Henze-Zirkler's Multivariate Normality Test.
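For the curious, the HZ statistic itself is short enough to sketch in numpy. This is only the test statistic, following the usual formula with the biased (1/n) covariance and smoothing parameter beta; the lognormal null approximation used to get a p-value is omitted, so in practice you would reach for an existing implementation such as the R MVN package or Python's pingouin.

```python
import numpy as np

def henze_zirkler(X):
    """Henze-Zirkler multivariate-normality statistic (sketch; no p-value)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # biased (1/n) covariance
    # pairwise squared Mahalanobis distances D_jk, and distances to the mean D_j
    Y = Xc @ Sinv @ Xc.T
    d = np.diag(Y)
    Djk = d[:, None] + d[None, :] - 2.0 * Y
    Dj = d
    b = ((n * (2 * p + 1)) / 4) ** (1.0 / (p + 4)) / np.sqrt(2)  # smoothing parameter
    t1 = np.sum(np.exp(-(b**2) / 2 * Djk)) / n**2
    t2 = 2 * (1 + b**2) ** (-p / 2) * np.mean(np.exp(-(b**2) * Dj / (2 * (1 + b**2))))
    t3 = (1 + 2 * b**2) ** (-p / 2)
    return n * (t1 - t2 + t3)

rng = np.random.default_rng(3)
hz = henze_zirkler(rng.normal(size=(100, 3)))  # small for multivariate-normal data
```

Larger values of the statistic indicate stronger departure from multivariate normality.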

