Calculation of the average leverage when a predictor is categorical

gianmarco

TS Contributor
#1
Hello,
I was reading different resources about regression diagnostics, in particular for logistic regression.
As for leverage, the sources suggest looking for observations with higher-than-average leverage.

Where I am confused is how the mean leverage is calculated.
One source suggests: (k+1)/N
where k = number of predictors, N = sample size

My questions:
1) if one of the predictors is categorical, do we also have to count the levels of the categorical predictor in k?
2) do we also have to count the intercept (I think not)?

As for a practical example, given the dataset and the model below, how would you calculate the average leverage?
Code:
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)

head(mydata)
#>   admit gre  gpa rank
#> 1     0 380 3.61    3
#> 2     1 660 3.67    3
#> 3     1 800 4.00    1
#> 4     1 640 3.19    4
#> 5     0 520 2.93    4
#> 6     1 760 3.00    2

fit <- glm(admit ~ gre + gpa + rank, data=mydata, family=binomial(logit))

What I am wondering is whether, when counting the number of predictors (i.e., determining k), we also have to count the levels of categorical predictors.
In other words, if we have 1 continuous predictor and 1 categorical predictor with 3 levels, would k be:
2 (i.e., 1 continuous predictor + 1 categorical predictor)
or
3 (i.e., 1 continuous predictor + 2 dummy variables, the levels of the categorical predictor minus one because of dummy coding)?

Thanks for any clarification
gm
 

hlsmith

Omega Contributor
#3
"Leverage measures the potential impact of an individual case on the results, which is directly proportional to how far an individual case is from the centroid in the space of the predictors. Leverage is computed as the diagonal elements, $h_{ii}$, of the "Hat" matrix, $\mathbf{H}$,
$$\mathbf{H} = \mathbf{X}^{*} \left( \mathbf{X}^{*\prime} \mathbf{X}^{*} \right)^{-1} \mathbf{X}^{*\prime}$$
where $\mathbf{X}^{*} = \mathbf{V}^{1/2} \mathbf{X}$, and $\mathbf{V} = \mathrm{diag}\{ \hat{P} (1 - \hat{P}) \}$. As in OLS, leverage values are between 0 and 1, and a leverage value $h_{ii} > 2k/n$ is considered "large"; k = number of predictors, n = number of cases."

Taken from: http://www.datavis.ca/courses/grcat/grc6.html
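
The hat-matrix formula in the quote can be checked numerically against R's built-in hatvalues(). What follows is only a minimal sketch on simulated stand-in data: the variables x, g and y are hypothetical and are not the admissions data from the first post. The diagonal of V is taken from the working weights that glm() stores, which for a logit model with a 0/1 response are P-hat * (1 - P-hat) as evaluated in the final fitting iteration.

```r
# Simulated stand-in data (hypothetical; not the admissions file from the thread)
set.seed(1)
n <- 200
d <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x))

fit <- glm(y ~ x + g, data = d, family = binomial)

X <- model.matrix(fit)                 # design matrix: intercept, x, dummy columns for g
v <- weights(fit, type = "working")    # diagonal of V: P_hat * (1 - P_hat) from the last IWLS step
X_star <- sqrt(v) * X                  # V^{1/2} X: row i of X is scaled by sqrt(v[i])

# H = X* (X*' X*)^{-1} X*'
H <- X_star %*% solve(crossprod(X_star)) %*% t(X_star)
h_manual <- diag(H)

# Compare the manual leverages with R's own
all.equal(unname(h_manual), unname(hatvalues(fit)))
```

Forming the full n x n matrix H is fine for a small example like this; for large data sets hatvalues(fit) avoids building it.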


In my opinion, you would not include the intercept in the count, and yes, you would account for the levels of a categorical predictor with three or more groups. So TS status (human, bot, raptor) would count as 2 predictors. I still regularly use SAS, so feel free to post R code for my edification.
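
Since R code was invited, here is one way to check the counting empirically. The leverages sum to the trace of the hat matrix, which equals the number of estimated coefficients, so the mean leverage is that number divided by N. The sketch below uses simulated data (the data frame d and the variables score, status and y are hypothetical) with one continuous predictor and one 3-level factor. If k is taken as the number of non-intercept columns of the design matrix (dummy columns counted individually), the mean comes out as (k+1)/N, with the "+1" being the intercept.

```r
# Simulated data (hypothetical): 1 continuous predictor + 1 factor with 3 levels
set.seed(42)
n <- 300
d <- data.frame(
  score  = rnorm(n),
  status = factor(sample(c("human", "bot", "raptor"), n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(0.3 * d$score))

fit <- glm(y ~ score + status, data = d, family = binomial)

X <- model.matrix(fit)
colnames(X)            # intercept, score, and two dummy columns for status
p <- ncol(X)           # number of estimated coefficients, intercept included
k <- p - 1             # predictor columns after dummy coding, intercept excluded

h <- hatvalues(fit)
sum(h)                 # matches p up to rounding error
mean(h)                # matches (k + 1) / n
(k + 1) / n

# Observations above twice the mean leverage (one common rule of thumb)
which(h > 2 * mean(h))
```

By the same counting, the model in the first post would have 1 (intercept) + 2 (gre, gpa) + 3 (rank dummies) = 6 coefficients, assuming rank has exactly four levels, so its mean leverage would be 6/N.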