Valid use of regression


No cake for spunky
I have a dependent variable, which is the status a case closes in (it's a dummy variable). My predictor is how many months of experience the counselor associated with the case has. There are about 11,5000 cases but only about 400 distinct counselors, since each counselor has many cases. So every case has its own value of the DV, but the number of distinct predictor values is much smaller than that. Does linear regression assume the predictor values are independent of each other in this case? (I am not sure whether the independence assumption applies in this fashion or not.) This is a really basic question, but I never thought about it before. :p

I guess I am asking whether you have to run a different type of regression or interpret the results differently in this situation. I was going to look online, but I could not even think of what topic covers this.


Global Moderator
Sounds like you want to look at a GLM (binomial), if your DV is 1 (closed) or 0 (not). Second, all regressions assume that the predictor variables are not highly correlated with each other; otherwise you run into issues of collinearity.

Does this help?


No cake for spunky
For my DV I was using logistic regression, which I think would be OK based on your comments. I only have one predictor, so multicollinearity is not an issue. My concern is whether I violate the assumption of independence, because the predictor comes from counselors who each handle multiple cases. So one counselor (the source of the predictor) is attached to more than one of the dependent variable cases. Does that violate a regression assumption or invalidate the use of logistic regression?

In all the years of reading about regression I have not seen that addressed.


No cake for spunky
An answer I received elsewhere, which was totally new to me.

Your question is about cluster-robust inference, and the short answer is that typically, this does not change your estimate of a parameter (such as β in a linear regression or a logistic regression), but it will affect your standard errors. Two assumptions that are commonly made are that the errors are uncorrelated across observations, and that the variance of the error term is constant (this is called homoskedasticity).
In the case of errors that are uncorrelated but have different variances (heteroskedasticity), the extension for linear regressions is to compute White standard errors.
In your case, the issue is that the errors are indeed correlated across observations, but in a particular way: they are correlated among the cases that share a counselor. This is called clustered errors, and many methods exist to accommodate clustering. See this Stack Overflow post for some R packages that allow for clustering in logistic regressions.
Additionally, I highly suggest you take a look (at least at the intro and first few sections) at this excellent introduction to clustered errors.
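
To make the point about "same estimate, different standard errors" concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated (40 hypothetical counselors with 30 cases each, plus a made-up counselor-level effect), the function names are my own, and the clustered variance is the basic sandwich formula without the small-sample corrections that statistical packages usually apply. It is not the method from the linked posts, just one way to see the idea.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, iters=25):
    """Newton-Raphson for P(y=1) = sigmoid(b0 + b1*x) with one predictor."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            r, w = yi - p, p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # beta <- beta + H^{-1} * gradient, with H the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def logit_standard_errors(x, y, groups, b0, b1):
    """Return (naive SEs, cluster-robust SEs) for (b0, b1)."""
    h00 = h01 = h11 = 0.0
    scores = {}  # cluster id -> score vector summed over that cluster's cases
    for xi, yi, g in zip(x, y, groups):
        p = sigmoid(b0 + b1 * xi)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
        s = scores.setdefault(g, [0.0, 0.0])
        s[0] += yi - p
        s[1] += (yi - p) * xi
    det = h00 * h11 - h01 * h01
    bread = [[h11 / det, -h01 / det], [-h01 / det, h00 / det]]
    naive = (math.sqrt(bread[0][0]), math.sqrt(bread[1][1]))
    # "Meat": sum over clusters of the outer product of the cluster scores
    meat = [[sum(s[i] * s[j] for s in scores.values()) for j in range(2)]
            for i in range(2)]
    v = matmul2(matmul2(bread, meat), bread)  # sandwich: bread * meat * bread
    clustered = (math.sqrt(v[0][0]), math.sqrt(v[1][1]))
    return naive, clustered

# Simulated data (assumption): experience is constant within a counselor,
# and each counselor has an unobserved effect u shared by all their cases.
random.seed(0)
x, y, groups = [], [], []
for counselor in range(40):
    months = random.randint(1, 60)
    u = random.gauss(0.0, 1.0)
    for _ in range(30):
        p_close = sigmoid(-0.5 + 0.03 * months + u)
        x.append(float(months))
        y.append(1 if random.random() < p_close else 0)
        groups.append(counselor)

b0, b1 = fit_logit(x, y)
se_naive, se_cluster = logit_standard_errors(x, y, groups, b0, b1)
print("slope estimate:", b1)
print("naive SE:", se_naive[1], "cluster-robust SE:", se_cluster[1])
```

The coefficients come from the same fit in both cases; only the variance calculation differs. When the predictor is fixed within a counselor and cases from the same counselor share an unobserved effect, the clustered standard error on the slope would be expected to come out larger than the naive one.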


No cake for spunky
What happens if I delete spunky's before you delete mine? So the board gets better with my posts gone and worse with spunky's gone, and I have about 100 times the posts he does (probably more).


Ambassador to the humans
You have fewer than 4x as many posts as he does. I'm also not sure how spunky's posts are relevant. I'm totally in favor of deleting all of @spunky's posts.