# Thread: Dealing with violations of the assumptions.

1. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those I run the data for want to know, whether the results are significant or not. The literature I have seen in the social sciences stresses the same: this variable was significant or was not. So the test, not the effect size, is the driving force. Which variable is relatively more important is critical to me, but slopes don't tell you that (as Dason has reminded me on more than one occasion).
The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
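To make the standardized-betas point concrete, here is a minimal sketch with simulated (entirely made-up) data, where one variable's raw slope looks huge only because of its measurement scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x1 drives most of the variation in y, but x2 is
# measured on a tiny scale, so its raw slope looks much larger.
n = 500
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 0.01, n)
y = 2.0 * x1 + 50.0 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
raw_beta = np.linalg.lstsq(X, y, rcond=None)[0]

def z(v):
    return (v - v.mean()) / v.std()

# Standardized betas: z-score everything and refit (no intercept needed).
Xz = np.column_stack([z(x1), z(x2)])
std_beta = np.linalg.lstsq(Xz, z(y), rcond=None)[0]

print(raw_beta[1:])  # raw slopes: x2's is far larger, misleadingly
print(std_beta)      # standardized: x1 clearly has the bigger impact
```

The raw slopes rank the variables one way and the standardized betas the other, which is the reason raw slopes alone can't be compared across predictors.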

2. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
I am not sure what the equivalent test would be for the data I run (which is not an industrial process).

I am not sure what you mean by plotting the residuals in time sequence or by fits.
These are residual diagnostic plots from Minitab. In addition to the normality test, residuals are plotted against fits to check for heteroskedasticity or curvature. And, if the data are taken in time sequence, the time-order plot is examined for shifts and trends that might indicate a latent variable. If the data are not in time sequence, this last graph is ignored.

This site has a good discussion on the residuals vs. fits analysis.

This site discusses both residuals vs. fits and residuals vs. time order.
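For readers without Minitab, the residuals-vs-fits idea can be sketched in a few lines (simulated data; the U-shaped residual pattern below is exactly what the plot would reveal by eye):

```python
import numpy as np

rng = np.random.default_rng(1)

# A straight-line fit to genuinely curved data: a residuals-vs-fits
# plot would show a U shape instead of a structureless band.
n = 200
x = np.linspace(0, 10, n)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fits = X @ beta
resid = y - fits

# A crude numeric stand-in for eyeballing the plot: sort residuals by
# fitted value and compare thirds. Curvature shows up as the middle
# third sitting well below the two ends.
lo, mid, hi = np.array_split(resid[np.argsort(fits)], 3)
print(lo.mean(), mid.mean(), hi.mean())
```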

3. ## The Following User Says Thank You to Miner For This Useful Post:

noetsi (04-24-2015)

4. ## Re: Dealing with violations of the assumptions.

Originally Posted by Injektilo
The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
There is some debate about whether standardized betas are adequate measures of relative impact even in linear regression. But the real problem occurs in logistic regression, where there are no agreed-upon standardized betas and where the ones that have been proposed generate significantly different results in some cases. Unfortunately for me, much of my analysis is with logistic regression, even though this thread is not about that.

5. ## Re: Dealing with violations of the assumptions.

Thanks, Miner. I knew of plotting residuals against fitted values. I had not heard of the time-series variant (probably because I mainly work in ESM, and assumptions are largely ignored in that form of time series).

6. ## Re: Dealing with violations of the assumptions.

Here is a point that has long puzzled me. Does this mean that if we have samples of, say, 300 (none of mine would ever be less than that) it does not matter if your data are non-normal for regression? I would say I have read this in many sources, which I have, but then we would have to define "many".

In the field of statistics, there are lots of methods that are practically guaranteed to work well if the data are approximately normally distributed and if all we are interested in are linear combinations of these normally distributed variables. In fact, if our sample sizes are large enough we can use the central limit theorem which tells us that we would expect means to converge on normality so we do not even need to have samples from a normal distribution as N increases. So if we have two groups of say 100 subjects each and we are interested in mean change from baseline of a variable then we have no need to worry and can apply standard statistical methods with only the most basic of checks for statistical validity.
http://www.lexjansen.com/phuse/2005/pk/pk02.pdf

I note this only applies to means. I assume it also applies to slopes, but I am not sure whether it applies to, say, standard errors.
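A quick simulation (hypothetical exponential data) shows the CLT effect the quote describes: the raw draws are heavily skewed, but their sample means are nearly symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)

# 10,000 sample means of n=100 draws from a strongly skewed
# (exponential) population. The raw data are far from normal;
# the sample means are not.
raw = rng.exponential(1.0, size=(10_000, 100))
means = raw.mean(axis=1)

# Skewness drops from ~2 in the raw data to near 0 in the means.
def skew(v):
    return ((v - v.mean())**3).mean() / v.std()**3

print(skew(raw.ravel()), skew(means))
```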

7. ## Re: Dealing with violations of the assumptions.

I intuitively feel if the means are converging then the SEs are converging.

You should have prefaced that this is in regard to bootstrapping; at least, that is what the link is about. This ends up being a sampling question. You get this convergence with repeated sampling even with non-normal data. Though if the variable in the population is non-normal, simply making the sample size bigger will not have the same normalizing effect that resampling does.

8. ## Re: Dealing with violations of the assumptions.

If I understand the author correctly, they are commenting on ANOVA in that passage, although the link in general is about bootstrapping. It is tied not to bootstrapping but to the central limit theorem. She uses bootstrapping only when the CLT does not apply.

9. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
Here is a point that has long puzzled me. Does this mean that if we have samples of say 300 (none of mine would ever be less than that) it does not matter if your data is non-normal for Regression?
If the errors are independent then as the sample size grows larger, the sampling distribution of the coefficients (e.g., the slopes) will converge to a normal distribution regardless of whether or not the errors are normally distributed. If the sampling distribution is normal, then confidence intervals and significance tests will be trustworthy. It's hard to put a number on exactly how big a sample is large enough that you can be sure that the sampling distribution will be sufficiently approximated by a normal distribution, because it does depend on stuff like what the distribution of the errors actually is. But with a sample size of 300 the error distribution would have to be something truly evil and pathological for it to make the slightest bit of difference. If I was in your shoes I'd worry much more about other assumptions like whether the errors have conditional means of zero, and whether they're independently and identically distributed. Breaches of these assumptions have much more serious consequences.
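A small simulation (made-up data with deliberately skewed errors) illustrates this claim: even with decidedly non-normal errors, the sampling distribution of an OLS slope at n = 300 is centered on the true value and close to symmetric.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sampling distribution of an OLS slope when the errors are heavily
# skewed (centered exponential). With n=300, the slope estimates are
# still very nearly normally distributed around the true value.
n, reps, true_slope = 300, 5000, 2.0
x = rng.normal(0, 1, n)
slopes = np.empty(reps)
for i in range(reps):
    e = rng.exponential(1.0, n) - 1.0   # mean-zero, strongly skewed
    y = 1.0 + true_slope * x + e
    slopes[i] = np.polyfit(x, y, 1)[0]  # fitted slope

def skew(v):
    return ((v - v.mean())**3).mean() / v.std()**3

print(slopes.mean(), skew(slopes))
```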

10. ## The Following User Says Thank You to CowboyBear For This Useful Post:

noetsi (04-24-2015)

11. ## Re: Dealing with violations of the assumptions.

I want to make sure I don't misunderstand this, CWB. Given sample sizes of 300 or more (and in honesty mine commonly have 1,000s), a non-normal distribution is not going to affect either the CI or the test of significance greatly. If so, well, I am still learning bootstrapping, so some good came out of this.

The assumptions I have focused on are multicollinearity, linearity, leverage (although with the size of the samples I have, it is unlikely any small set of points is going to move the regression line much), and equal error variance. Independence of observations is important as well, but my understanding is that there is no diagnostic that will catch it - you have to deal with this in your design, and I have no knowledge of or control over data collection. There is no reason to imagine my errors are not independent, but no way I can think of to verify that. The data are collected by hundreds of different counselors across the state, and analysis is run regularly to remove invalid results.

12. ## Re: Dealing with violations of the assumptions.

A follow-up question about bootstrapping. OK, so you generate, say, 1,000 replications of your data set and run a regression on each. So what is your slope for each IV? The average of the 1,000 slopes? (Given how means are handled in bootstrapping, I assume so, but I have not found anyone yet who says so.) I am not sure you would want to do this, because the point estimate is correct even without normality, given a reasonably sized sample.

More importantly, how do you conduct the test of statistical significance? You obviously have 1,000 p-values - but how do you go from that to whether the slope is statistically significant overall?

I guess you could calculate the CI from the bootstrap around the averaged slope (or around the original model you run without a bootstrap) and see if it contains 0. Somehow that does not seem right.
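For what it's worth, the usual percentile-bootstrap recipe for a regression slope can be sketched as follows (simulated data; the 1,000 replicates and 95% level are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data; in practice this would be your own sample.
n = 300
x = rng.normal(0, 1, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

# Slope from the original sample -- this is the estimate you report.
point_est = np.polyfit(x, y, 1)[0]

# Percentile bootstrap: resample rows with replacement, refit, and take
# the 2.5th and 97.5th percentiles of the replicate slopes as a 95% CI.
B = 1000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    boot_slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]

ci = np.percentile(boot_slopes, [2.5, 97.5])
print(point_est, ci)
# "Significant at the 5% level" here just means the CI excludes zero;
# you never average the 1,000 p-values.
```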

13. ## Re: Dealing with violations of the assumptions.

Originally Posted by noetsi
I want to make sure I don't misunderstand this, CWB. Given sample sizes of 300 or more (and in honesty mine commonly have 1,000s), a non-normal distribution is not going to affect either the CI or the test of significance greatly. If so, well, I am still learning bootstrapping, so some good came out of this.
Yep! And that's awesome

The assumptions I have focused on are multicolinearity, linearity, leverage (although with the size of the samples I have it is unlikely any small set of points is going to move the regression line much), and equal error variance.
The multicollinearity "assumption" is really just that there is no perfect collinearity between variables. Less severe multicollinearity isn't so much an assumption breach (if the other assumptions are met, the estimates remain consistent, unbiased and efficient, and significance tests and confidence intervals are trustworthy). It's just that multicollinearity increases your standard errors (they're not wrong, they're just larger than they would be if the multicollinearity weren't present). I'd agree with not worrying about leverage with such huge sample sizes, and again that isn't so much a distributional assumption per se as more of a general potential problem.
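A quick simulation (made-up data) shows the standard-error inflation: with two predictors correlated at about 0.95, the slope estimates stay unbiased but their spread roughly triples.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two designs with the same n and error variance. In the second, x1
# and x2 are correlated at ~0.95. Returns the mean and the spread
# (empirical SE) of the x1 slope across repeated samples.
def slope_sd(rho, reps=2000, n=200):
    b1 = np.empty(reps)
    for i in range(reps):
        x1 = rng.normal(0, 1, n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
        y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(0, 1, n)
        X = np.column_stack([np.ones(n), x1, x2])
        b1[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return b1.mean(), b1.std()

m0, s0 = slope_sd(0.0)    # uncorrelated predictors
m1, s1 = slope_sd(0.95)   # highly collinear predictors
print(m0, s0)
print(m1, s1)
```

Both means sit at the true slope of 0.5; only the standard error changes, which is exactly the "not wrong, just larger" point above.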

Independence of observations is important as well, but my understanding is that there is no diagnostic that will catch it - you have to deal with this in your design, and I have no knowledge of or control over data collection. There is no reason to imagine my errors are not independent, but no way I can think of to verify that. The data are collected by hundreds of different counselors across the state, and analysis is run regularly to remove invalid results.
Eh, it depends on the type of non-independence. There are some things you can catch with diagnostic tests (e.g., autocorrelated errors in a time series study). But yeah in general it can be a hard thing to catch.

A follow-up question about bootstrapping. OK, so you generate, say, 1,000 replications of your data set and run a regression on each. So what is your slope for each IV? The average of the 1,000 slopes? (Given how means are handled in bootstrapping, I assume so, but I have not found anyone yet who says so.) I am not sure you would want to do this, because the point estimate is correct even without normality, given a reasonably sized sample.
I think you'd usually just use the point estimate. (Others want to chime in here?) The point estimate and the mean of the replications should be very similar in any case.

More importantly, how do you conduct the test of statistical significance? You obviously have 1,000 p-values - but how do you go from that to whether the slope is statistically significant overall?

I guess you could calculate the CI from the bootstrap around the averaged slope (or around the original model you run without a bootstrap) and see if it contains 0. Somehow that does not seem right.
Personally I'd lean toward using confidence intervals anyway, so I would use the latter strategy, and I think that would be quite standard. But I think it is possible to obtain p-values via bootstrapping in a couple of different ways - e.g., see this post.

14. ## Re: Dealing with violations of the assumptions.

Originally Posted by CowboyBear
The multicollinearity "assumption" is really just that there is no perfect collinearity between variables.
Technically that isn't even an assumption. It's a computational annoyance, if anything. Technically we can fit models where we have perfect collinearity. However, if you want to do any sort of tests or confidence intervals, then you can only talk about "estimable" quantities (which are essentially the corresponding quantities you could talk about in the model without the perfect collinearity).

I think you'd usually just use the point estimate. (Others want to chime in here?) The point estimate and the mean of the replications should be very similar in any case.
Pretty much. There are some "corrected" estimates that basically look at the bias in your estimate (estimated by the difference between your estimate and the mean of your bootstrap replicates) and adjust the estimate accordingly. But for somebody just beginning with the bootstrap, I would just use the original estimate.

15. ## The Following User Says Thank You to Dason For This Useful Post:

CowboyBear (04-27-2015)

16. ## Re: Dealing with violations of the assumptions.

Originally Posted by CowboyBear
clicking a button in SPSS
Well, I'll be ****ed - you actually can do bootstrapping in SPSS for multiple AND logistic regression.

what is the world coming to...

17. ## Re: Dealing with violations of the assumptions.

Yeah, you can, Spunky, and you can in Stata as well (and R has a variety of packages that do this). SAS is really the odd software out in this case.

When I ran the averages of my bootstrap, they were essentially identical to the estimates from the original, non-bootstrapped regression. In a way that makes sense because, as wiser people than I have noted, a bootstrap essentially treats the original sample as the population and draws a series of samples from it. So if the estimator is unbiased, the bootstrap average should match the original estimate.

Multicollinearity may not be part of Gauss-Markov, but it is commonly presented as an assumption - for example, in Fox's Regression Diagnostics and in Berry's Understanding Regression Assumptions. Not knowing the slope of a variable, which is commonly the whole point of the regression, is annoying. With perfect collinearity, some texts say (and I have experienced this in SAS) you can't estimate coefficients at all. But you won't usually encounter that with careful choice of variables.

CWB, serial correlation can be detected, I know, by various tests such as Durbin-Watson, another more general test Durbin created whose name I can't remember, and Box-Ljung, among others. I don't tend to think of that as a lack of independence, although it is. I think of it as a whole different assumption, since it normally only applies to time series.
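The Durbin-Watson statistic is simple enough to compute straight from its definition; this sketch (simulated residuals) shows it sitting near 2 for independent errors and well below 2 for positively autocorrelated AR(1) errors:

```python
import numpy as np

rng = np.random.default_rng(6)

# Durbin-Watson statistic, computed from its definition:
# DW = sum (e_t - e_{t-1})^2 / sum e_t^2.
# Values near 2 suggest no first-order autocorrelation; values near 0
# suggest strong positive autocorrelation.
def durbin_watson(resid):
    return np.sum(np.diff(resid)**2) / np.sum(resid**2)

n = 500
white = rng.normal(0, 1, n)          # independent residuals

ar1 = np.empty(n)                    # AR(1) residuals with phi = 0.8
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal(0, 1)

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(dw_white, dw_ar1)
```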

Stationarity is another assumption that I did not bring up because it only applies to time series. I know the tests for that; they all have power problems.