Thread: Dealing with violations of the assumptions.

  1. #16 (Injektilo)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by noetsi View Post
    When I conduct analysis I am rarely interested in what the slope of a variable is. I want to know, or those I run the data for want to know, whether the results are significant or not. The literature I have seen in the social sciences stresses the same: this variable was significant or was not. So the test, not the effect size, is the driving force. Which variable is relatively more important is critical to me, but slopes don't tell you that (as Dason has reminded me on more than one occasion).
    The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
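    As an aside (not from the original exchange), a minimal sketch of that distinction, assuming Python with numpy and statsmodels; the data and variable names are invented. A standardized beta is just the raw slope rescaled by sd(x)/sd(y):

    Code:
    # Hypothetical sketch: raw slopes vs. standardized betas.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 300
    x1 = rng.normal(size=n)                 # predictor on a small scale
    x2 = 10 * rng.normal(size=n)            # predictor on a much larger scale
    y = 2.0 * x1 + 0.3 * x2 + rng.normal(size=n)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    raw = fit.params[1:]                                        # raw slopes
    std_betas = raw * np.array([x1.std(), x2.std()]) / y.std()  # standardized betas
    print(raw)        # raw slopes reflect each variable's measurement scale
    print(std_betas)  # standardized betas are comparable across variables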

  2. #17 (Miner)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by noetsi View Post
    I am not sure what the equivalent test would be for the data I run (which is not an industrial process).

    I am not sure what you mean by plotting the residuals in time sequence or against fits.
    These are residual diagnostic plots from Minitab. In addition to the normality test, residuals are plotted against fits to check for heteroskedasticity or curvature. And, if the data are taken in time sequence, the time order plot is examined for shifts and trends that might indicate a latent variable. If the data are not in time sequence, this last graph is ignored.

    This site has a good discussion on the residuals vs. fits analysis.

    This site discusses both residuals vs. fits and residuals vs. time order.
    [Attached images: Minitab residual diagnostic plots]
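    As an illustration (not part of Miner's post), a minimal sketch of the same two diagnostic plots outside Minitab, assuming Python with statsmodels and matplotlib; the data are invented:

    Code:
    # Hypothetical sketch: residuals vs. fitted values and residuals vs. observation order.
    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 1.5 * x + rng.normal(size=200)                 # made-up data
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(fit.fittedvalues, fit.resid)           # look for funnels (heteroskedasticity) or curvature
    ax1.axhline(0, color="grey")
    ax1.set(xlabel="Fitted values", ylabel="Residuals")
    ax2.plot(fit.resid, marker="o", linestyle="none")  # look for shifts or trends over time
    ax2.axhline(0, color="grey")
    ax2.set(xlabel="Observation order", ylabel="Residuals")
    plt.show()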


  3. #18 (noetsi)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by Injektilo View Post
    The test just gives you a yes/no answer. If you want to look at relative impact of different variables, you do need to look at the standardized betas.
    There is some debate about whether standardized betas are adequate measures of relative impact even in linear regression. But the real problem occurs in logistic regression, where there are no agreed-on standardized betas and where the ones that have been proposed generate significantly different results in some cases. Unfortunately for me, much of my analysis is with logistic regression, even though this thread is not about that.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  4. #19 (noetsi)

    Re: Dealing with violations of the assumptions.

    Thanks, Miner. I knew about plotting residuals against fitted values. I had not heard of the time series variant (probably because I mainly work with ESM, and assumptions are largely ignored in that form of time series).
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  5. #20 (noetsi)

    Re: Dealing with violations of the assumptions.

    Here is a point that has long puzzled me. Does this mean that if we have samples of, say, 300 (none of mine would ever be less than that) it does not matter if your data are non-normal for regression? I would say I had read this in many sources, which I have, but then we would have to define "many".

    In the field of statistics, there are lots of methods that are practically guaranteed to work well if the data are approximately normally distributed and if all we are interested in are linear combinations of these normally distributed variables. In fact, if our sample sizes are large enough we can use the central limit theorem which tells us that we would expect means to converge on normality so we do not even need to have samples from a normal distribution as N increases. So if we have two groups of say 100 subjects each and we are interested in mean change from baseline of a variable then we have no need to worry and can apply standard statistical methods with only the most basic of checks for statistical validity.
    http://www.lexjansen.com/phuse/2005/pk/pk02.pdf

    I note this only applies to means. I assume it also applies to slopes, but I am not sure whether it applies to, say, standard errors.
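    As an aside (not from the linked paper), a quick simulation sketch of the point about means, assuming Python with numpy: even with strongly right-skewed individual-level data, sample means at n = 100 are already nearly symmetric.

    Code:
    # Hypothetical CLT sketch: sampling distribution of the mean from skewed data.
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 100, 5000
    # Exponential data are strongly right-skewed at the individual level (skewness = 2).
    means = np.array([rng.exponential(scale=1.0, size=n).mean() for _ in range(reps)])
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print("mean of sample means:", means.mean())   # close to the true mean of 1.0
    print("skewness of sample means:", skew)       # far below the raw data's skewness of 2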
    Last edited by noetsi; 04-24-2015 at 02:39 PM.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  6. #21 (hlsmith)

    Re: Dealing with violations of the assumptions.

    I intuitively feel that if the means are converging then the SEs are converging.

    You should have prefaced that this is with regard to bootstrapping; at least, that is what the link is to. This ends up being a sampling question. You get this convergence with repeated sampling even in non-normal data. Though, if the variable in the population is non-normal, making the sample size bigger will not have the same normalizing effect that resampling does.
    Stop cowardice, ban guns!

  7. #22 (noetsi)

    Re: Dealing with violations of the assumptions.

    If I understand the author correctly, they are commenting on ANOVA there, although the link in general is about bootstrapping. It is tied not to bootstrapping but to the central limit theorem. She uses bootstrapping only when the CLT does not apply.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  8. #23 (CowboyBear)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by noetsi View Post
    Here is a point that has long puzzled me. Does this mean that if we have samples of, say, 300 (none of mine would ever be less than that) it does not matter if your data are non-normal for regression?
    If the errors are independent then as the sample size grows larger, the sampling distribution of the coefficients (e.g., the slopes) will converge to a normal distribution regardless of whether or not the errors are normally distributed. If the sampling distribution is normal, then confidence intervals and significance tests will be trustworthy. It's hard to put a number on exactly how big a sample is large enough that you can be sure that the sampling distribution will be sufficiently approximated by a normal distribution, because it does depend on stuff like what the distribution of the errors actually is. But with a sample size of 300 the error distribution would have to be something truly evil and pathological for it to make the slightest bit of difference. If I was in your shoes I'd worry much more about other assumptions like whether the errors have conditional means of zero, and whether they're independently and identically distributed. Breaches of these assumptions have much more serious consequences.
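    As an illustration (not part of CowboyBear's post), a simulation sketch assuming Python with numpy and statsmodels, with everything invented: with skewed errors and n = 300, the slope estimates come out centred on the true value and their distribution is close to normal.

    Code:
    # Hypothetical sketch: slope sampling distribution under non-normal errors.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n, reps, true_slope = 300, 2000, 0.5
    slopes = np.empty(reps)
    for i in range(reps):
        x = rng.normal(size=n)
        e = rng.exponential(size=n) - 1.0            # skewed errors with mean zero
        y = true_slope * x + e
        slopes[i] = sm.OLS(y, sm.add_constant(x)).fit().params[1]
    print("mean of slopes:", slopes.mean())          # close to 0.5
    print("SD of slopes:", slopes.std())             # close to 1/sqrt(n) here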


  9. #24 (noetsi)

    Re: Dealing with violations of the assumptions.

    I want to make sure I don't misunderstand this, CWB. Given sample sizes of 300 or more (and in honesty mine commonly number in the thousands), a non-normal distribution is not going to affect either the CI or the test of significance greatly. If so, well, I am still learning bootstrapping, so some good came out of this.

    The assumptions I have focused on are multicollinearity, linearity, leverage (although with the size of the samples I have, it is unlikely any small set of points is going to move the regression line much), and equal error variance. Independence of observations is important as well, but my understanding is that there is no diagnostic that will catch it - you have to deal with this in your design, and I have no knowledge of or control over data collection at all. I can't imagine my errors not being independent, but there is no way I can think of to verify that. The data are being collected by hundreds of different counselors across the state, and analysis is run regularly to remove invalid results.
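    As an illustration (not noetsi's actual workflow), a minimal sketch of two of those checks, assuming Python with statsmodels and invented data: VIFs for multicollinearity and hat values for leverage; the residuals-vs-fits plot shown earlier covers equal error variance.

    Code:
    # Hypothetical sketch: VIFs for multicollinearity and hat values for leverage.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(4)
    n = 1000
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)               # deliberately correlated predictors
    y = x1 + x2 + rng.normal(size=n)
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    leverage = fit.get_influence().hat_matrix_diag   # large hat values flag high-leverage points
    print("VIFs:", vifs)                             # common rule of thumb: worry above ~10
    print("max leverage:", leverage.max(), "vs. 2*p/n =", 2 * X.shape[1] / n)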
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  10. #25 (noetsi)

    Re: Dealing with violations of the assumptions.

    A follow-up question about bootstrapping. OK, so you generate, say, 1,000 replications of your data set and you run the regression on each. So what are your slopes for each IV? The average of the 1,000 slopes? (Given how means are handled in bootstrapping I assume so, but I have not found anyone yet who says so.) I am not sure you would want to do this, because the point estimate is correct even without normality given a reasonably sized sample.

    More importantly, how do you conduct the test of statistical significance? You obviously have 1,000 p-values - but how do you go from that to whether the slope is statistically significant overall?

    I guess you could calculate the CI from the bootstrap around the averaged slope (or around the original model you run without a bootstrap) and see if it contains 0. Somehow that does not seem right.
    Last edited by noetsi; 04-26-2015 at 02:01 PM.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  11. #26 (CowboyBear)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by noetsi View Post
    I want to make sure I don't misunderstand this, CWB. Given sample sizes of 300 or more (and in honesty mine commonly number in the thousands), a non-normal distribution is not going to affect either the CI or the test of significance greatly. If so, well, I am still learning bootstrapping, so some good came out of this.
    Yep! And that's awesome

    The assumptions I have focused on are multicollinearity, linearity, leverage (although with the size of the samples I have, it is unlikely any small set of points is going to move the regression line much), and equal error variance.
    The multicollinearity "assumption" is really just that there is no perfect collinearity between variables. Less severe multicollinearity isn't so much an assumption breach (if the other assumptions are met, the estimates remain consistent, unbiased and efficient, and significance tests and confidence intervals are trustworthy). It's just that multicollinearity increases your standard errors (they're not wrong, they're just larger than they would be if the multicollinearity weren't present). I'd agree with not worrying about leverage with such huge sample sizes, and again that isn't so much a distributional assumption per se as a general potential problem.

    Independence of observations is important as well, but my understanding is that there is no diagnostic that will catch it - you have to deal with this in your design, and I have no knowledge of or control over data collection at all. I can't imagine my errors not being independent, but there is no way I can think of to verify that. The data are being collected by hundreds of different counselors across the state, and analysis is run regularly to remove invalid results.
    Eh, it depends on the type of non-independence. There are some things you can catch with diagnostic tests (e.g., autocorrelated errors in a time series study). But yeah in general it can be a hard thing to catch.

    A follow-up question about bootstrapping. OK, so you generate, say, 1,000 replications of your data set and you run the regression on each. So what are your slopes for each IV? The average of the 1,000 slopes? (Given how means are handled in bootstrapping I assume so, but I have not found anyone yet who says so.) I am not sure you would want to do this, because the point estimate is correct even without normality given a reasonably sized sample.
    I think you'd usually just use the point estimate. (Others want to chime in here?) The point estimate and the mean of the replications should be very similar in any case.

    More importantly, how do you conduct the test of statistical significance? You obviously have 1,000 p-values - but how do you go from that to whether the slope is statistically significant overall?

    I guess you could calculate the CI from the bootstrap around the averaged slope (or around the original model you run without a bootstrap) and see if it contains 0. Somehow that does not seem right.
    Personally I'd edge toward using confidence intervals anyway so would use the latter strategy, and I think that'd be quite standard. But I think it is possible to obtain p values via bootstrapping in a couple different ways - e.g., see this post.
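    As an illustration (not from the original posts), a minimal sketch of that confidence-interval approach, assuming Python with numpy and statsmodels and invented data: resample rows with replacement, refit, take the 2.5th and 97.5th percentiles of the slope, and check whether the interval excludes 0.

    Code:
    # Hypothetical sketch: percentile-bootstrap CI for a single regression slope.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, B = 500, 1000
    x = rng.normal(size=n)
    y = 0.4 * x + rng.exponential(size=n) - 1.0      # made-up data with skewed errors
    X = sm.add_constant(x)
    point = sm.OLS(y, X).fit().params[1]             # the usual point estimate

    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)             # resample rows with replacement
        boot[b] = sm.OLS(y[idx], X[idx]).fit().params[1]

    lo, hi = np.percentile(boot, [2.5, 97.5])
    print("point estimate:", point)
    print("95% percentile CI:", (lo, hi), "excludes 0:", lo > 0 or hi < 0)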

  12. #27 (Dason)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by CowboyBear View Post
    The multicollinearity "assumption" is really just that there is no perfect collinearity between variables.
    Technically that isn't even an assumption. It's a computational annoyance if anything. Technically we can fit models where we have perfect collinearity. However, if you want to do any sort of tests or confidence intervals then you can only talk about "estimable" quantities (which are essentially the corresponding quantities you could talk about in the model without the perfect collinearity).

    I think you'd usually just use the point estimate. (Others want to chime in here?) The point estimate and the mean of the replications should be very similar in any case.
    Pretty much. There are some "corrected" estimates which basically look at the bias in your estimate (estimated by the difference between your estimate and the mean of your bootstrap replicates) and modify the estimate using this. But for somebody just beginning with bootstraps I would just say use the original estimate.
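    As a toy illustration of that correction (the numbers below are invented): the bias is estimated as the mean of the bootstrap replicates minus the original estimate, and then subtracted off.

    Code:
    # Hypothetical sketch of the simple bootstrap bias correction described above.
    import numpy as np

    point = 0.41                                       # original estimate (made-up number)
    boot = np.array([0.40, 0.43, 0.39, 0.44, 0.42])    # bootstrap replicates (made up)
    bias = boot.mean() - point                         # estimated bias
    corrected = point - bias                           # equivalently: 2*point - boot.mean()
    print("bias-corrected estimate:", corrected)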
    I don't have emotions and sometimes that makes me very sad.


  13. #28 (spunky)

    Re: Dealing with violations of the assumptions.

    Quote Originally Posted by CowboyBear View Post
    clicking a button in SPSS
    Well, I'll be ****ed - you actually can do bootstrapping in SPSS for both multiple AND logistic regression.

    What is the world coming to...
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  14. #29 (noetsi)

    Re: Dealing with violations of the assumptions.


    Yeah, you can, spunky, and you can in Stata as well (and R has a variety of packages to do this). SAS is really the odd software out in this case.

    When I ran the averages of my bootstrap replicates they were essentially identical to the estimates from the original, non-bootstrapped regression. In a way that makes sense because, as wiser people than I have noted, a bootstrap is essentially a sample (or series of samples) drawn from the original data treated as the population. So if the sample estimate is unbiased it should be the same as the population value.

    Multicollinearity may not be part of the Gauss-Markov assumptions, but it is commonly presented as an assumption, for example in Fox's Regression Diagnostics and in Berry's Understanding Regression Assumptions. Not knowing the slope of a variable, which is commonly the whole point of the regression, is annoying. With perfect collinearity, some texts say (and I have experienced this in SAS), you can't estimate the coefficients at all. But you won't usually encounter that with a careful choice of variables.

    CWB, serial correlation can be detected, I know, by various tests such as Durbin-Watson, another, more general test Durbin created whose name I can't remember, and Box-Ljung, among others. I don't tend to think of that as a lack of independence, although it is. I think of it as a whole different assumption, since it normally only applies to time series.

    Stationarity is another assumption that I did not bring up because it only applies to time series. I know the tests for that; they all have power problems.
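    As an illustration (not from the original posts), a sketch of how those tests are commonly run, assuming Python with statsmodels and an invented series: Durbin-Watson and Ljung-Box on the regression residuals for serial correlation, and an augmented Dickey-Fuller test on the residuals for stationarity.

    Code:
    # Hypothetical sketch: Durbin-Watson, Ljung-Box, and ADF tests on made-up data.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import acorr_ljungbox
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(6)
    T = 200
    e = np.zeros(T)
    for i in range(1, T):                            # AR(1) errors -> serial correlation
        e[i] = 0.6 * e[i - 1] + rng.normal()
    t = np.arange(T)
    y = 0.05 * t + e
    fit = sm.OLS(y, sm.add_constant(t)).fit()

    print("Durbin-Watson:", durbin_watson(fit.resid))           # well below 2 -> positive autocorrelation
    print(acorr_ljungbox(fit.resid, lags=[10]))                 # small p-values -> autocorrelated residuals
    print("ADF p-value on residuals:", adfuller(fit.resid)[1])  # small p rejects a unit root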
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
