
Thread: Predicting accidents

  1. #1
    TimA

    Predicting accidents




    I have data on the number of hazard inspections completed in a workplace. I want to find out whether the number of inspections completed is related to, and can predict, the number of accidents that occur.

    Beyond a Pearson r correlation, what other statistic(s) could I use to describe this relationship or make predictions? A t-test? ANOVA? I also have other data that may or may not be related to the accident data, so I would like to include it in any measure of correlation or other statistic.

    Thanks in advance.

  2. #2
    staassis

    This is a classic case for a generalized linear model (GLM). The exact specification (distribution & link function) of the model must be chosen based on a well-established model selection criterion, like AIC or cross-validation. However, a good starting point would be trying Poisson regression.
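For reference, the suggested starting point can be written out as below. This is a sketch of the usual Poisson regression with a log link, not something spelled out in the post; the symbols $y_i$ (accident count), $x_i$ (inspections completed), $\beta_0$, $\beta_1$ and $k$ (number of fitted parameters) are notation assumed here.

```latex
\begin{align}
  y_i \mid x_i &\sim \mathrm{Poisson}(\mu_i), \\
  \log \mu_i &= \beta_0 + \beta_1 x_i, \\
  \mathrm{AIC} &= 2k - 2\,\ell(\hat\beta),
  \qquad
  \ell(\beta) = \sum_i \bigl( y_i \log \mu_i - \mu_i - \log y_i! \bigr).
\end{align}
```

Among candidate specifications fitted to the same data, the one with the lowest AIC would be preferred under that criterion.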

  3. The Following User Says Thank You to staassis For This Useful Post:

    TimA (04-26-2014)

  4. #3
    TimA

    Thanks. I've heard of Poisson regression being used in accident analysis, but I haven't had much experience with it. I'll look into it some more. Would multiple regression be applicable in this case as well?

  5. #4
    staassis

    Quote Originally Posted by TimA View Post
    Would multiple regression be applicable in this case as well?
    No, multiple linear regression would not be applicable.

  6. The Following User Says Thank You to staassis For This Useful Post:

    TimA (04-27-2014)

  7. #5
    TimA

    Thanks, I'll get my head around Poisson regression.

  8. #6
    staassis

    Good luck... and do not forget to try other generalized linear models as well.

  9. #7
    TimA

    Thanks. Any recommendations on where to upskill in this area? Basically my stats skills stop at 3rd-year Psych (t-tests, ANOVA, etc.), so I'm hoping it's not too big a jump...

  10. #8
    noetsi

    I am trying to figure out why linear regression could not be used to predict the number of accidents. It is used for similar analysis all the time.... The number of accidents is certainly an interval variable.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  11. #9
    Dason

    Quote Originally Posted by noetsi View Post
    I am trying to figure out why linear regression could not be used to predict the number of accidents. It is used for similar analysis all the time.... The number of accidents is certainly an interval variable.
    It's a count. Simple linear regression assumes the error term is normally distributed. Clearly a count variable won't give us a normal distribution, so a generalized linear model, where the response distribution can be something like Poisson or negative binomial, would be more appropriate.
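One practical symptom of the mismatch can be sketched as follows. The data are simulated and entirely hypothetical (the inspection levels, the coefficients 2.0 and -0.3, and the 50 workplaces per level are all assumptions made for illustration); the point is only that a straight line fitted to small counts can predict a negative number of accidents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 workplaces at each inspection level 0..15,
# with an expected accident count that falls as inspections rise.
inspections = np.repeat(np.arange(16), 50)
expected = np.exp(2.0 - 0.3 * inspections)
accidents = rng.poisson(expected)

# Ordinary least squares line: accidents ~ intercept + slope * inspections.
slope, intercept = np.polyfit(inspections, accidents, deg=1)
ols_at_15 = intercept + slope * 15

print("smallest observed count:", accidents.min())
print("OLS prediction at 15 inspections:", round(float(ols_at_15), 2))
print("true expected count at 15 inspections:", round(float(np.exp(2.0 - 0.3 * 15)), 2))
```

The observed counts never drop below zero, but the fitted line does at the high end of the inspection range, which a Poisson or negative binomial model with a log link cannot do.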
    I don't have emotions and sometimes that makes me very sad.

  12. #10
    noetsi

    That's fascinating to know, because linear regression is used for counts (say, the number of violent crimes, sales, or economic transactions) all the time, including in the journals I have seen. Virtually all economic and social science analysis uses linear regression (or time series), and virtually all economic/social science data is a count of something.

    I always assumed you could simply look at the residuals to see if they are normally distributed.

  13. #11
    rogojel

    hi,
    I think the difference could be in the range of possible values. For accidents you might have numbers between 0 and 10, say, so the Poisson would be a better model. If I had hundreds of transactions, or other macroeconomic data with counts but a large range, a continuous model would be entirely satisfactory.
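A rough way to put numbers on this intuition: approximate a Poisson count by a normal distribution with the same mean and variance, and ask how much probability the approximation puts on impossible negative values. The two mean counts used below (2 and 400) are arbitrary illustrative choices, not values from the thread.

```python
import math

def normal_mass_below_zero(mean_count):
    """P(Z < 0) for a normal with mean and variance both equal to mean_count."""
    z = (0.0 - mean_count) / math.sqrt(mean_count)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_skewness(mean_count):
    """Skewness of a Poisson distribution, 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_count)

# Illustrative mean counts: a handful of accidents vs. hundreds of transactions.
for mean_count in (2, 400):
    print(mean_count, normal_mass_below_zero(mean_count), poisson_skewness(mean_count))
```

With a mean of 2, the normal approximation puts several percent of its mass below zero and the count distribution is clearly skewed; with a mean of 400 both problems are negligible, which is consistent with a continuous model being adequate for large counts.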

    Am I seeing this wrong?

    regards
    rogojel

  14. #12
    noetsi

    According to what Dason said in chat, a key question is how many cases you have. Similarly, much of what looks like a count to me, such as dollars spent, is apparently not a count in this context.

  15. #13
    rogojel

    The number of cases is the same as the range of possible values I guess.

  16. #14
    GretaGarbo

    Quote Originally Posted by Dason View Post
    Simple linear regression assumes the error term is normally distributed.
    Yes and no.

    Linear regression customarily assumes that the random errors have mean zero, are uncorrelated, and have constant variance. Then the OLS estimates will be “good” according to the Gauss-Markov theorem (“best linear unbiased estimator”). But if you want to test whether a parameter is different from zero, you need the additional assumption that the random errors are also normally distributed. Is that included in what is meant by a “linear regression model”?

    Maybe this is just playing with words.

    I remember I had a discussion with Noetsi about the expression “OLS regression”. I said that it was incorrect nomenclature. All sciences are trying to define their concepts. Like what is “force”, “energy”, “a planet”, “a cell” and so on. There was a vote on nomenclature among astronomers a few years ago and one planet, Pluto, disappeared!

    I meant that “linear regression” is a model description and that OLS, or rather least squares, is an estimation method. I thought that it was contradictory to say “the OLS regression model was estimated with weighted least squares”. Noetsi claimed that the expression was used by a prominent author, “Fox” I believe his name was. I doubted that! So I went to the library and picked up the book and, to my surprise, Fox used the expression “OLS regression”. But by then a couple of days had passed, and Noetsi had written so many other posts that I simply could not find that post to respond to it. So I have taken the opportunity here.

    As I thought about it I changed my mind a little bit. I thought that it was OK to say “Poisson regression”, because that describes the distribution used, or “logit regression”, because that describes the link function used. Maybe it does not matter what it is called....

    In a Poisson model the variance is equal to the mean. So increasing the mean will increase the variance. If the mean is described by “beta*x” (that is, with an identity link; the more usual log link has mean exp(beta*x)), then as x increases the mean of the dependent variable increases (for positive beta), and therefore so does the variance. The variance will not be constant as in standard linear regression. Estimating a Poisson regression model with least squares will not give bad estimates. They will still be unbiased but not as efficiently estimated as they could have been. (This holds when the mean really is linear in x.)
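The "unbiased but less efficient" point can be checked with a small simulation. This is only a sketch under stated assumptions: the mean is taken to be linear in x (0.5 + 3x, made-up coefficients), and the weighted fit uses the true means as weights, which would not be known in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 10 observations at each x = 0, 1, ..., 10.
x = np.repeat(np.arange(11.0), 10)
X = np.column_stack([np.ones_like(x), x])
true_mean = 0.5 + 3.0 * x          # assumed identity-link mean; variance equals this mean
true_weights = 1.0 / true_mean     # assumption: true variances known, for illustration only

def weighted_slope(y, weights):
    """Slope from (weighted) least squares of y on an intercept and x."""
    XtW = X.T * weights
    return np.linalg.solve(XtW @ X, XtW @ y)[1]

ols_slopes, wls_slopes = [], []
for _ in range(2000):
    y = rng.poisson(true_mean)
    ols_slopes.append(weighted_slope(y, np.ones_like(x)))   # equal weights = OLS
    wls_slopes.append(weighted_slope(y, true_weights))

print("mean OLS slope:", np.mean(ols_slopes), "variance:", np.var(ols_slopes))
print("mean WLS slope:", np.mean(wls_slopes), "variance:", np.var(wls_slopes))
```

Both sets of slope estimates centre on the true value of 3, but the unweighted ones scatter more widely around it, which is the efficiency loss described above.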

    In maximum likelihood estimation of a generalised linear model, like a “Poisson regression model”, the estimates are computed with iteratively re-weighted least squares. Once “betahat*x” is estimated, the mean for each observation is estimated too, and thus so is the variance of that observation. That gives a new, corrected weight for each observation (high variance gives a low weight). Then, with the new weights, “betahat*x” is re-estimated. Most of the time the process converges in just five or six iterations.
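The re-weighting loop described above can be sketched in a few lines. This is a minimal illustration rather than a production fitter: it assumes the identity-link form in which the mean is “beta*x” and each weight is one over the fitted mean, and the function name `poisson_irls_identity`, the simulated data and the coefficients 2 and 3 are all made up for the example.

```python
import numpy as np

def poisson_irls_identity(X, y, max_iter=50, tol=1e-10):
    """Fit E[y] = X @ beta with Poisson variance by iteratively re-weighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from the unweighted (OLS) fit
    for iteration in range(1, max_iter + 1):
        mu = np.clip(X @ beta, 1e-8, None)         # fitted means; the floor keeps weights finite
        weights = 1.0 / mu                         # variance = mean, so high variance -> low weight
        XtW = X.T * weights
        new_beta = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(new_beta - beta)) < tol:  # stop once the estimates no longer move
            return new_beta, iteration
        beta = new_beta
    return beta, max_iter

# Simulated data under the assumed model: mean = 2 + 3 * x.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=20_000)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(2.0 + 3.0 * x)

beta_hat, n_iter = poisson_irls_identity(X, y)
print("estimates:", beta_hat, "iterations:", n_iter)
```

With the more common log link the same loop applies, but the quantity being regressed and the weights change form (the working weights become proportional to the fitted mean), so this sketch should not be read as how any particular GLM routine is implemented.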

    Also, to test whether a parameter is “significant”, a test statistic derived from the distribution is needed. Therefore the distribution needs to be specified, as Dason said above about the normal distribution in “linear regression”.

    So, now I have tried to explain a Poisson regression model, which is one example of a generalised linear model. So estimating a Poisson regression model with least squares would not be so bad and calling it “a linear regression” would not be so “wrong” but maybe not completely correct.

  17. #15
    noetsi

    I suspect that the term OLS regression is a legacy, particularly in economics. For instance, Chris Brooks (a Cambridge don), in "Introductory Econometrics for Finance", uses OLS regression throughout for what is clearly linear regression. I had seen it so often that until GretaGarbo raised this point it never occurred to me that it was an issue (I have changed my own usage to linear regression since then).

    Quote Originally Posted by GretaGarbo View Post
    Estimating a Poisson regression model with least squares will not give bad estimates. They will still be unbiased but not as efficiently estimated as they could have been.
    My concern is that linear regression seems to be the primary method in the literature, and the vast majority of what I would consider counts (including nearly all economic data) is in fact analysed with either linear or logistic regression (although the latter would not be used with the type of data we are talking about here). This is commonly recommended in textbooks as well (Poisson regression almost never comes up, and GLS only rarely). It was a shock to me to find out that it might not be valid to use linear regression with count data, which is very common.

    Of course, based on chat discussions, it is not clear to me what fits the term count. To me it seems that, logically, these are non-divisible (i.e. discrete) data, and most social science and economic data that is not a percentage would meet this definition. It appears based on chat, however, that if the number of cases is very high, the concerns with using linear or logistic regression with counts are not that great. Effectively the data can be treated as "interval-like" and linear models applied. One can always run tests of normality and constant variance; it would seem that, regardless of the data type, if your data match these assumptions or come close, you can use linear regression.

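The constant-variance check mentioned in the last post can be sketched informally as below. This is not a formal test, just a crude ratio of residual variances between the upper and lower halves of the fitted values; the helper name `residual_variance_ratio`, the simulated data and both mean functions are assumptions made for illustration, and "large counts" is only one reading of what "interval-like" means here.

```python
import numpy as np

def residual_variance_ratio(x, y):
    """Variance of OLS residuals in the upper half of fitted values over the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    upper = fitted > np.median(fitted)
    return residuals[upper].var() / residuals[~upper].var()

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=20_000)

# Hypothetical count outcomes with the same slope but very different sizes.
small_counts = rng.poisson(2.0 + 3.0 * x)       # means roughly 5 to 32
large_counts = rng.poisson(1000.0 + 3.0 * x)    # means roughly 1003 to 1030

ratio_small = residual_variance_ratio(x, small_counts)
ratio_large = residual_variance_ratio(x, large_counts)
print("small counts:", ratio_small)
print("large counts:", ratio_large)
```

For the small counts the residual spread roughly doubles across the range, so the constant-variance assumption visibly fails; for the large counts the ratio stays close to 1 and a linear model's assumptions are much closer to being met.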