# Before making a regression

#### luchins

##### Member
Hello, I want to ask: before running a regression, should we understand which distribution the data follow?

Let's take an example: if the data have a Poisson distribution, should I run a Poisson regression?

And if the data follow a Gaussian distribution, should one use linear regression?

I don't understand when to use linear regression and when to use Poisson regression. Are there any real use-case examples?

Here is an example: in this post the author uses Poisson regression to build a model based on the average goals scored by two football teams, home and away.

https://dashee87.github.io/data science/football/r/predicting-football-results-with-statistical-modelling/

I don't get why he used Poisson regression instead of linear regression. Is it because the data followed a Poisson distribution in the first place?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
In practice, the choice between them usually comes down to the type of data you have. Continuous variables, e.g., weight, get modeled with a linear model, and count data get modeled with a Poisson model. Goals would be considered a count: how many goals, one, two, and so on. However, with a large sample size and larger average counts (many say around 8 or larger), the Poisson begins to approximate a normal distribution, and a linear model can be used on those data. Also, to better frame this in your head, notice that in your soccer example the count of goals is bounded on the left side by "0". That would not work well for a linear model, but if the average were closer to 20 and the data not too skewed, then you could use a linear model on it.
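To put rough numbers on the "around 8 or larger" rule of thumb, here is a minimal sketch. It assumes the usual normal approximation to a Poisson (same mean, variance equal to the mean) and asks how much of that normal curve falls below zero, which is impossible for a count. The function name `normal_mass_below_zero` and the mean values tried are just illustrative choices.

```python
import math

def normal_mass_below_zero(mean_count):
    """P(Z < 0) for Z ~ Normal(mean=mean_count, variance=mean_count).

    This is the share of the matching normal curve that sits on
    impossible negative counts.
    """
    z = (0.0 - mean_count) / math.sqrt(mean_count)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Illustrative means: typical goals per match, the rule of thumb, a large count.
for mean_count in (1.5, 8, 20):
    share = normal_mass_below_zero(mean_count)
    print(f"mean={mean_count}: share of normal curve below zero = {share:.6f}")
```

With a mean near typical goal counts, a noticeable share of the normal curve lies below zero; by a mean of 8 it is already a fraction of a percent, and by 20 it is negligible.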

You may already know this, but both are considered linear models, since both model equations use a linear (summed) combination of terms. By default, "linear model" refers to least squares models, but many other models, such as logistic regression, are also technically linear in this sense.
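As a rough illustration of that point, the sketch below fits both models to the same simulated counts using only NumPy. Everything about the data is an assumption made up for the example (the predictor `x`, the coefficients in `true_beta`, the sample size), and the Poisson fit uses plain Newton iterations rather than any particular library routine. Both models use the same linear combination `X @ beta`; least squares applies it to the mean directly, while Poisson regression applies it to the log of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one predictor x and a low-mean count outcome y.
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
true_beta = np.array([0.3, 0.8])       # assumed coefficients on the log scale
y = rng.poisson(np.exp(X @ true_beta))

# Least squares: E[y] = X @ beta (identity link).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Poisson regression: log E[y] = X @ beta (log link), fitted by Newton's method.
beta_pois = np.zeros(X.shape[1])
for _ in range(50):
    mu = np.exp(X @ beta_pois)          # current fitted means
    grad = X.T @ (y - mu)               # gradient of the Poisson log-likelihood
    info = X.T @ (X * mu[:, None])      # Fisher information matrix
    step = np.linalg.solve(info, grad)
    beta_pois += step
    if np.max(np.abs(step)) < 1e-10:
        break

pred_ols = X @ beta_ols                 # can in principle go below zero
pred_pois = np.exp(X @ beta_pois)       # always positive because of exp()

print("least squares coefficients:", beta_ols)
print("Poisson coefficients (log scale):", beta_pois)
```

The exponential in the Poisson model is what keeps its predictions positive, which ties back to the point about counts being bounded at zero.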

#### luchins

##### Member
"However, with a large sample size and larger average counts,"

What do you mean by "larger average counts"? Sorry, I'm not a native English speaker. Can you give an example?

Also:

"many say around 8 or larger, the Poisson begins to approximate a normal distribution, and a linear model can be used on those data"

Around 8 data points? So for 8 matches he could have used least squares instead of Poisson? And for fewer than 8 matches a Poisson is the better choice?

"Also, to better frame this in your head, notice that in your soccer example the count of goals is bounded on the left side by '0'. That would not work well for a linear model, but if the average were closer to 20 and the data not too skewed, then you could use a linear model on it."

I don't understand: why is the count of goals bounded on the left side by "0", and what does that mean?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
I meant mean values of 8. If you were counting goals and the mean was, say, 1 or 2, the mode would be around there, with a long tail on the right side, since there is no upper limit on how many goals could be scored. However, nobody can score negative goals, so the count is bound (unable) to go lower than zero. This creates asymmetry in the distribution. So by eight, I was referring to the arithmetic average of goals.
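If it helps to see the asymmetry in numbers, here is a small sketch using only Python's standard library. It computes Poisson probabilities for a low mean and for a mean of 8, then reports the most likely count and the skewness (0 would mean perfectly symmetric). The helper names `poisson_pmf` and `shape_summary`, and the cutoff `max_k`, are just choices made for this illustration.

```python
import math

def poisson_pmf(k, mean_count):
    """P(X = k) for a Poisson with the given mean, computed on the log scale."""
    return math.exp(k * math.log(mean_count) - mean_count - math.lgamma(k + 1))

def shape_summary(mean_count, max_k=200):
    """Return (mode, skewness) from the pmf on 0..max_k.

    The tail beyond max_k is ignored, which is harmless for small means.
    """
    ks = range(max_k + 1)
    probs = [poisson_pmf(k, mean_count) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    skew = sum((k - mean) ** 3 * p for k, p in zip(ks, probs)) / var ** 1.5
    mode = max(ks, key=lambda k: probs[k])
    return mode, skew

for mean_count in (1.5, 8):
    mode, skew = shape_summary(mean_count)
    print(f"mean={mean_count}: most likely count={mode}, skewness={skew:.3f}")
```

For a Poisson the skewness equals 1 divided by the square root of the mean, so it shrinks as the average count grows; that is the sense in which the distribution looks more normal at higher means.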