# Distribution of Data

#### jazzfish

##### New Member
Hello forum members,

I wonder if someone can kindly help. I have a dataset which I've uploaded and I'm trying to work out a sensible distribution. It represents the number of throws a darts player needs before he can aim for a double. The minimum possible is 8 and the maximum possible is theoretically infinite although good players would very rarely go beyond 30 or so.

I thought a lognormal distribution might fit best but would be very grateful for a second opinion. You will see that the data peaks at certain numbers of darts, presumably because certain scores (e.g. 180) are more common than others: darts players have particular habits and scoring is not random.

Any thoughts/advice would be most welcome and appreciated.

#### Dason

What is your ultimate goal? Why are you trying to fit a distribution to this?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Can you tell us what dart game you are referencing and the general rules?

I'm not big on opening files; can you post a histogram with an overlaid kernel density?

#### jazzfish

##### New Member
Hi,

Thanks for the responses. OK maybe I should take a step back and give a bit more context.

My raw data is darts scores over the first 9 darts (or 3 visits). The maximum possible score is 501. The minimum score in the data is 75. The mean is 298.

My ultimate goal is to be able to say, given a player has a mean of x over the first 9 darts, what are the probabilities of him or her getting each score over the first 9 darts.

From there I hope to be able to calculate the probabilities for the number of darts a player would need to get within range of a double (i.e. scoring at least 461).

I'm now going to try to attach a couple of histograms of the raw data. Bear with me as I'm not a statistician and I also don't know how to embed images so a couple of hurdles to tackle!

#### jazzfish

##### New Member
And here's the data grouped into bands of 10:

View attachment 6777

Hope this helps? Apologies if it's too basic, I'm working in Excel and not a statistician so am a little limited in what I can do. Happy to purchase software though if there's anything that people recommend and if anyone has any videos/articles/books they think would assist me in my task I'd also appreciate that.

#### Dason

I'm doubting there is a simple parametric distribution that would meet your needs. You could just use your empirical distribution to make those calculations though I would think.
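A minimal sketch of what "using the empirical distribution" could look like, written in Python. The scores are made-up placeholder values (not the poster's data) and the function names are hypothetical. The idea is simply that each observed score gets a probability equal to its relative frequency, and tail probabilities are sums of those frequencies.

```python
from collections import Counter

def empirical_pmf(values):
    """Return {value: relative frequency} for the observed values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in sorted(counts.items())}

def prob_at_least(pmf, threshold):
    """P(X >= threshold) under the empirical distribution."""
    return sum(p for v, p in pmf.items() if v >= threshold)

# Made-up 9-dart scores, purely for illustration (not real data).
scores = [180, 240, 300, 300, 320, 360, 300, 280, 461, 501]
pmf = empirical_pmf(scores)

print(pmf[300])                 # relative frequency of scoring exactly 300
print(prob_at_least(pmf, 461))  # share of legs at or above 461 after 9 darts
```

With enough real observations the same two functions apply unchanged; with sparse data the estimated probabilities for rare scores will be noisy.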

#### jazzfish

##### New Member
Thanks Dason.

Silly question perhaps but how would one go about that? Happy to read up on it/watch videos etc if you could point me in the right direction or give me something to start with?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Thanks for the description, that helped.

General question, where is your data coming from? Also, it is assumed that a person only contributes one set of scores to the dataset. Thus, the scores are independent and not correlated within a person. Is this the case for your data?

#### jazzfish

##### New Member
Hi,

Yes, the data is biased in the sense that it comes from multiple players, but some players feature more than others and obviously some are better than others, so it's not really uniform.

I have quite a lot of data so there would be scope to use a sub-sample if that would be advisable.

#### GretaGarbo

##### Human
> The minimum possible is 8 and the maximum possible is theoretically infinite although good players would very rarely go beyond 30 or so.

In the first post and the attached data, the values seem to vary between 8 and 30.
But in the histogram shown later, the values seem to be around 200 to 300.

Which one is correct? (And also which sheet is correct in the attached file?)

If you take your data and do (data - 8) so that the data can be 0 or larger, then maybe a Poisson model or a negative binomial distribution could be useful.
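One rough way to choose between the two: a Poisson distribution has variance equal to its mean, while a negative binomial allows the variance to exceed the mean. The Python sketch below compares the two on the shifted counts. The dart counts are made up for illustration and `dispersion_ratio` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_ratio(darts, minimum=8):
    """Sample variance divided by sample mean of the shifted counts (darts - minimum)."""
    y = [d - minimum for d in darts]
    return variance(y) / mean(y)

# Made-up numbers of darts, for illustration only.
darts = [12, 14, 15, 13, 18, 21, 14, 16, 12, 15]
ratio = dispersion_ratio(darts)
print(round(ratio, 2))
```

A ratio close to 1 is consistent with a Poisson model; a ratio well above 1 points to overdispersion, which is the usual reason to try a negative binomial instead.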

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yeah, I might go with Dason on this one. Though I wonder if this problem has already been solved somewhere given the longevity of darts.

My slight issue is the data generating process. I would imagine if people are just trying to get the biggest score, there are certain numbers that are targeted, which have other numbers right next to them. So I am guessing those that miss 20 get 1, etc. I also would imagine that left-handed versus right-handed throwers have different strategies given English or variability tendencies. I know that if I miss it is more likely to float a certain direction. Also, you have the issue of certain scores being more probable. If you have a whole lot of data, you could just parcel out each person's data to themselves. So my prior scores function to predict my future scores!

#### jazzfish

##### New Member
Hi GretaGarbo,

The original file had the raw scores over the first 9 darts converted to an estimate of the number of darts required to get within range of a double (i.e. the number of darts required to score 461 or more), rounded up to the nearest integer. The second file stripped it back to the raw scores as I was worried that the rounding might distort things. However, using the number of darts rather than raw scores does make the data look a bit more normal. Please see below for the distribution of darts rather than scores. Perhaps this is a better way to go after all?

View attachment 6782

#### jazzfish

##### New Member
hlsmith,

I did think about your idea but for some players data will be very limited.

I wonder if a hybrid approach would work where players of a certain standard are grouped. If I did that, how should I go about adjusting for players within each group (i.e. if they are slightly better or worse than the group average)?

I guess what I'm saying is when you create your own distribution how do you calculate the distribution of players that do not adhere to the mean of that group?

Any thoughts much appreciated as always.

#### GretaGarbo

##### Human
If you take your data and do (data - 8) so that the data can be 0 or larger, then maybe a Poisson model could be useful.

y = data - 8

Let's assume that you have players in four levels, x = 1, 2, 3 and 4.

Then you can do a Poisson regression with y as the dependent variable and x as the independent variable. Each x will give you a new mu (expected value) and therefore also the distribution of y at that skill level x.
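A small Python sketch of that idea, under one explicit assumption: skill level is treated as a categorical factor, in which case the maximum-likelihood Poisson mean for each level is just the sample mean of y at that level. The rows are made up, the coding of the levels is arbitrary, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def fit_group_means(rows):
    """Poisson fit with skill as a categorical factor: each level's mu is its sample mean of y."""
    by_level = defaultdict(list)
    for y, skill in rows:
        by_level[skill].append(y)
    return {skill: sum(ys) / len(ys) for skill, ys in sorted(by_level.items())}

# Made-up (y, skill) rows, where y = darts - 8; illustration only.
rows = [(4, 1), (6, 1), (5, 1), (8, 2), (7, 2), (9, 2),
        (12, 3), (10, 3), (14, 4), (16, 4)]
mu = fit_group_means(rows)

# Probability that a level-1 player needs exactly 13 darts (y = 13 - 8 = 5).
print(poisson_pmf(13 - 8, mu[1]))
```

Evaluating `poisson_pmf` over k = 0, 1, 2, ... for a given level's mu gives the whole fitted distribution for that level.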

#### jazzfish

##### New Member
So are you saying that if two players from different groups had the same mean their distributions would potentially be different due to their different grouping? That's kind of what I'm looking for.

Would it be possible to do what you suggest in Excel or do I need to buy a software package? Happy to do so but not sure which one is best.

#### GretaGarbo

##### Human
No, given that it is "confirmed" that the distribution is Poisson and the mean is the same, then it will be the same distribution.

But if the means are different (different skill levels) it can still be Poisson distributed (although with different means).

A Poisson regression model can easily be estimated with R (which, together with RStudio, can be downloaded for free). Show us the data set (correctly specified!) and we will help you. I guess that you can estimate it in Excel, since it can be fitted by iteratively reweighted least squares, but that seems like the most difficult path.
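To give a feel for what iteratively reweighted least squares involves, here is a minimal Python sketch for a Poisson regression with a log link and a single numeric predictor, log(mu) = b0 + b1 * x. It is an illustration under those assumptions, not a description of what R does internally; the data are made up and `poisson_irls` is a hypothetical name.

```python
import math

def poisson_irls(x, y, iterations=25, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        # Closed-form weighted least squares for one predictor.
        new_b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        new_b0 = (swz - new_b1 * swx) / sw
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Made-up data: x = skill level, y = darts - 8; illustration only.
x = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
y = [4, 6, 5, 8, 7, 9, 12, 10, 14, 16]
b0, b1 = poisson_irls(x, y)
print(b0, b1)
print(math.exp(b0 + b1 * 2))  # fitted mean of y at skill level 2
```

Each pass is an ordinary weighted regression, which is why it could in principle be done in a spreadsheet, but the repeated recalculation of weights and working responses is what makes that route tedious.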

#### jazzfish

##### New Member
Ok I will do some work on the dataset so that each row has:

Player standard grouping (I'll create sensible groupings)
Number of darts minus 8, so that the minimum is 0

Would this data be correctly specified for the task in hand? Shall I upload it to here? I understand some people are a bit wary of file attachments.

#### GretaGarbo

##### Human
You will need to have two columns, one for the data and one for the skill level (or maybe you have several skill variables).

It does not matter if the data variable has minimum 8 or minimum 0. But every row must correspond to one player.

> I understand some people are a bit wary of file attachments.

Yes. Maybe you can just paste it in here in a code frame (highlight and click on #) and hide it (highlight and click on "HIDE"). Like this:

Code:
y  skill
8 3
9 5

Install R and RStudio. It will take 15 minutes.

#### jazzfish

##### New Member
I was thinking that every row would correspond to the lowest level of data (i.e. each row would represent an individual leg of darts, which would be a number between 0 and around 30). So there would be multiple entries for particular players because they have contributed more than one piece of data.

For each row there would also be the player ability variable.

Does that make sense?