# Is there a good tool for finding the best curve to fit my data?

#### Jennifer Murphy

##### Member
Excel has tools to fit several curves to data (linear, polynomial, exponential, logarithmic, etc.), but not more complicated curves like Gaussian, beta, etc. Is there a software package that can do a much more comprehensive curve-fitting analysis?

I think the data I am currently interested in is probably something like a truncated normal or beta. The data is the number of moves players take to solve a solitaire deal. The minimum number is 52. This game allows unlimited passes through the deck and backing out moves, so there is really no upper limit.

The data I have so far is too large to post here, so I uploaded it to this OneDrive folder.

https://1drv.ms/u/s!ArBLbKVM2K_HiPYZ7j0zHbVquaAXUA?e=V2b8BM

Can anyone recommend a tool that will analyze this data, determine the best fitting curve, and reveal the formula?

Thanks

#### Jennifer Murphy

##### Member
PS: My purpose in this is to be able to calculate the difference between the actual data and the theoretical curve so I can see whether the data is getting closer to the theoretical curve or not.

#### Jennifer Murphy

##### Member
I would like to add the suggestion of fitdistrplus in R.
https://cran.r-project.org/web/packages/fitdistrplus/index.html

Some years ago I saw the advice: "If you are thinking of using Excel for statistical analysis, don't do it".
Excel could possibly be used for the most elementary analysis.
I am sure that the R code is superior. This is just a little curiosity project, so if I can't do it in Excel, I'll probably just forget it.

Thanks

#### Jennifer Murphy

##### Member
I wonder if anyone here who has the fitdistrplus R code could take a look at my data and help me find a good fit.

I've attached a txt file with the data. I had it in a csv file, but it would not allow me to attach that one.

The data is from a series of games of Klondike Solitaire. The Moves column is the number of moves to complete the game. This is the X axis. There are 52 cards in a deck all of which must be moved up to the foundation piles. So the minimum number of moves is 52. In this version of the game, moves can be undone & redone, but they all count as moves, so there is no upper limit on the number of moves.

The Wins column is the number of games that were completed (won) in that number of moves by a player. This is the Y axis. A "1" in the "63" row means that one player won (completed) a game in 63 moves. A "3" in the "69" row, means that 3 players won a game in 69 moves. On the other end, a "3" in the "173" row means that it took 3 players 173 moves to win a game.
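For readers who want to work with a tally laid out this way, here is a minimal sketch of computing the mean and standard deviation directly from Moves/Wins pairs. The tally below is a toy example: only the 63, 69 and 173 rows echo the description above, and every other row is invented for illustration. The function name `tally_stats` is hypothetical.

```python
from math import sqrt

# Toy tally {moves: wins}. Only the 63, 69 and 173 rows echo the post;
# the other rows are invented for illustration and are NOT the real data.
tally = {63: 1, 69: 3, 75: 8, 80: 15, 85: 20, 90: 14, 95: 7, 100: 3, 173: 3}

def tally_stats(tally):
    """Return (n, mean, sample standard deviation) of a {value: count} tally."""
    n = sum(tally.values())
    mean = sum(x * c for x, c in tally.items()) / n
    # Each move count x contributes its squared deviation once per recorded win.
    var = sum(c * (x - mean) ** 2 for x, c in tally.items()) / (n - 1)
    return n, mean, sqrt(var)

n, mean, sd = tally_stats(tally)
print(f"games={n}  mean={mean:.2f}  sd={sd:.2f}")
```

Weighting by the Wins column gives the same result as expanding the tally into one row per game, without having to build that list.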

Here is a scatter plot of that data:

It looks like a normal distribution to me, or probably a truncated normal. I tried plotting a normal distribution using the mean (85.36853448) and std dev (7.533296572) of the data. I got this:

I tried to fit a truncated normal, but the data was all wrong.
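As a rough illustration of what a truncated normal looks like with the numbers quoted above, the sketch below rescales a normal density so that all of its mass lies at or above 52 moves. It assumes the quoted mean and standard deviation are used unchanged as the parameters, which is a simplification (a proper fit would re-estimate them); `truncated_pdf` is a hypothetical helper name.

```python
from statistics import NormalDist

MU, SIGMA = 85.36853448, 7.533296572   # mean and std dev quoted in the post
LOWER = 52                             # minimum possible number of moves

base = NormalDist(MU, SIGMA)

def truncated_pdf(x, dist=base, lower=LOWER):
    """Density of `dist` restricted to x >= lower, rescaled to integrate to 1."""
    if x < lower:
        return 0.0
    return dist.pdf(x) / (1.0 - dist.cdf(lower))

# Mass of the untruncated normal that falls below the 52-move floor.
mass_below = base.cdf(LOWER)
print(f"mass below {LOWER}: {mass_below:.2e}")
for x in (60, 85, 110):
    print(x, f"{base.pdf(x):.6f}", f"{truncated_pdf(x):.6f}")
```

With these particular parameters the floor of 52 sits about 4.4 standard deviations below the mean, so truncation removes only a tiny sliver of probability and the truncated curve is almost indistinguishable from the plain normal.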

All of this is in an Excel workbook. It won't let me attach that type of file, so I uploaded it to this OneDrive folder:

Solitaire Data in OneDrive Folder

I also looked into a Beta distribution, but couldn't get my head around it.

I would appreciate any help finding a good fit for this data.

Thanks


#### Dason

Is there a particular reason you think there should be an explicit parameterized distribution for your situation? And what do you plan on doing once you have such a formula?

#### Jennifer Murphy

##### Member
Is there a particular reason you think there should be an explicit parameterized distribution for your situation? And what do you plan on doing once you have such a formula?
Well, the data is based on the random order of the cards and the somewhat random skill of the players. The more difficult deals are also less likely, so it seems to me, based on my limited and foggy understanding of random variables, that a normal-like distribution is likely.

I'm playing around with the possibility of a novel version of the game of solitaire. It would be useful to have a target distribution for the purposes of rating the player's results. Initially, I would use the distribution to tell me if the data is getting closer to the target distribution as more data comes in. That could also tell me if I have the right distribution model. Later, I would see how it goes and how I might make use of that model.

#### Dason

I don't think a normal distribution is great here (for lots of reasons). Is there a reason the empirical distribution based on the data you have can't just work?

#### Jennifer Murphy

##### Member
Lots of reasons? Such as?

What do you mean by "empirical distribution"?

I was hoping I could find a reasonably good distribution formula so I could periodically do something like a least squares calculation to (a) determine whether the accumulating data is actually getting closer to the theoretical target, (b) see whether a particular score is helping or hurting the overall distribution, and (c) compare the scores from different players.
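The least squares idea described here could be sketched roughly as follows. This assumes, purely for illustration, that the target is a normal curve with the mean and standard deviation quoted earlier in the thread; the two tallies are invented snapshots, not the real data, and `sum_squared_error` is a hypothetical helper.

```python
from statistics import NormalDist

# Hypothetical target: a normal with the mean and std dev quoted in the thread.
target = NormalDist(85.36853448, 7.533296572)
SUPPORT = range(52, 251)   # move counts compared; the upper end is arbitrary

def sum_squared_error(tally, dist, support=SUPPORT):
    """Sum over move counts of (observed proportion - model probability)^2.

    The model probability of an integer move count k is taken as the
    normal mass on the interval [k - 0.5, k + 0.5].
    """
    n = sum(tally.values())
    total = 0.0
    for k in support:
        observed = tally.get(k, 0) / n
        expected = dist.cdf(k + 0.5) - dist.cdf(k - 0.5)
        total += (observed - expected) ** 2
    return total

# Invented snapshots of an accumulating tally {moves: wins}; not the real data.
early = {80: 2, 85: 1, 95: 2}
later = {75: 3, 80: 8, 85: 12, 90: 7, 95: 3}

sse_early = sum_squared_error(early, target)
sse_later = sum_squared_error(later, target)
print(f"SSE early={sse_early:.4f}  later={sse_later:.4f}")
```

Calling the same function before and after adding a single score would show whether that score moved the tally toward or away from the chosen target. The result is only as meaningful as the choice of target curve.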

If you don't like the normal for this, is there something else that you do like? What if I convert all of the scores to probabilities and use a beta? I'm not sure how to do that, but some reading I have done suggests it might be an option. In anticipation of that option, I have created a Prob column in which I converted the wins to probabilities.

#### Dason

Beta makes even less sense. What exactly do you mean by "theoretical target"? Honestly, I don't think there is a single theoretical target for the game. Like you said, each player and play style impacts things, so how do you determine that "target"?

An empirical distribution is just the distribution based on the data. So you could collect all the data from the users and just compare them to each other. There really is no theoretical distribution to compare against unless you perfectly define the play strategy and figure out the probabilities for each outcome.
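A minimal sketch of the empirical approach described above: build an empirical cumulative distribution from each tally and compare players directly, with no fitted formula. The two player tallies are invented, and `ecdf` and `max_ecdf_gap` are hypothetical helper names.

```python
def ecdf(tally):
    """Return F, where F(x) is the share of games won in x moves or fewer."""
    n = sum(tally.values())
    points = sorted(tally.items())

    def F(x):
        return sum(c for k, c in points if k <= x) / n

    return F

def max_ecdf_gap(tally_a, tally_b):
    """Largest vertical gap between two empirical CDFs.

    This is the two-sample Kolmogorov-Smirnov statistic; the gap can only
    change at observed values, so checking those points is sufficient.
    """
    Fa, Fb = ecdf(tally_a), ecdf(tally_b)
    return max(abs(Fa(x) - Fb(x)) for x in set(tally_a) | set(tally_b))

# Invented tallies {moves: wins} for two hypothetical players.
player_a = {70: 2, 80: 5, 90: 3}
player_b = {80: 2, 90: 5, 100: 3}

gap = max_ecdf_gap(player_a, player_b)
share = ecdf(player_a)(80)   # share of player A's wins taking 80 moves or fewer
print(f"max ECDF gap={gap:.2f}  A's share at <=80 moves={share:.2f}")
```

The same `F(x)` built from all players pooled together gives a percentile for any individual score, which is one way to rate a result without assuming a distribution.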

#### Jennifer Murphy

##### Member
I don't think a normal distribution is great here (for lots of reasons).
Hmmm... "lots of reasons"? Would it be too much trouble to explain a few of them?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Think about where the outliers are above, and about the calculation of the mean: would plus or minus the standard deviation be symmetrical around the measure of central tendency?
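One way to check the symmetry question numerically is to compare the mean with the median and compute the sample skewness. The sketch below uses a toy tally in which only the 63, 69 and 173 rows echo the thread and the rest is invented, so the numbers it prints say nothing about the real data.

```python
from statistics import mean, median, pstdev

# Toy tally {moves: wins}; only rows 63, 69 and 173 echo the thread,
# the remaining rows are invented for illustration.
tally = {63: 1, 69: 3, 75: 8, 80: 15, 85: 20, 90: 14, 95: 7, 100: 3, 173: 3}

# Expand the tally into one entry per game.
games = [moves for moves, wins in tally.items() for _ in range(wins)]

m, med, sd = mean(games), median(games), pstdev(games)

# Third standardized moment: positive when the long tail is on the right.
skewness = sum((x - m) ** 3 for x in games) / (len(games) * sd ** 3)

# Count games within one standard deviation on each side of the mean.
below = sum(1 for x in games if m - sd <= x < m)
above = sum(1 for x in games if m < x <= m + sd)

print(f"mean={m:.2f}  median={med}  skewness={skewness:.2f}")
print(f"within 1 sd below mean: {below}, above mean: {above}")
```

A mean well above the median, together with positive skewness, indicates that high-move outliers are pulling the mean upward, which is the kind of asymmetry a normal curve cannot reproduce.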

#### Jennifer Murphy

##### Member
Think about where the outliers are above, and about the calculation of the mean: would plus or minus the standard deviation be symmetrical around the measure of central tendency?
I agree that the outliers "look" problematic, but are they? This is data about human behavior, not a biological process like height, weight, or coin tosses. Humans have bad days.

I don't think it should be symmetrical. It has a hard lower limit of 52. That's why I asked about a truncated normal.

But this doesn't address the fundamental questions:

1. Is there a test I can run that will tell me which underlying distribution best fits this data? A couple of people suggested something in R. I asked if anyone who has this package could run a test.

2. If I assume that it is a truncated normal distribution, how can I calculate the best fit?
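On the second question, one common approach is maximum likelihood: choose the mean and standard deviation that make the observed tally most probable under a normal truncated at 52. The sketch below is a crude illustration using only the standard library and a simple pattern search. It is not how a dedicated package such as fitdistrplus does it. The tally is a toy example (only rows 63, 69 and 173 echo the thread), and the function names are hypothetical.

```python
from math import erfc, inf, log, pi, sqrt

LOWER = 52  # hard floor on the number of moves

# Toy tally {moves: wins}; only rows 63, 69 and 173 echo the thread.
tally = {63: 1, 69: 3, 75: 8, 80: 15, 85: 20, 90: 14, 95: 7, 100: 3, 173: 3}

def neg_log_likelihood(tally, mu, sigma, lower=LOWER):
    """Negative log-likelihood of a normal(mu, sigma) truncated to x >= lower."""
    # P(X >= lower) for the untruncated normal, via the complementary error function.
    survival = 0.5 * erfc((lower - mu) / (sigma * sqrt(2)))
    if survival <= 0.0:
        return inf
    total = 0.0
    for x, count in tally.items():
        z = (x - mu) / sigma
        log_pdf = -0.5 * z * z - log(sigma) - 0.5 * log(2 * pi)
        total -= count * (log_pdf - log(survival))
    return total

def fit_truncated_normal(tally, mu, sigma, step=4.0, tol=1e-4, max_iter=10_000):
    """Pattern search: try +/- step on each parameter, halve the step when stuck."""
    best = neg_log_likelihood(tally, mu, sigma)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for d_mu, d_sigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            m, s = mu + d_mu, sigma + d_sigma
            if s <= 0:
                continue
            value = neg_log_likelihood(tally, m, s)
            if value < best:
                mu, sigma, best, improved = m, s, value, True
        if not improved:
            step /= 2
    return mu, sigma, best

# Start the search from the plain mean and standard deviation of the tally.
n = sum(tally.values())
mean0 = sum(x * c for x, c in tally.items()) / n
sd0 = sqrt(sum(c * (x - mean0) ** 2 for x, c in tally.items()) / n)

start_nll = neg_log_likelihood(tally, mean0, sd0)
mu_hat, sigma_hat, fit_nll = fit_truncated_normal(tally, mean0, sd0)
print(f"start: mu={mean0:.2f} sigma={sd0:.2f} nll={start_nll:.2f}")
print(f"fit:   mu={mu_hat:.2f} sigma={sigma_hat:.2f} nll={fit_nll:.2f}")
```

Running the same likelihood for several candidate families and comparing the fitted values is one rough way to approach the first question as well, although a better likelihood alone does not prove that a family is the right one.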

#### hlsmith

##### Less is more. Stay pure. Stay poor.
I haven't read the above posts, but are the observations independent? Meaning no one person is contributing multiple observations to these data, and there are no interactions between the individuals contributing data that would make values dependent?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Determining an underlying distribution is usually based on context knowledge and assumptions. People often use goodness-of-fit measures (observed versus expected values) and perhaps hold out some data to try to validate the options.

P.S., Wins sounds like a count - however, if the average count (lambda) is 8 or larger, the normal distribution can be a reasonable approximation. Your overlaid normal distribution doesn't look great or terrible. What is the purpose of defining the underlying distribution and can the population or confines of the system change across time?

Yeah, beta is likely poor since it is bounded between 0 and 1, though the gamma dist can be flexible.
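The observed-versus-expected comparison mentioned above can be sketched as a Pearson chi-square statistic over a few bins. This is a minimal illustration under stated assumptions: the candidate curve is a normal with the mean and standard deviation quoted earlier in the thread, the bin edges are arbitrary, the tally is a toy example (only rows 63, 69 and 173 echo the thread), and `chi_square_stat` is a hypothetical helper.

```python
from math import inf
from statistics import NormalDist

def chi_square_stat(tally, dist, edges):
    """Pearson chi-square statistic for a tally against `dist`, binned at `edges`.

    Bins are (-inf, e1], (e1, e2], ..., (ek, inf).
    """
    n = sum(tally.values())
    bounds = [-inf, *edges, inf]
    stat = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(c for x, c in tally.items() if lo < x <= hi)
        p_lo = 0.0 if lo == -inf else dist.cdf(lo)
        p_hi = 1.0 if hi == inf else dist.cdf(hi)
        expected = n * (p_hi - p_lo)
        stat += (observed - expected) ** 2 / expected
    return stat

# Toy tally {moves: wins}; only rows 63, 69 and 173 echo the thread.
tally = {63: 1, 69: 3, 75: 8, 80: 15, 85: 20, 90: 14, 95: 7, 100: 3, 173: 3}

candidate = NormalDist(85.36853448, 7.533296572)   # values quoted in the thread
edges = [75.5, 82.5, 87.5, 94.5]                   # arbitrary illustrative bins

stat = chi_square_stat(tally, candidate, edges)
print(f"chi-square statistic over {len(edges) + 1} bins: {stat:.2f}")
```

A smaller statistic means the binned counts sit closer to what the candidate curve predicts. Turning it into a p-value requires a chi-square table or a statistics library, and the usual rule of thumb is to choose bins wide enough that each expected count is at least about 5.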


#### Dason

I don't personally think it looks great. For the area above the average nothing looks appropriate.

#### Jennifer Murphy

##### Member
P.S., Wins sounds like a count - however, if the average count (lambda) is 8 or larger, the normal distribution can be a reasonable approximation.
Yes, wins is a count. When a player finishes a game, the program records the number of moves. If the player took 85 moves to complete the game, the tally for 85 is incremented by 1.

Your overlaid normal distribution doesn't look great or terrible.
It's not a perfect normal, for sure, but it sure "looks" normal-ish to me. Almost all of the outliers are by one player, but I wanted to keep them in the tallies because there will certainly be other similar players.

I will soon have another dataset from different players. There are very few outliers with this group. I'll post it in a bit.

What is the purpose of defining the underlying distribution and can the population or confines of the system change across time?
The purpose is largely curiosity and education at the moment. Depending on what I learn, I may find more of a use for it.

Yeah, beta is likely poor since it is bounded between 0 and 1, though the gamma dist can be flexible.
From what I read, I thought I might be able to try a beta if I convert the tallies to probabilities. Here's a sample:

#### Jennifer Murphy

##### Member
I don't personally think it looks great. For the area above the average nothing looks appropriate.
What you said was that there are "lots of reasons" why it is not a normal distribution. So far, you have only given one reason and that is just your opinion. It is not backed up by anything. If you really do have "lots" of other reasons, and were not just blowing smoke, I challenge you to give at least 2-3 more.