
I have a computer model, which I need to run 3500 times with different input parameters.

The model's accuracy depends on a constant, let's call it T, which unlike the input parameters doesn't change between model runs. Decreasing T improves the model's accuracy but increases the computational cost.

I have determined by trial and error that a certain value of T gives me a good compromise between accuracy and computational cost.

To show that this T gives me acceptable accuracy for all 3500 model runs, I have performed a statistical analysis on a sample of 300 runs selected at random from the total population of 3500. (As an aside, the inputs for the model runs are uniformly distributed in the unit hypercube.)

I performed each of these 300 model runs twice, once with the aforementioned T and once with a slightly smaller value. By taking the difference in model output between the two, I was able to calculate a convergence error for each run.
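
For concreteness, here is a minimal Python sketch of how such a per-run convergence error could be computed. It assumes the error is the signed difference between the two outputs, expressed as a percentage of the output from the smaller T. That percentage form is an assumption made to match the "less than 5%" statement further down. The function name and the three example values are hypothetical and are not the real data.

```python
def convergence_errors(outputs_T, outputs_smaller_T):
    """Signed relative difference, in percent, for each paired run.

    outputs_T[i]         : output of run i with the chosen T
    outputs_smaller_T[i] : output of the same run with the slightly smaller T
    Assumption: the smaller-T output is used as the reference value.
    """
    if len(outputs_T) != len(outputs_smaller_T):
        raise ValueError("each run needs one output per value of T")
    errors = []
    for y_t, y_ref in zip(outputs_T, outputs_smaller_T):
        if y_ref == 0:
            raise ValueError("relative error is undefined for a zero reference")
        errors.append(100.0 * (y_t - y_ref) / abs(y_ref))
    return errors


# Hypothetical outputs for three runs (illustration only).
example_T = [10.2, 4.95, 7.0]
example_smaller_T = [10.0, 5.0, 7.0]
example_errors = convergence_errors(example_T, example_smaller_T)
print(example_errors)
```

If the outputs can be close to zero, an absolute difference may be a safer definition than a relative one.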

Please see the results in the histogram below. I have also attached the raw data as a text file.

The statistics for the data are as follows:

Mean = -0.31079

Median = 0.0196

Skewness = -1.1903

Kurtosis = 44.0934
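
As a cross-check outside Matlab, the same four statistics can be sketched in plain Python. This assumes the moment-based definitions without bias correction, where kurtosis is not "excess" kurtosis, so a normal distribution scores 3. I believe these are Matlab's defaults, but that is worth confirming. The sample values are hypothetical and only show how a single large negative error pulls the mean below the median.

```python
import statistics


def summary_stats(x):
    """Mean, median, skewness and kurtosis from central moments.

    Assumption: no bias correction, and kurtosis is m4 / m2**2
    (a normal distribution gives 3, not 0).
    """
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(x),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical errors in percent: five small values and one large negative one.
sample = [0.02, 0.01, 0.03, -0.01, 0.02, -12.0]
stats = summary_stats(sample)
for name, value in stats.items():
    print(name, value)
```

With data like this the mean is negative while the median stays positive, which is the same pattern as in the statistics above.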

The kurtosis value is really high. I don't know what type of distribution best describes this data; I haven't seen anything that looks much like it.

When I try to fit different probability distributions to this data in Matlab, the logistic distribution seems to give me the best R² value. However, the fit still leaves a lot to be desired when I compare the two curves qualitatively. At least it is better than the normal distribution, which clearly does not describe this data (see below).

What type of distribution should I use? Ultimately, what I am trying to do is to work out the population mean and variance based on my sample statistics. Then I would like to say something like, "based on this mean and variance, we conclude that 95% of the model runs have a convergence error of less than 5%. This justifies our choice of T."

I can qualitatively see from the histogram that this is true at least for the sample.
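
To put a number on that visual impression, the fraction of sampled runs within the tolerance can be counted directly, with no distribution fitted. The sketch below also adds a Wilson score interval for that fraction. That interval is not part of my analysis so far, just one standard way to attach an uncertainty to a sample proportion. It assumes errors in percent and a tolerance of 5 on the absolute error. The function names and example values are hypothetical.

```python
import math
from statistics import NormalDist


def count_within(errors, tol=5.0):
    """Return (number of runs with |error| < tol, total number of runs)."""
    hits = sum(1 for e in errors if abs(e) < tol)
    return hits, len(errors)


def wilson_interval(hits, n, confidence=0.95):
    """Wilson score interval for a binomial proportion hits / n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    p = hits / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical errors in percent (illustration only, not the real data).
demo_errors = [0.1, -6.0, 4.9, 5.0, -0.3, 0.02, 1.2, -0.8]
demo_hits, demo_n = count_within(demo_errors)
demo_low, demo_high = wilson_interval(demo_hits, demo_n)
print(demo_hits, demo_n, demo_low, demo_high)
```

With the real 300 errors, the lower end of the interval would be the figure to compare against the 95% target. This treats the 300 runs as a simple random sample and ignores the finite population of 3500.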

So, is what I am trying to do sound statistics, or should I use another approach?

Your thoughts and help are highly appreciated!
