# What type of distribution is this and am I being sensible?

K

#### Kobye

##### Guest
I have a computer model, which I need to run 3500 times with different input parameters.

The model's accuracy depends on a constant, let's call it T, which unlike the input parameters doesn't change between model runs. By decreasing T, the model's accuracy improves but the computational cost increases.

I have determined by trial-and-error that a certain value of T gives me a good compromise in terms of accuracy and computational cost.

In order to show that T gives me acceptable accuracy for all 3500 model runs, I have performed a statistical analysis on a sample of 300 randomly selected runs of the total population of 3500 runs (as an aside, the different inputs for the model runs are uniformly distributed in the unit hypercube).

I performed each of these 300 model runs twice, once with the aforementioned T and once with a slightly smaller value. By taking the difference in model output for each run, I was able to calculate a convergence error for each individual.

Please see the results in the below histogram. I have also attached the raw data in a text file. The statistics for the data are as follows:

Mean = -0.31079
Median = 0.0196
Skewness = -1.1903
Kurtosis = 44.0934

The kurtosis value is really high. I don't know what type of distribution best describes this data really, I haven't seen anything that looks much like it.

When I try and fit different probability distributions to this data in Matlab, it seems that the Logistic distribution gives me the best R2 value. However, it still leaves a lot to be desired, when I qualitatively compare the two curves. At least it's better than the normal distribution which this data is clearly not (see below). What type of distribution should I use? Ultimately, what I am trying to do is to work out the population mean and variance based on my sample statistics. Then I would like to say something like, "based on this mean and variance, we conclude that 95% of the model runs have a convergence error of less than 5%. This justifies our choice of T."

I can qualitatively see from the histogram that this is true at least for the sample.

So, is what I am trying to do sound statistics, or should I use another approach?

Your thoughts/help is highly appreciated!

Last edited by a moderator:

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Just joking but they need to come up with the FU distribution for the extreme mean peaks. This looks like the cover of a kid rock album.

#### rogojel

##### TS Contributor
hi,
how about using a non-parametric approach? IMHO there is no point in using a sample here, as you are really throwing away information you already gathered .
i would just calculate the span between P5 and P95 from all data points and report that number.

regards
rogojel

K

#### Kobye

##### Guest
Thanks for your input rogojel. I've only performed the convergence analysis for 300 of the model runs. As mentioned, this required me to perform each of these runs with given input parameters twice. The rest of the 3500 model runs, I ran only once each so I don't know their convergence error.

The type of computer model I am talking about takes about half an hour to run. It is in my interest not to have to run it another 3200 times for the entire population.

Last edited by a moderator:

#### Miner

##### TS Contributor
How were the data collected? Was it in time sequence or did you sort it? If it is in time sequence, it is monotonically increasing along a very symmetrical curve that appears very artificial.

K

#### Kobye

##### Guest
EDIT: The raw data in the attached text file is sorted.

The data was collected as follows:

My computer model takes as input a set of 2 independent variables (in reality there are more, but let's assume just 2 for simplicity's sake).

i.e. if my model is f(x) then x = {a,b}.

0<a<1
0<b<1

{a,b} is sampled uniformly on the unit square, so if you imagine a on the x-axis and b on the y-axis, my input space looks like: So effectively I now have 3500 input vectors, {a,b}, each dot on the image above representing one such vector.

From these 3500 input vectors, I choose 300 by sampling uniformly at random, without replacement.

For each of the 300 input vectors in the sample, I calculated a convergence error, which is very costly to do (many hours of CPU time). This is in effect my data.

I would now like to use what I know about the accuracy of the model for each of these 300 input vectors, to say something about the accuracy of the model for the entire population of 3500 input vectors.

Last edited by a moderator:
K

#### Kobye

##### Guest
I just found that the Laplace (double exponential) distribution gives me a much better fit. Would this be a reasonable distribution to use? The only thing that makes me hesitate is that my skewness, at -1.1 is less than -1 as is typically recommended for a symmetrical distribution.

#### Miner

##### TS Contributor
I would map the extreme highs and extreme lows on the unit square in different colors to see if specific zones cause the extreme values.

K

#### Kobye

##### Guest
Hi Miner, thanks for your input. Well, there's a good chance that the extreme values are caused by the data from the edge of the unit square.

Could you kindly explain to me how this knowledge might aid my statistical analysis?

At the end of the day, my population of 3500 is fixed. I generated it once using well respected methods, and it is what I have to work with. I am actually very interested in the data points on the edge of the unit square, as this work is part of a global sensitivity analysis.

#### GretaGarbo

##### Human
Could it be that when you create your variable you have created a t-distribution? If your formulas are inspected (sorry I didn't immediately understand your description above) then is it possible that you derive a t-distribution?

O could it be that you are dividing one normal distribution with an other? That would give you a Cauchy distribution?

It looks like from your first diagram that there are heavy tails, more heavy than the normal distribution.

The t-distribution and the Cauchy distribution have heavy tails, the lower the degrees of freedom the heavier tail. In fact the Cauchy is a t-distribution with one degrees of freedom.

Also you could estimate a power exponential distribution. That have the Laplace distribution (also called the double exponential distribution) and the normal distribution as special cases.

Both can be estimated with the gamlss package in R. (Also a skewness parameter can be estimated there.)

- - -

Rereading the OP: Maybe this is just a part of the whole investigation that you really want to do with the simulation. It is better to tell the whole story than just trying to solve a small part of it.

K

#### Kobye

##### Guest
Hi Greta, fantastic response! Well, although I'm not sure it's relevant to the topic, I am using the extended fourier amplitude sensitivity test (eFAST, Saltelli et al, 1999) to calculate the sensitivity of my model to its inputs. In order to do this, I need to run the model for a number of different input vectors, as I have described.

The following is an aside covering the sensitivity analysis (SA) that I am performing, not the separate statistical analysis that is the topic of this thread.
In the SA, the way the input vectors are chosen depends on the number of input parameters and number of runs that the user can reasonably perform, and is done so that it fills the unit hypercube in an optimal way. This allows us to truly explore the design space. Let our design space be the unit square. This 2-dimensional space can be mapped to a mono-dimensional space (i.e. a curve) that, by Weyl's theorem, drives arbitrarily close to any point x (i.e. input vector) of the input domain. This transformation is done at a different frequency wi for each input variable xi. For the 2-dimensional domain, we take w={11,21}. Various summary statistic for the entire space can hence be calculated by integrating this curve (now a mono-dimensional rather than multi-dimensional integral!) The calculated statistics show different periodicities associated with each input variable's frequency, wi. This is the basis for computing a sensitivity measure.

I think your question asking how I came up with the 3500 input vectors could be relevant to my statistical analysis, however.

The space-filling curve s extends from 0 to 2Pi in steps of 2Pi/Ns where Ns is the sample size. Although for my actual model (which has more than 2 inputs) it's 3500, in the example below I've used Ns=800.

For each point along this curve, I calculate {a,b} i.e. x(s)={a(s),b(s)}.
The formula by which a, b (in other words x1 and x2) are calculated is given in the eFAST paper and is:
xi(s)=1/2 + 1/Pi *arcsin(sin(wi*s + phi) where phi is a phase shift parameter, randomly sampled in (0,2Pi] I hope this adds something to the discussion! I wouldn't know whether this method of sampling causes a t-distribution.

Last edited by a moderator:
K

#### Kobye

##### Guest
I followed up on your advice and tried fitting a t location-scale distribution. This seems to fit very well indeed. It also alleviates my skewness and kurtosis concerns! #### GretaGarbo

##### Human
I followed up on your advice and tried fitting a t location-scale distribution. This seems to fit very well indeed. It also alleviates my skewness and kurtosis concerns!
It is nice to hear that it fits well! But it can't make the kurtosis to disappear. If it is in the data it will still be there.

Kobye you said you had a "convergence error %". That percent thing looks worrying. If you divide it with something that is very small sometimes, you will get big numbers. I would prefer to not use percentage. It is customary in statistics to use actual values.

You have used a variable you called T, maybe a threshold or something. I would prefer to call that theta, like a statistical parameter.

From your sample of 300 you have done two experiment (or computations) where the response is Y1 and Y2:

Y1 = g(x1, x2; theta) and Y2 = g(x1, x2; theta+delta)

From this you created a residual: e = Y1 - Y2
(You called that convergence error.)

Miner said:
I would map the extreme highs and extreme lows on the unit square in different colors to see if specific zones cause the extreme values.
I agree! That could give a clear visual impression if the "spread" of the residuals varies over the x1, x2 space.

And Kobye agreed:
Well, there's a good chance that the extreme values are caused by the data from the edge of the unit square.
If it would be so that the kurtosis or skewness or variance is different over different regions of x1 and x2 then that good to know, and important for evaluating the quality for of the larger group of 3500.

in gamlss it is possible to estimate how the mean, variance, skewnes adn kurtosis is influenced by variables like x1 x2. You have a distribution suggestion, the t-distribution or power exponential, that seems to work.

I don't know if your language is using the expression 'bathtub curve', something that is very flat in the middle but rises steeply at the edges. Maybe the variance (or kurtosis) is like that here.

Tagushi design: I am not sure if I agree that theta should be kept constant. Or delta. Maybe there is room for a robust construction design here. More about that later.

K

#### Kobye

##### Guest
Again, a very high quality reply. You've also made it clear to me what Miner meant before. That would sure be interesting to check.

I certainly agree that kurtosis is inherent to the data and is not affected by what type of distribution I decide to use. What I meant was that for a t-distribution (and I'm assuming this is the same for a t Location-Scale distribution, but I should check), if nu < 1 (as it is for my data) then skewness and kurtosis are undefined. So I'm no longer worried that the kurtosis is 41 when e.g. the Logistic distribution is supposed to have a kurtosis of 3. Maybe this is a foolish way of thinking.

I also agree that percentage isn't a great random variable, but let me motivate why I am using it.

My model is basically a numerical code that finds an approximate solution to a boundary value problem involving several partial differential equations. The accuracy of the solution is affected by the discretization of the domain... a finer mesh will cause the solution to converge to the true value, but can severely impact computation time. So my tuning parameter (I previously called it T but lets continue with theta) is really just the average mesh size.

The model output that I'm considering is the displacement of a structure at a certain point in the domain. In my field of computational mechanics, we define a convergence error as (Y2 - Y1)/Y1... when this convergence is sufficiently small, typically less than 5%, we consider the solution to be reliable.

So the reason I am reporting a percentage is because this is the established way of demonstrating that the model is converged. I have checked the raw data and there are no obvious suspiciously small values of Y1 that could be causing extreme values in my data.

Ideally I would have a closed-form model and I wouldn't need to worry about convergence or whether I've made a good choice of theta. In that case, as you put it;

Y = g(x1, x2) -- eqn.1

However, my domain is irregular so it doesn't have a closed form solution. I must use approximate numerical techniques to solve. As a result, I have

Y = g(x1, x2; theta) -- eqn.2

By trial and error and a good dose of conservatism, I chose a value of theta that should provide me reliable results for practically all of my runs. This will allow me to keep theta fixed and effectively treat eqn. 2 as though it were eqn. 1. What I'm trying to do now, I suppose, is use estimation theory to justify that the vast majority of my model runs are reliable. The data seems to be showing me this, although I am open to other interpretations.

K

#### Kobye

##### Guest
Although... As I mentioned my model actually has more than 2 inputs -- 14 to be precise. These are uniformly distributed in the unit hypercube.

However, I can't plot a 14-dimensional cube... How could I visually or otherwise inspect which regions of my input space are causing the few outliers (convergence errors >5%)?

This would be interesting to know, because there are only very few of these and if I know in which regions they occur, I could rerun just these models with a finer mesh and virtually eliminate error in my experiment.

#### Miner

##### TS Contributor
If you understand the response by region , you could sample less densely in the flat region and more densely in the regions where you see more variability thus keeping the number of calculations to an acceptable level.

#### GretaGarbo

##### Human
Well you said that your x-variables were uncorrelated. You need to check this. Print the correlation matrix for x1, x2, ...,x14. Also, it is always nice to do graphs, so a scatter plot matrix (SPLOM) is always nice to have a look at. If they are uncorrelated then the two dimensional plot of (xi, xj) will be like a marginal distribution from the multivariate 14 dimensional distribution. Thus, you can plot x1 versus x2 and have different colours for the size of e. So such a graph can have a value. (It would be nice if you posted it here.) It will be many pairs (14*13/2 =91) but plot say, five or ten.

For each of your 300 pairs of residuals (ei) compute the sample variance s^2. To not let it vary to much, take the log, like: lv = log(s^2) where lv is "log-variance". Then you will have 300 "observations" of lv.

If it is true that the "variability" is constant over the whole region of x1,...,x14, the you expect lv to not be influenced by the x-variables. So estimate the regression model:

lv = b0 + b1*x1+b2*x2+ ...+b14*x14 + b11*x1^2+ b22*x^2+ ...+b1414*x14^2 + residual

plot the curve for: lv=b0 + b1*x1+b11*x1^2

You expect the bii to be positive so that the curve bends upwards towards the edges. If any of the b-terms is significant (except for b0) then you have evidence that the "variability" is not constant, so that there is something to gain.

Omitted in the above model is all the 91 pairs of interaction (like: b37*x3*x7+...+b59*x5*x9...). If the main effect or the squared effect is significant then include that interaction effect.

(If you think it is more natural you can centre the variables, like: x1new = x1-0.5, since 0.5 is the centre and mean in the design.)

The above regression model, the "response surface" as it is often called, can be thought of as a multidimensional Taylor series expansion, where the bij-coefficients estimates the derivatives.

Well, there's a good chance that the extreme values are caused by the data from the edge of the unit square.
One could try to reduce the problem to a one dimensional problem by computing the distance from the centre. Let r be one points distance from the centre: r^2 =((x1-0.5)^2+(x2-0.5)^2+...+(x14-0.5)^2)

Then plot the residuals ei versus r, and si^2 vs r, and lvi vs r, and do the regression: lv = a0+a1*r+a2*r^2 + rest

You can also do histogram for values r>0.75 say and another for those where r<0.75, Or slice it up for every r 0-0.10, 0.10-0.20... and do boxplots and histograms for that.

Still I don't accept to take the percentage of or residuals, like in ei/Yi. (You can do that at the last step as a description.) Suppose if the variation is not constant for all Y.

It can be good to plot ei versus Yi and look if there is any pattern. Increasing variation for larger Y, thus heteroscedasticity. A standard plot is to plot the lv versus Y, to check for heteroscedasticity.

#### GretaGarbo

##### Human
If you understand the response by region , you could sample less densely in the flat region and more densely in the regions where you see more variability thus keeping the number of calculations to an acceptable level.
I agree. But right now Kobye has got 300 pairs that is supposed to say something about the 3500 values (the "population") I guess she is not going to be allowed to, and get the money for, to run additional runs right now. I guess that first it is needed evidence for where the results are accurate and where it is not. Later on maybe other methods can be used, including varying the theta parameter.

A more modern method to the response surface could be non-parametric smoothing like generalised additive models (gam). That would estimate a smoothed curve among the 14 x-variables. But the response surface is a good start.

I think it is time that Kobye tells us what software she is using. Is it matlab? And thereby what restrictions there is about what can be computed.

K

#### Kobye

##### Guest
I am using MATLAB and Abaqus. I use MATLAB to generate a python script, then call Abaqus from the command line with this python script. Abaqus then runs (this is the expensive part, it's about 4 minutes per run with theta=0.6, 20 minutes per run with theta=0.4). When Abaqus is done running, I use MATLAB to read the Abaqus output file and extract the desired displacement data.

The MATLAB parts are very cheap, the Abaqus part is computationally expensive (my model is nonlinear and requires a fair few iterations to solve).

I am currently creating the scatter plot matrix, using MATLAB. I will post the results.

K

#### Kobye

##### Guest
EDIT: Correction, I actually have 13 input x-variables, not 14.

Okay, I created the scatter plot matrix. Each and every of the scatter plots follows the uniform marginal distribution from the multivariate 14 dimensional distribution. It therefore does indeed appear that my x-variables are uncorrelated!
The scatter plots arent nicely scaled, but the boundaries are clear from the data themselves. I would like to explain the straight lines along the bottom and right. These are the convergence error row and column, which I included in my splom. I thought maybe it could tell us something interesting?
Also, I've attached the raw data including xi values in a text file.

Also, note that my range of values for each xi don't range from 0-1 as I've been using as an example so far. That's clear given the axis limits on the scatter plot matrix. But they are still uniformly distributed in the hypercube. They can easily be normalized to the unit hypercube.

I will now proceed to try and plot different values of e in different colors.

Last edited by a moderator: