I have a computer model, which I need to run 3500 times with different input parameters.
The model's accuracy depends on a constant, let's call it T, which unlike the input parameters doesn't change between model runs. By decreasing T, the model's accuracy improves but the computational cost increases.
I have determined by trial-and-error that a certain value of T gives me a good compromise in terms of accuracy and computational cost.
In order to show that T gives me acceptable accuracy for all 3500 model runs, I have performed a statistical analysis on a sample of 300 runs selected at random from the total population of 3500 (as an aside, the different inputs for the model runs are uniformly distributed in the unit hypercube).
I performed each of these 300 model runs twice, once with the aforementioned T and once with a slightly smaller value. By taking the difference in model output between the two, I was able to calculate a convergence error for each individual run.
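The paired-run setup above can be sketched in Python. This is a toy stand-in, not the poster's actual model: `f`, the values of `T`, and the input dimensions are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the expensive model: f(x, T) is an assumption here --
# the real model is a numerical solver whose accuracy depends on T.
def f(x, T):
    return float(np.sum(x) + T * np.sin(50.0 * np.sum(x)))

T = 0.10          # the chosen compromise value (hypothetical number)
T_small = 0.05    # a slightly smaller, more accurate value (hypothetical)

inputs = rng.random((300, 2))                  # 300 sampled input vectors
y1 = np.array([f(x, T) for x in inputs])
y2 = np.array([f(x, T_small) for x in inputs])
errors = y2 - y1                               # one convergence error per run
```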
Please see the results in the below histogram. I have also attached the raw data in a text file.
The statistics for the data are as follows:
Mean = -0.31079
Median = 0.0196
Skewness = -1.1903
Kurtosis = 44.0934
The kurtosis value is really high. I don't know what type of distribution best describes this data; I haven't seen anything that looks much like it.
When I try to fit different probability distributions to this data in Matlab, the Logistic distribution seems to give me the best R2 value. However, it still leaves a lot to be desired when I qualitatively compare the two curves. At least it's better than the normal distribution, which this data clearly does not follow (see below).
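The same kind of fit comparison can be done outside Matlab. Below is a sketch in Python using scipy, on a hypothetical heavy-tailed sample standing in for the attached raw data; it ranks candidate distributions by log-likelihood and the Kolmogorov-Smirnov statistic rather than R2.

```python
import numpy as np
from scipy import stats

# Hypothetical heavy-tailed sample standing in for the attached raw data.
rng = np.random.default_rng(1)
errors = rng.standard_t(df=2, size=300)

# Compare candidate fits by log-likelihood and the K-S statistic
# (a rough analogue of comparing fitted curves by R2 in Matlab).
for dist in (stats.norm, stats.logistic, stats.laplace):
    params = dist.fit(errors)
    loglik = np.sum(dist.logpdf(errors, *params))
    ks = stats.kstest(errors, dist.name, args=params).statistic
    print(f"{dist.name:10s} loglik={loglik:9.2f}  KS={ks:.3f}")
```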
What type of distribution should I use? Ultimately, what I am trying to do is to work out the population mean and variance based on my sample statistics. Then I would like to say something like, "based on this mean and variance, we conclude that 95% of the model runs have a convergence error of less than 5%. This justifies our choice of T."
I can qualitatively see from the histogram that this is true at least for the sample.
So, is what I am trying to do sound statistics, or should I use another approach?
Your thoughts/help are highly appreciated!
Just joking, but they need to come up with the FU distribution for the extreme mean peaks. This looks like the cover of a Kid Rock album.
hi,
how about using a non-parametric approach? IMHO there is no point in using a sample here, as you are really throwing away information you already gathered.
i would just calculate the span between P5 and P95 from all data points and report that number.
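rogojel's suggestion amounts to reporting empirical percentiles directly. A minimal Python sketch, on hypothetical data (the real values are in the attached file):

```python
import numpy as np

# Hypothetical convergence errors (in %) for the 300 sampled runs.
rng = np.random.default_rng(2)
errors = rng.laplace(loc=0.0, scale=1.0, size=300)

p5, p95 = np.percentile(errors, [5, 95])
span = p95 - p5
within_5pct = np.mean(np.abs(errors) < 5.0)   # empirical fraction with |error| < 5%
print(f"P5={p5:.2f}  P95={p95:.2f}  span={span:.2f}  frac |e|<5%: {within_5pct:.3f}")
```

No distributional assumption is needed; the percentiles describe the sample as-is.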
regards
rogojel
Thanks for your input rogojel. I've only performed the convergence analysis for 300 of the model runs. As mentioned, this required me to perform each of these runs, with given input parameters, twice. The remaining 3200 model runs I ran only once each, so I don't know their convergence error.
The type of computer model I am talking about takes about half an hour to run. It is in my interest not to have to run it another 3200 times for the entire population.
How were the data collected? Was it in time sequence or did you sort it? If it is in time sequence, it is monotonically increasing along a very symmetrical curve that appears very artificial.
EDIT: The raw data in the attached text file is sorted.
The data was collected as follows:
My computer model takes as input a set of 2 independent variables (in reality there are more, but let's assume just 2 for simplicity's sake).
i.e. if my model is f(x) then x = {a,b}.
0<a<1
0<b<1
{a,b} is sampled uniformly on the unit square, so if you imagine a on the x-axis and b on the y-axis, my input space looks like:
So effectively I now have 3500 input vectors, {a,b}, each dot on the image above representing one such vector.
From these 3500 input vectors, I choose 300 by sampling uniformly at random, without replacement.
For each of the 300 input vectors in the sample, I calculated a convergence error, which is very costly to do (many hours of CPU time). This is in effect my data.
I would now like to use what I know about the accuracy of the model for each of these 300 input vectors, to say something about the accuracy of the model for the entire population of 3500 input vectors.
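The sampling step described above (choosing 300 of the 3500 input vectors uniformly at random, without replacement) can be sketched as follows; the population array here is a stand-in, since the actual vectors come from the eFAST design:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.random((3500, 2))   # stand-in for the 3500 input vectors

# Uniform random sample of 300 indices, without replacement.
idx = rng.choice(3500, size=300, replace=False)
sample = population[idx]
```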
I just found that the Laplace (double exponential) distribution gives me a much better fit.
Would this be a reasonable distribution to use? The only thing that makes me hesitate is that my skewness, at -1.1, falls just outside the range of about ±1 usually expected of a symmetric distribution.
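For reference, a Laplace fit is quick to check in Python with scipy (the data here is a hypothetical stand-in for the attached file):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
errors = rng.laplace(loc=0.0, scale=1.5, size=300)   # hypothetical sample

# Maximum-likelihood fit of the Laplace (double exponential) distribution.
loc, scale = stats.laplace.fit(errors)
ks = stats.kstest(errors, "laplace", args=(loc, scale)).statistic
print(f"loc={loc:.3f}  scale={scale:.3f}  KS={ks:.3f}")
```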
I would map the extreme highs and extreme lows on the unit square in different colors to see if specific zones cause the extreme values.
Hi Miner, thanks for your input. Well, there's a good chance that the extreme values are caused by the data from the edge of the unit square.
Could you kindly explain to me how this knowledge might aid my statistical analysis?
At the end of the day, my population of 3500 is fixed. I generated it once using well respected methods, and it is what I have to work with. I am actually very interested in the data points on the edge of the unit square, as this work is part of a global sensitivity analysis.
Could it be that the way you create your variable gives a t-distribution? If you inspect your formulas (sorry, I didn't immediately understand your description above), is it possible that you derive a t-distribution?
Or could it be that you are dividing one normal distribution by another? That would give you a Cauchy distribution.
It looks from your first diagram like there are heavy tails, heavier than the normal distribution.
The t-distribution and the Cauchy distribution have heavy tails; the lower the degrees of freedom, the heavier the tail. In fact, the Cauchy is a t-distribution with one degree of freedom.
You could also estimate a power exponential distribution, which has the Laplace distribution (also called the double exponential distribution) and the normal distribution as special cases.
Both can be estimated with the gamlss package in R. (A skewness parameter can also be estimated there.)
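If R/gamlss isn't at hand, scipy's `gennorm` is a Python analogue of the (symmetric) power exponential family: its shape parameter beta recovers the Laplace at beta = 1 and the normal at beta = 2. A sketch on hypothetical Laplace-like data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
errors = rng.laplace(size=1000)   # hypothetical sample

# Fit the generalized normal (power exponential) distribution.
# beta near 1 suggests Laplace-like tails; beta near 2 suggests normality.
beta, loc, scale = stats.gennorm.fit(errors)
print(f"beta={beta:.2f}  loc={loc:.3f}  scale={scale:.3f}")
```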
- - -
Rereading the OP: Maybe this is just a part of the whole investigation that you really want to do with the simulation. It is better to tell the whole story than just trying to solve a small part of it.
Hi Greta, fantastic response! Well, although I'm not sure it's relevant to the topic, I am using the extended Fourier amplitude sensitivity test (eFAST; Saltelli et al., 1999) to calculate the sensitivity of my model to its inputs. In order to do this, I need to run the model for a number of different input vectors, as I have described.
The following is an aside covering the sensitivity analysis (SA) that I am performing, not the separate statistical analysis that is the topic of this thread.
In the SA, the way the input vectors are chosen depends on the number of input parameters and the number of runs the user can reasonably perform, and is done so that it fills the unit hypercube in an optimal way. This allows us to truly explore the design space.

Let our design space be the unit square. This 2-dimensional space can be mapped to a mono-dimensional space (i.e. a curve) that, by Weyl's theorem, passes arbitrarily close to any point x (i.e. input vector) of the input domain. This transformation is done at a different frequency wi for each input variable xi; for the 2-dimensional domain, we take w = {11, 21}. Various summary statistics for the entire space can then be calculated by integrating along this curve (now a mono-dimensional rather than a multi-dimensional integral!). The calculated statistics show different periodicities associated with each input variable's frequency wi, which is the basis for computing a sensitivity measure.
I think your question asking how I came up with the 3500 input vectors could be relevant to my statistical analysis, however.
The space-filling curve parameter s extends from 0 to 2*Pi in steps of 2*Pi/Ns, where Ns is the sample size. For my actual model (which has more than 2 inputs) Ns is 3500, but in the example below I've used Ns = 800.
For each point along this curve, I calculate {a,b} i.e. x(s)={a(s),b(s)}.
The formula by which a, b (in other words x1 and x2) are calculated is given in the eFAST paper and is:
xi(s) = 1/2 + (1/Pi)*arcsin(sin(wi*s + phi)), where phi is a phase-shift parameter, randomly sampled in (0, 2*Pi]
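The sampling formula above is straightforward to reproduce; here is a sketch for the 2-D example with w = {11, 21} and Ns = 800 (the random phases are drawn per input, as in the eFAST paper):

```python
import numpy as np

Ns = 800
omegas = [11, 21]                     # frequencies for the 2-D example
s = 2 * np.pi * np.arange(Ns) / Ns    # curve parameter in steps of 2*pi/Ns

rng = np.random.default_rng(6)
X = np.empty((Ns, len(omegas)))
for i, w in enumerate(omegas):
    phi = rng.uniform(0.0, 2.0 * np.pi)              # random phase shift
    X[:, i] = 0.5 + np.arcsin(np.sin(w * s + phi)) / np.pi

# Every coordinate stays inside the unit interval by construction,
# since arcsin(sin(.))/pi lies in [-1/2, 1/2].
```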
I hope this adds something to the discussion! I wouldn't know whether this method of sampling causes a t-distribution.
I followed up on your advice and tried fitting a t location-scale distribution. This seems to fit very well indeed. It also alleviates my skewness and kurtosis concerns!
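Outside Matlab, the same t location-scale fit is available via scipy, where `t.fit` returns the degrees of freedom along with location and scale (hypothetical data below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
errors = rng.standard_t(df=2, size=300)   # hypothetical heavy-tailed sample

# scipy's t.fit returns (df, loc, scale) -- i.e. the t location-scale family.
nu, loc, scale = stats.t.fit(errors)
print(f"nu={nu:.2f}  loc={loc:.3f}  scale={scale:.3f}")
```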
It is nice to hear that it fits well! But it can't make the kurtosis disappear. If it is in the data, it will still be there.
Kobye, you said you had a "convergence error %". That percent thing looks worrying: if you divide by something that is sometimes very small, you will get big numbers. I would prefer not to use a percentage; it is customary in statistics to use actual values.
You have used a variable you called T, maybe a threshold or something. I would prefer to call it theta, like a statistical parameter.
From your sample of 300 you have done two experiments (or computations), where the responses are Y1 and Y2:
Y1 = g(x1, x2; theta) and Y2 = g(x1, x2; theta+delta)
From this you created a residual: e = Y1 - Y2
(You called that convergence error.)
Miner said:
I would map the extreme highs and extreme lows on the unit square in different colors to see if specific zones cause the extreme values.
I agree! That could give a clear visual impression of whether the "spread" of the residuals varies over the x1, x2 space.
And Kobye agreed:
Well, there's a good chance that the extreme values are caused by the data from the edge of the unit square.
If the kurtosis, skewness or variance turned out to differ over different regions of x1 and x2, that would be good to know, and important for evaluating the quality of the larger group of 3500.
In gamlss it is possible to estimate how the mean, variance, skewness and kurtosis are influenced by variables like x1 and x2. You have a distribution suggestion, the t-distribution or power exponential, that seems to work.
I don't know if your language uses the expression "bathtub curve" for something that is very flat in the middle but rises steeply at the edges. Maybe the variance (or kurtosis) behaves like that here.
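A quick way to look for such a "bathtub" pattern, without gamlss, is to bin the residuals along one input and compare the spread per bin. The data below is synthetic, constructed on purpose so that points near the square's edge have larger spread:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.random((300, 2))                   # hypothetical sampled input vectors
edge = np.minimum(X, 1 - X).min(axis=1)    # distance to the nearest square edge
e = rng.normal(size=300) * np.where(edge < 0.05, 5.0, 1.0)  # synthetic residuals

# Spread of the residuals in bins along x1: roughly flat in the middle,
# larger near 0 and 1, would suggest the edges are the problem.
bins = np.linspace(0.0, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    m = (X[:, 0] >= lo) & (X[:, 0] < hi)
    if m.sum() > 1:
        print(f"x1 in [{lo:.1f}, {hi:.1f}): n={m.sum():3d}  sd={e[m].std():.2f}")
```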
Taguchi design: I am not sure I agree that theta should be kept constant. Or delta. Maybe there is room for a robust design here. More about that later.
Again, a very high quality reply. You've also made it clear to me what Miner meant before. That would sure be interesting to check.
I certainly agree that kurtosis is inherent to the data and is not affected by what type of distribution I decide to use. What I meant was that for a t-distribution (and I'm assuming this is the same for a t location-scale distribution, but I should check), the kurtosis is undefined for small nu, and for my fitted nu < 1 even the mean is undefined. So I'm no longer worried that my sample kurtosis of around 44 is so far from what, e.g., the Logistic distribution would predict. Maybe this is a foolish way of thinking.
I also agree that percentage isn't a great random variable, but let me motivate why I am using it.
My model is basically a numerical code that finds an approximate solution to a boundary value problem involving several partial differential equations. The accuracy of the solution is affected by the discretization of the domain: a finer mesh will cause the solution to converge to the true value, but can severely impact computation time. So my tuning parameter (I previously called it T, but let's continue with theta) is really just the average mesh size.
The model output that I'm considering is the displacement of a structure at a certain point in the domain. In my field of computational mechanics, we define a convergence error as (Y2 - Y1)/Y1; when this convergence error is sufficiently small, typically less than 5%, we consider the solution to be reliable.
So the reason I am reporting a percentage is because this is the established way of demonstrating that the model is converged. I have checked the raw data and there are no obvious suspiciously small values of Y1 that could be causing extreme values in my data.
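The percentage definition above, together with a check for suspiciously small denominators, can be sketched as follows (the paired outputs here are hypothetical stand-ins for the real Y1, Y2):

```python
import numpy as np

# Hypothetical paired outputs: Y1 (chosen mesh) and Y2 (finer mesh).
rng = np.random.default_rng(9)
y1 = rng.uniform(1.0, 2.0, size=300)
y2 = y1 * (1.0 + rng.normal(scale=0.01, size=300))

# Convergence error in %, as defined in computational mechanics practice.
conv_err_pct = 100.0 * (y2 - y1) / y1

# Guard against the small-denominator concern: flag any Y1 near zero.
suspicious = np.abs(y1) < 1e-6
print("runs with suspiciously small Y1:", int(suspicious.sum()))
```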
Ideally I would have a closed-form model and I wouldn't need to worry about convergence or whether I've made a good choice of theta. In that case, as you put it;
Y = g(x1, x2) -- eqn.1
However, my domain is irregular, so it doesn't have a closed-form solution and I must use approximate numerical techniques. As a result, I have
Y = g(x1, x2; theta) -- eqn.2
By trial and error and a good dose of conservatism, I chose a value of theta that should provide me reliable results for practically all of my runs. This will allow me to keep theta fixed and effectively treat eqn. 2 as though it were eqn. 1. What I'm trying to do now, I suppose, is use estimation theory to justify that the vast majority of my model runs are reliable. The data seems to be showing me this, although I am open to other interpretations.
Although... as I mentioned, my model actually has more than 2 inputs -- 14, to be precise. These are uniformly distributed in the unit hypercube.
However, I can't plot a 14-dimensional cube... How could I visually or otherwise inspect which regions of my input space are causing the few outliers (convergence errors >5%)?
This would be interesting to know, because there are only very few of these and if I know in which regions they occur, I could rerun just these models with a finer mesh and virtually eliminate error in my experiment.
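One way to inspect a 14-dimensional input space without plotting it is to compare the outlier runs to the rest numerically, e.g. by their distance to the hypercube boundary, or by per-dimension mean shifts. All the data below is hypothetical; in practice only the 300 sampled runs have known errors:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.random((300, 14))            # hypothetical 14-D inputs for the sample
err = rng.uniform(0.0, 10.0, 300)    # hypothetical |convergence error| in %
outlier = err > 5.0

# Distance of each point to the nearest face of the unit hypercube:
# values near zero mean the point lies on the edge of the design space.
d_edge = np.minimum(X, 1.0 - X).min(axis=1)
print("median edge distance, outliers:", np.median(d_edge[outlier]))
print("median edge distance, others  :", np.median(d_edge[~outlier]))

# Per-dimension mean shift between outliers and the rest can also point
# to which of the 14 inputs are implicated.
shift = X[outlier].mean(axis=0) - X[~outlier].mean(axis=0)
```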