If you understand the response by region, you could sample less densely in the flat regions and more densely in the regions where you see more variability, thus keeping the number of calculations at an acceptable level.
Well, you said that your x-variables were uncorrelated. You need to check this. Print the correlation matrix for x1, x2, ..., x14. It is also always good to do graphs, so a scatter plot matrix (SPLOM) is worth a look. If the variables are uncorrelated, then the two-dimensional plot of (xi, xj) will look like a marginal distribution from the multivariate 14-dimensional distribution. Thus you can plot x1 versus x2 and use different colours for the size of e. Such a graph can be valuable. (It would be nice if you posted it here.) There are many pairs (14*13/2 = 91), but plot, say, five or ten of them.
For each of your 300 pairs of residuals (ei), compute the sample variance s^2. To not let it vary too much, take the log: lv = log(s^2), where lv is "log-variance". Then you will have 300 "observations" of lv.
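In case it helps, here is a minimal sketch of that step (Python/NumPy for illustration; the MATLAB equivalents are var and log). The data here are made up; in practice the pairs would come from your 300 replicate runs.

```python
import numpy as np

# Hypothetical stand-in data: replicate pairs of residuals, one pair per
# row (in the real problem this would be a 300x2 array).
pairs = np.array([[1.0, 3.0],
                  [2.0, 2.5],
                  [-1.0, 4.0]])

s2 = np.var(pairs, axis=1, ddof=1)  # sample variance of each pair
lv = np.log(s2)                     # "log-variance", one value per pair
```

Note ddof=1 to get the sample variance (divide by n-1), matching MATLAB's default var.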
If it is true that the "variability" is constant over the whole region of x1, ..., x14, then you expect lv not to be influenced by the x-variables. So estimate the regression model:
lv = b0 + b1*x1 + b2*x2 + ... + b14*x14 + b11*x1^2 + b22*x2^2 + ... + b1414*x14^2 + residual
and plot the curve for: lv = b0 + b1*x1 + b11*x1^2
You expect the bii to be positive, so that the curve bends upwards towards the edges. If any of the b-terms is significant (except b0), then you have evidence that the "variability" is not constant, and thus that there is something to gain.
Omitted from the above model are all 91 pairwise interaction terms (like b37*x3*x7 + ... + b59*x5*x9 + ...). If a main effect or a squared effect is significant, then include the corresponding interaction effects.
(If you think it is more natural, you can centre the variables, like x1new = x1 - 0.5, since 0.5 is the centre and mean of the design.)
The above regression model, the "response surface" as it is often called, can be thought of as a multidimensional Taylor series expansion, where the bij coefficients estimate the derivatives.
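To make the fitting step concrete, here is a minimal sketch of the quadratic response surface (Python/NumPy for illustration; in MATLAB one would use regress or fitlm). Everything here is synthetic stand-in data, with a known quadratic bowl in x1 just so there is something to recover:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in design: n points in the 14-dimensional unit
# hypercube (in the real problem n = 300).
n, p = 300, 14
X = rng.uniform(0.0, 1.0, size=(n, p))

# Stand-in "true" log-variance: flat in all variables except a quadratic
# bowl in x1, plus a little noise.
lv = 1.0 + 2.0 * (X[:, 0] - 0.5) ** 2 + 0.01 * rng.standard_normal(n)

# Design matrix: intercept, 14 linear terms, 14 squared terms
# (interactions omitted, as in the model above).
D = np.column_stack([np.ones(n), X, X ** 2])

# Least-squares estimates: b[0]=b0, b[1..14]=b1..b14, b[15..28]=b11..b1414
b, *_ = np.linalg.lstsq(D, lv, rcond=None)

# b[1 + p] is the coefficient on x1^2; with this synthetic lv it should
# come out close to 2 (positive, so the curve bends upwards).
```

In a real analysis you would of course also want the standard errors and p-values for each coefficient (fitlm gives these directly), not just the point estimates.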
One could try to reduce this to a one-dimensional problem by computing the distance from the centre. Let r be a point's distance from the centre: r^2 = (x1-0.5)^2 + (x2-0.5)^2 + ... + (x14-0.5)^2. There's a good chance that the extreme values are caused by data from the edge of the hypercube.
Then plot the residuals ei versus r, si^2 versus r, and lvi versus r, and run the regression: lv = a0 + a1*r + a2*r^2 + residual
You can also do a histogram for values r > 0.75, say, and another for r < 0.75. Or slice it up, say for every r in 0-0.10, 0.10-0.20, ..., and do boxplots and histograms for each slice.
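The distance-from-centre reduction can be sketched like this (again Python/NumPy for illustration, with synthetic stand-in data; here the fake lv grows quadratically with r so the regression has something to find). The quantile-based bands at the end are one way to slice r for boxplots:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in design in the 14-dimensional unit hypercube
n, p = 300, 14
X = rng.uniform(0.0, 1.0, size=(n, p))

# Distance of each design point from the centre (0.5, ..., 0.5)
r = np.sqrt(np.sum((X - 0.5) ** 2, axis=1))

# Stand-in log-variance that grows quadratically with r
lv = 0.5 + 3.0 * r ** 2 + 0.01 * rng.standard_normal(n)

# One-dimensional regression: lv = a0 + a1*r + a2*r^2 + residual
D = np.column_stack([np.ones(n), r, r ** 2])
a, *_ = np.linalg.lstsq(D, lv, rcond=None)

# Slice r into four equal-count bands for boxplots/histograms
bands = np.digitize(r, np.quantile(r, [0.25, 0.5, 0.75]))
```

Note that in 14 dimensions most points sit far from the centre (the typical r is above 1), so quantile-based slices are safer than fixed cut-offs like 0.75.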
Still, I don't accept taking percentages of residuals, like ei/Yi. (You can do that at the last step, as a description.) Suppose the variation is not constant for all Y.
It can be good to plot ei versus Yi and look for any pattern; increasing variation for larger Y would indicate heteroscedasticity. A standard check is to plot lv versus Y for the same purpose.
I agree. But right now Kobye has got 300 pairs that are supposed to say something about the 3500 values (the "population"). I guess she is not going to be allowed, or get the money, to do additional runs right now. I guess what is needed first is evidence for where the results are accurate and where they are not. Later on, maybe other methods can be used, including varying the theta parameter.
A more modern alternative to the response surface would be non-parametric smoothing, like generalised additive models (GAMs). That would estimate a smoothed surface over the 14 x-variables. But the response surface is a good start.
I think it is time that Kobye tells us what software she is using. Is it MATLAB? And thereby what restrictions there are on what can be computed.
I am using MATLAB and Abaqus. I use MATLAB to generate a python script, then call Abaqus from the command line with this python script. Abaqus then runs (this is the expensive part, it's about 4 minutes per run with theta=0.6, 20 minutes per run with theta=0.4). When Abaqus is done running, I use MATLAB to read the Abaqus output file and extract the desired displacement data.
The MATLAB parts are very cheap, the Abaqus part is computationally expensive (my model is nonlinear and requires a fair few iterations to solve).
I am currently creating the scatter plot matrix, using MATLAB. I will post the results.
EDIT: Correction, I actually have 13 input x-variables, not 14.
Okay, I created the scatter plot matrix. Each of the scatter plots looks like the uniform marginal distribution from the multivariate 13-dimensional distribution. It therefore does indeed appear that my x-variables are uncorrelated!
The scatter plots aren't nicely scaled, but the boundaries are clear from the data themselves.
I would like to explain the straight lines along the bottom and right. These are the convergence error row and column, which I included in my SPLOM. I thought maybe they could tell us something interesting?
Also, I've attached the raw data including xi values in a text file.
Also, note that my range of values for each xi don't range from 0-1 as I've been using as an example so far. That's clear given the axis limits on the scatter plot matrix. But they are still uniformly distributed in the hypercube. They can easily be normalized to the unit hypercube.
I will now proceed to try and plot different values of e in different colors.
Last edited by Kobye; 05-01-2014 at 09:24 AM.
OK, that looks nice.
But it would also be interesting to see, say, the plot of x1 versus x3 where you have marked different colours for what I called the residual ("accuracy"). Dark blue: ei < -40; light blue: -40 < ei < -10; pale blue: -10 < ei < 0; ...; red: ei > 50.
Or colour them by the absolute value of the residual, so that large deviations scream out in red where the model deviates severely. Maybe that happens along the edges.
EDIT 2: I've attached a pdf of the color-coded scatterplot matrix image to this post.
Took me a little while as I had to write a custom MATLAB code for this, but the results are here.
This is the key:
Color | Absolute Error
Red | > 30%
Orange | 20% - 30%
Yellow | 10% - 20%
Green | 5% - 10%
Light Blue | 1% - 5%
Dark Blue | < 1%
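For anyone reproducing the plot, the colour assignment from the key above can be written as a small helper (Python for illustration; the function name is my own, and I'm treating the errors as fractions, e.g. 0.12 = 12%):

```python
def error_colour(abs_err):
    """Map an absolute relative error (as a fraction) to the colour key
    used in the colour-coded SPLOM."""
    if abs_err > 0.30:
        return "red"
    elif abs_err > 0.20:
        return "orange"
    elif abs_err > 0.10:
        return "yellow"
    elif abs_err > 0.05:
        return "green"
    elif abs_err > 0.01:
        return "light blue"
    else:
        return "dark blue"
```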
Here is the image:
Full size, high quality image is attached or available at this link:
http://postimg.org/image/w3tjbaogr/full/
I've been thinking about it, and I think you are right that I should just use the residual ei and not ei/Yi. I don't think this will eliminate the outliers, but I agree it's better practice.
Let's say I report that the displacement error is on average 0.002 mm with a variance of 0.0001 mm^2. I can always then say that, normalized by the mean displacement, this represents a 0.5% mean error.
Interpretation:
The only trend that I can see from these scatter plots is contained in column 11.
For all the scatter plots in this column, the red dots occur in the region near the low end of the range of x11.
If I were to neglect the ei's for all x11<0.25 (1/4 of its normalized range of 0<x11<1), I would eliminate the red dots from the data.
Is it reasonable to suppose that if I rerun all of the data with x11 < 0.25 with a finer mesh (that's a quarter of the total data), I should be able to suppress the really grave errors (>30%) from my data?
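Selecting those runs is just a boolean mask on the normalized design matrix. A quick sketch (Python/NumPy for illustration, with made-up data; column index 10 is x11 in zero-based indexing, and with x11 uniform on [0, 1] roughly a quarter of the runs get flagged):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in: 300 runs, 13 normalized input variables
X = rng.uniform(0.0, 1.0, size=(300, 13))

rerun = X[:, 10] < 0.25       # runs flagged for a finer-mesh rerun
n_rerun = int(np.sum(rerun))
frac = n_rerun / X.shape[0]   # should be roughly 0.25
```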
Last edited by Kobye; 05-01-2014 at 03:19 PM.
Any chance you could attach the image as a graphic. File sharing sites are blocked here.
I've attached it. Sorry about that!
Hmm. It's trying to direct me to an online storage site that is blocked.
It would be interesting to see what the regression estimates are from the response surface model.
Let's not over-interpret the SPLOM. After all, we expect something "unusual" in 5% of cases just by chance, and there are many images. Maybe it was a little naive of me to believe that one could find anything in that kind of graph, since there are so many x-variables. With, say, four variables it would have been clearer. But it is always good to have a graphical look at the data too! Now the regression seems more promising.
I've noticed something strange about a few of the outliers. It is related to the way MATLAB preprocesses my model. There may be an error in my code for just the outlier meshes.
I am going to investigate this and see if it holds up and affects any other observations.
I will report back in the next day or so.