Review Methodology


I'm about to embark on some multivariate regression analysis for my company, but it's been a while since I've used these methods. The outputs will eventually be audited and the work is likely to pass to others once I've finished, so I thought it best to outline my methods as simply as possible using an example with a single independent variable (Volume) and a single dependent variable (Unit Cost). I will be adding more independent variables for the real analysis.

Essentially, I want to make sure my choice of statistics, my interpretation of the results, and my overall method are sound. I would really appreciate some critical analysis of the attached from some 'experts' to ensure it makes sense and is statistically sound.

Many thanks if you're willing to read,


Ambassador to the humans
Can you post what it contains? Or describe it? I just don't really trust downloading random attachments from people who are new to the forum.
Sure, it's an Excel file (no macros) with some analysis using the Excel LINEST function, charts, and an explanation of what I'm doing. I understand your concern and give you my word it is not dodgy in any way. If you can recommend a safer way to display the contents, I'm happy to oblige.
This is a straight copy/paste from Excel, so not very legible...

Method to determine whether derived model(s) are statistically significant. Pressing F9 generates a new random model, and the stats in 9 below provide the statistical tests for that model (yellow cells can be changed).
Essentially this gives a pictorial view of what the data/model looks like on the charts for the derived statistical test figures. The same method can be used to add additional structural factors in an attempt to improve the model's accuracy.
1. Define an arbitrary model of the form Unit Cost = m * V^c, where m and c are constants (the table below uses m = 1000, c = -0.5). The unit cost is the dependent variable (the response being predicted) and volume is the only independent variable (the predictor) in this example.
2. The base model can now be used to create arbitrary data of the same form, but with a random variation to simulate typical project variance. Introduce a 'random' variance for each data point from the model and re-plot on the chart (press F9 to randomise the data).
Effectively this is doing things in reverse to generate dummy data. In the real world we start with real data and generate a best-fit model. The methods used here can be used to test the validity of the model and show how the values look visually.
Base model parameters (yellow cells can be changed):

    m = 1000    c = -0.5    Randomize % = 90.00%

Reducing the Randomize factor to 0% will adhere to the base model. The fixed base model on the charts is defined by 12 data points which fit the above equation; random variation is then introduced for each of the 12 data points on the chart to create dummy data with the same underlying base model.

    Volume   Unit Cost   Randomized UC
    100      100.00      168.40
    200       70.71      106.99
    250       63.25       74.63
    500       44.72       15.74
    750       36.51       25.34
    1000      31.62        3.45
    1250      28.28       31.34
    1500      25.82       12.11
    1750      23.90       37.67
    2000      22.36       40.07
    2250      21.08       24.50
    2500      20.00        2.36
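The data-generation in steps 1 and 2 can be sketched in Python (a minimal sketch using numpy; the exact meaning of the sheet's "Randomize %" is my assumption, read here as scaling each point by a uniform factor within +/- 90% of the model value):

```python
import numpy as np

rng = np.random.default_rng(0)

m, c = 1000.0, -0.5          # base model: Unit Cost = m * V**c
randomize = 0.90             # the sheet's "Randomize %" (assumed interpretation)

volumes = np.array([100, 200, 250, 500, 750, 1000,
                    1250, 1500, 1750, 2000, 2250, 2500], dtype=float)

base_uc = m * volumes**c     # e.g. V=100 -> 1000/sqrt(100) = 100.0

# Scale each point by a uniform factor in [1 - randomize, 1 + randomize];
# pressing F9 in the sheet corresponds to re-drawing this noise.
noise = rng.uniform(1 - randomize, 1 + randomize, size=volumes.size)
noisy_uc = base_uc * noise

print(np.round(base_uc[:3], 2))   # base model values: 100.0, 70.71, 63.25, ...
```

Setting `randomize = 0` reproduces the base model exactly, matching the sheet's note.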

3. Transform the data to allow linear regression by taking the logarithm of both Volume and Unit Cost, and identify the key stats through linear regression (either log or ln can be used).
The top chart represents the base data; the chart below represents the transformed data, which is more linear.
4. Use the Excel function LINEST to calculate the key regression statistics for both the fixed model (black) and the random variation (pink).
5. The key stats output by the LINEST function in Excel are outlined below.

Key Stats Key (layout of each LINEST block; with one predictor the columns are slope and intercept):

    mn            mn-1        (coefficients; here the slope and the intercept)
    sen           sen-1       (standard errors se1, se2, ..., sen and seb)
    R2            sey
    F             df
    ssreg         ssresid

Key Stats - Base Model (transformed):

    -0.50          6.91
     0.00          0.00       (note error = 0 and R2 = 1, as the base model is a perfect fit)
     1.00          0.00
     2.38193E+31  10.00
     3.07          0.00

Key Stats - Randomized Model (transformed):

    -0.80    8.55
     0.28    1.91
     0.44    0.99
     7.942  10.00
     7.80    9.82

Transformed data:

    Ln Volume   Ln Unit Cost   Ln Randomized UC
    4.61        4.61           5.13
    5.30        4.26           4.67
    5.52        4.15           4.31
    6.21        3.80           2.76
    6.62        3.60           3.23
    6.91        3.45           1.24
    7.13        3.34           3.44
    7.31        3.25           2.49
    7.47        3.17           3.63
    7.60        3.11           3.69
    7.72        3.05           3.20
    7.82        3.00           0.86
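Steps 3-5 can be checked outside Excel with a short sketch (numpy assumed; `linest_stats` is a hypothetical helper name, not an Excel function) that reproduces the LINEST outputs for one predictor:

```python
import numpy as np

def linest_stats(x, y):
    """Reproduce the LINEST(y, x, TRUE, TRUE) block for a single predictor."""
    n = len(x)
    X = np.column_stack([x, np.ones(n)])         # slope column, intercept column
    (slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = slope * x + intercept
    ssresid = np.sum((y - fitted) ** 2)          # residual sum of squares
    ssreg = np.sum((fitted - y.mean()) ** 2)     # regression sum of squares
    df = n - 2                                   # n points minus 2 coefficients
    r2 = ssreg / (ssreg + ssresid)
    sey = np.sqrt(ssresid / df)                  # standard error of the y estimate
    sxx = np.sum((x - x.mean()) ** 2)
    se_slope = sey / np.sqrt(sxx)
    se_intercept = sey * np.sqrt(1 / n + x.mean() ** 2 / sxx)
    F = (ssreg / 1) / (ssresid / df)             # 1 regression df for one predictor
    return {"m": slope, "b": intercept, "se_m": se_slope, "se_b": se_intercept,
            "r2": r2, "sey": sey, "F": F, "df": df,
            "ssreg": ssreg, "ssresid": ssresid}

# Step 3: in log-log space the power law becomes a line:
# ln(UC) = ln(m) + c * ln(V)
volumes = np.array([100, 200, 250, 500, 750, 1000,
                    1250, 1500, 1750, 2000, 2250, 2500], dtype=float)
unit_cost = 1000.0 * volumes ** -0.5             # noise-free base model

stats = linest_stats(np.log(volumes), np.log(unit_cost))
# slope ≈ -0.50 (= c), intercept ≈ 6.91 (= ln 1000), R2 ≈ 1 for the base model
print(round(stats["m"], 2), round(stats["b"], 2), round(stats["r2"], 2))
```

Running the same function on the log of the randomized Unit Cost column should reproduce the "Randomized Model" block above.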

6. The value R2 quantifies goodness of fit. It is a fraction between 0.0 and 1.0 and has no units. Higher values indicate that the model fits the data better. E.g. an R2 of 0.7 can be interpreted as 70% of the variation in the data being explained by the independent variables used to construct the model, with the remaining 30% unexplained. Don't make the mistake of using R2 as the main criterion for whether a fit is reasonable. A high R2 tells you that the curve came very close to the points, but that doesn't mean the fit is "good" in other ways: the best-fit values of the parameters may make no sense (for example, localised clusters), or the confidence intervals may be very wide. As can be seen in this example, sometimes low R2 values still produce good models.
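R2 can be computed directly from the two sums of squares LINEST returns; plugging in the randomized-model figures above confirms the value in the table:

```python
# ssreg and ssresid taken from the randomized-model LINEST output above
ssreg, ssresid = 7.80, 9.82
r2 = ssreg / (ssreg + ssresid)   # explained fraction of the total variation
print(round(r2, 2))              # 0.44, matching the R2 in the table
```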
7. The F statistic is used to determine whether the observed relationship between the dependent and independent variables could have occurred by chance.
The significance (Significance F) is the probability of a higher F value occurring by chance; therefore the lower it is, the better, as the relationship is less likely to have occurred by chance.
8. A critical F stat can be calculated using alpha and the degrees of freedom in the data being analysed. If the F stat is greater than the critical F stat value,
then the model is statistically significant within the desired probability acceptance.

    Degrees of freedom:   const = TRUE   const = FALSE
    v1 =                  1.00           2.00
    v2 =                  10.00          10.00
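The critical F and Significance F in steps 7-8 can be reproduced with scipy (a sketch; `scipy.stats.f.ppf` and `scipy.stats.f.sf` play the roles of Excel's F.INV.RT and F.DIST.RT):

```python
from scipy.stats import f

alpha = 0.05
v1, v2 = 1, 10                 # regression df and residual df (const = TRUE case)
f_stat = 7.942                 # F from the randomized model above

critical_f = f.ppf(1 - alpha, v1, v2)   # critical F at the chosen alpha
significance = f.sf(f_stat, v1, v2)     # probability of a higher F by chance

print(round(critical_f, 3))    # 4.965
print(significance)            # ≈ 0.018, the Significance F in section 9
```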

9. A good model will have a Significance F lower than alpha, an F stat greater than the critical F, and the higher the R2 the better.

    Significance F   0.018
    Critical F       4.965    (critical F stat based on an alpha of 0.05, i.e. calculated so there is a
                               95% probability that the relationship is not due to chance)
    F stat           7.942
    R2               0.44

A good model is where F stat > critical F; a higher R2 value is better and a lower Significance F is better. Based on an alpha of 0.05, the Significance F here is lower than alpha.
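The acceptance criteria in 9 can be bundled into a single check (a sketch; the function name is my own, and the F-versus-critical-F and Significance-F-versus-alpha comparisons are equivalent tests, so both passing is expected when either does):

```python
def model_is_significant(f_stat, critical_f, significance_f, alpha):
    """Apply the two equivalent acceptance tests from section 9."""
    return f_stat > critical_f and significance_f < alpha

# Figures from section 9:
print(model_is_significant(7.942, 4.965, 0.018, 0.05))   # True
```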

The alpha level, the significance level of a hypothesis test, is an upper bound on the Type I error rate; if the p-value (Significance F) of the test is less than the alpha level, we reject the null hypothesis.
Critical values are used in fixed-level testing. If the test statistic falls within the rejection region, we reject the null hypothesis.
Degrees of freedom tell us how many pieces of information we need in order to completely define the situation. For example, if we know that four numbers between 0 and 1 must sum to 1, we have three degrees of freedom: if we know three of the values we can find the fourth.
E.g. consider the hypothesis test as a trial against the null hypothesis, with the data as the evidence. You assume the null is true and try to show that it is not. After finding the test statistic and p-value, if the p-value is less than or equal to the significance level of the test, we reject the null and conclude the alternative hypothesis is supported. If the p-value is greater than the significance level, we fail to reject the null hypothesis and conclude it is plausible. Note that we cannot conclude the null hypothesis is true, just that it is plausible.
10. To apply this method, run the modelling with all the data and compare the stats in 9.
Remove outliers/errors in the data and retest. If the tests in 9 do not improve, perhaps move to the next hierarchy, where the data is more specific to the material/solution. If the tests still do not improve, either the data is not sufficient/not accurate/requires other factors, or increase alpha (cell G52) to allow less confidence and greater variability/errors. Alternatively, reject the model and use the mean or an alternative data source.