Here is a description:

I have two ovens to create parts/samples in: Oven A and Oven B.

Because Oven B is far cheaper, I want to determine that the distributions of the resulting parts made with each oven are ‘not different’ to within a determined range. So, I am primarily concerned with the study power, or Beta error, to ascertain that the output is ‘equivalent’. The intended study power is 0.9, or a Beta error of 0.1 (the probability that the ovens are in fact putting out parts that are ‘different’ when the testing shows they are the same; primarily, that Oven B is making inferior parts and the test is not able to detect the difference). In other words, I do not really care about alpha error here; I am more concerned with reaching an inaccurate conclusion due to the Beta error.

Samples created in each of the ovens are measured for surface strength property.

In the sample data, Oven A makes parts with a mean strength value of 1000; Oven B makes parts with a mean strength of 900. Both have standard deviations of about 275.

(µ±σ, A: 1000±275; B: 900±275)

I want to detect if there is a true difference in means of 200 between the ovens, which would indicate that the parts made by Oven B may be functionally inferior to those from Oven A (any difference less than 200, or more precisely an Oven B mean that stays above 800, is not considered important).

Using the calculator here:

http://powerandsamplesize.com/Calculators/Compare-2-Means/2-Sample-Equivalence

and setting it to calculate Sample Size (inputting 1000, 900, 275, 200, 1 into the respective boxes), I get a sample size of 164 per group for this comparison, for a Study Power of 0.9 or Beta error of 0.1 (I left the alpha error rate at 5%). I will refer to this as ‘164 data points per group’ for clarity; ‘samples’ refers to the actual test samples generated from the ovens.
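For anyone who wants to check the 164 without the web page, here is a minimal Python sketch. It assumes the calculator uses the usual normal-approximation sample size formula for a two-sample equivalence (two one-sided tests) design; the function name and argument names are my own, not the calculator's.

```python
from math import ceil
from statistics import NormalDist


def equivalence_sample_size(mu_a, mu_b, sd, delta, kappa=1.0,
                            alpha=0.05, beta=0.10):
    """Per-group size for group B (group A gets kappa times as many).

    Assumed formula (normal approximation, equivalence design):
        n_B = (1 + 1/kappa) * (sd * (z_{1-alpha} + z_{1-beta/2})
                               / (|mu_a - mu_b| - delta))**2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    z_beta = std_normal.inv_cdf(1 - beta / 2)
    margin_gap = abs(mu_a - mu_b) - delta
    return (1 + 1 / kappa) * (sd * (z_alpha + z_beta) / margin_gap) ** 2


# Inputs from the post: means 1000 and 900, SD 275, margin 200, ratio 1.
n_b = equivalence_sample_size(1000, 900, 275, 200, kappa=1.0)
print(ceil(n_b))  # rounds up to 164 per group
```

Under that assumption the unrounded value is a little under 164, which rounds up to the figure I quoted.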

However, the group running the tests has taken the ‘shortcut’ of testing each sample twice (one measurement on each side of a sample) to generate the required number of data points, on the presumption that those measurements are ‘not correlated’, i.e. ‘independent’ values. In fact, depending on the test run and sample mixture/composition, the Pearson’s R-squared correlations between the two measurements made on each test sample range from 0.5 to 0.7 (this is the R-squared value obtained, separately for Oven A and for Oven B, when you plot the first-side measurement against the second-side measurement for each sample from that oven).

Thus, instead of creating 164 samples in each oven per group, they made only 82 samples, and measured each sample twice (once on each side) to generate the required 164 data points for Oven A and Oven B.

If the correlations between the two measurements made on each sample were <0.1, then they’d be very close to being ‘uncorrelated’ and ‘independent’. But when the measurements made on each sample have moderate to high correlation, the effective sample size, or true sample size, is less. I want to estimate by how much.

For example, if we assumed that the 2 measurements on each sample were 100% correlated (R-squared = 1.0), then each of the data points for that sample would have the exact same value. In effect, creating only 82 samples per oven instead of 164 ‘undersamples’ the data by one-half. That is the same as only generating 82 actual samples and just ‘repeating’ each data point to get to the proper 164 samples needed for a Beta error of ‘0.1’. Pretty sure this is ‘cheating’, and for fully correlated measurements, I still only have 82 independent data points.

If I put ‘82’ into the calculator linked above and compute what the new Study Power is, I get an actual study power of about 0.5 (0.5066, specifically). (This is updating the page to compute ‘Power’ and inputting the same values, but using ‘82’ for sample size).
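The same check can be sketched in Python. This assumes the calculator uses the standard normal-approximation power formula for a two-sample equivalence test; the function below is my own illustration, not the calculator's code, so small differences in the last decimals are to be expected.

```python
from math import sqrt
from statistics import NormalDist


def equivalence_power(mu_a, mu_b, sd, delta, n_a, n_b, alpha=0.05):
    """Approximate power of a two-sample equivalence test.

    Assumed formula (normal approximation):
        z     = (|mu_a - mu_b| - delta) / (sd * sqrt(1/n_a + 1/n_b))
        power = 2 * (Phi(z - z_{1-alpha}) + Phi(-z - z_{1-alpha})) - 1
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    z = (abs(mu_a - mu_b) - delta) / (sd * sqrt(1 / n_a + 1 / n_b))
    power = 2 * (std_normal.cdf(z - z_alpha)
                 + std_normal.cdf(-z - z_alpha)) - 1
    # The approximation can dip below zero for very small n; clamp it.
    return max(power, 0.0)


# Same inputs as before, first with the planned 164, then with only 82.
power_164 = equivalence_power(1000, 900, 275, 200, 164, 164)
power_82 = equivalence_power(1000, 900, 275, 200, 82, 82)
print(f"n = 164 per group: power ~ {power_164:.3f}")
print(f"n = 82 per group:  power ~ {power_82:.3f}")
```

With these inputs the sketch gives roughly 0.90 for 164 per group and roughly 0.51 for 82 per group, in line with the calculator values above.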

Thus, for completely correlated data points (where the data points from each sample are 100% correlated), is it correct that the true study power is now a literal ‘coin-toss’? Note that we are talking about correlated data sampled WITHIN each of the two groups, NOT a ‘paired’ study design where there is correlation between pairs for the groups.

What I am trying to determine is: What is the relation between the R-squared correlation on the measurement pairs made within each of the groups, and how does it impact the Effective Sample Size and overall ‘corrected’ Study Power or Beta error?

So, given the easily measured correlations (Pearson R-sq) between the data point pairs from each group, can I determine the Effective Sample Size and corrected Study Power for these data?

I have not been able to find this on the Web anywhere, but I recall from somewhere back in my old college stats that a sample size correction follows something like EffSS = SS*(1/(1+Rsq)), with Rsq being the Pearson’s R-squared coefficient and SS the total number of data points (164 here).

Is this correct? Or a ‘close’ estimate? Is there a real formula or citation for this somewhere?

Using this approximation for correlated data point pairs of 0.5, 0.6 and 0.7, I would then get Effective Sample Sizes of 109.3, 102.5 and 96.4, respectively (using only 82 oven samples per group and the ‘two measurement’ method, instead of the properly powered 164 ‘independent’ samples). This would then yield approximate (or actual) study powers of 0.7039, 0.6629 and 0.6218, as opposed to the intended study power of 0.9. (I entered non-integer sample sizes, to the first decimal, into the calculator for each Effective Sample Size, even though non-integers would not make sense if you used ‘real’ samples.)
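To make that arithmetic reproducible, here is a minimal Python sketch. It takes my recalled correction, EffSS = SS / (1 + Rsq), exactly as stated (whether that is the right correction is the open question), and it assumes the calculator uses the standard normal-approximation power formula for a two-sample equivalence test. All function names are my own.

```python
from math import sqrt
from statistics import NormalDist


def effective_sample_size(n_points, r_squared):
    # Recalled, unverified approximation from this post:
    # EffSS = SS / (1 + Rsq), where SS is the number of data points.
    return n_points / (1 + r_squared)


def equivalence_power(mu_a, mu_b, sd, delta, n_per_group, alpha=0.05):
    # Assumed normal-approximation power for equal group sizes.
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    z = (abs(mu_a - mu_b) - delta) / (sd * sqrt(2 / n_per_group))
    power = 2 * (std_normal.cdf(z - z_alpha)
                 + std_normal.cdf(-z - z_alpha)) - 1
    return max(power, 0.0)


# 164 data points per group came from 82 samples measured twice each.
results = {}
for r_sq in (0.5, 0.6, 0.7, 1.0):
    n_eff = effective_sample_size(164, r_sq)
    results[r_sq] = (n_eff, equivalence_power(1000, 900, 275, 200, n_eff))

for r_sq, (n_eff, power) in results.items():
    print(f"Rsq = {r_sq:.1f}: EffSS ~ {n_eff:.1f}, power ~ {power:.3f}")
```

The sketch lands within about 0.001 of the calculator powers quoted above, and at Rsq = 1.0 it collapses to 82 effective data points, i.e. the ‘coin-toss’ case.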

Please note: I do know how to properly run this type of study with a multiple-point measurement used as a covariate; but that is not what occurred with these data, and I am simply trying to explain correctly what the actual outcome is and impact is on the already acquired data.

Any help here is appreciated!!!