# Sample size for equivalency testing with multiple 'samples' per patient

#### skfrabel

##### New Member
Hello everyone,

I have a question regarding sample size calculation for my research study and was not able to find an answer elsewhere.

The study compares two imaging methods and should prove that one method is equivalent to the other method in terms of geometric measurements.

The question: Each patient will provide multiple available 'samples'. Let's say the study investigates the accuracy of height measurements performed on vertebral bodies of the thoracic spine with both imaging methods (paired data). The hypothesis is that both methods will be equal. Each patient has 12 thoracic vertebral bodies, so in theory 12 measurements could be made per patient.

In this case, is it valid to consider each vertebral body as a sample? To me, it makes no sense to consider each patient as a single sample when every patient yields so many 'samples'.

Thank you very much for your help!

#### katxt

##### Well-Known Member
What will be your criteria for "equivalence"?

#### skfrabel

##### New Member
The equivalency margin will be defined as ±1 mm.

#### katxt

##### Well-Known Member
What does that mean statistically?
That you are 95% sure that the mean difference between the methods is less than 1 mm? or ...
That the difference on any occasion will be less than 1 mm, 95% of the time? or ...
Something else?

#### skfrabel

##### New Member
I am 95% sure that the mean difference of all measurements between the methods is less than 1 mm.

For example, if I perform a sample size calculation, defining an alpha of 0.025 and power of 0.95 with an SD of 0.7 and equivalence limit of 1.0, I end up with a required sample size of 16 per group.

However, I can perform several measurements on different anatomically equivalent 'samples' per patient. If I take 16 patients, each having 12 anatomically equivalent samples, I will perform 192 measurements per group.

Is it reasonable to consider each of my measurements (in my case thoracic vertebral bodies of the spine) as a sample or is that not valid?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
The anatomic bodies within a patient are not independent. You need to control for them being clustered in patients, so you will need to run a multilevel model for your hypothesis. The reason is that they are correlated: similar bone densities within a patient, the same radiographic angle, the same appearance of contrast given the positioning of the patient and machine, etc. You will likely have to use an unstructured covariance structure, since AR1 and similar structures seem irrelevant here.
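To put a rough number on how much the clustering could matter, here is a minimal sketch using the standard design effect for equal-sized clusters, 1 + (m - 1) × ICC. The function names and the ICC values are hypothetical; nothing in the thread tells us what the within-patient correlation actually is.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_patients: int, cluster_size: int, icc: float) -> float:
    """Number of independent observations the clustered data are worth."""
    total = n_patients * cluster_size
    return total / design_effect(cluster_size, icc)


# 16 patients x 12 vertebrae, for a range of hypothetical ICC values
for icc in (0.0, 0.1, 0.5, 1.0):
    print(icc, round(effective_sample_size(16, 12, icc), 1))
```

With an ICC of 0 the 192 measurements count as 192 independent samples; with an ICC of 1 they are worth only the 16 patients. The honest answer sits somewhere in between and depends on a correlation that has to be estimated or assumed.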

How are you currently calculating power? It seems like a simulation-based sample size calculation would be needed. Also, I did not follow your comment about possibly repeating the process for other measures. Also, if neither of the approaches is the gold standard, they may be equivalent but both wrong. Lastly, who is doing the readings? If the readings are made by people rather than a machine, will you also need to control for readings being clustered within adjudicators/readers?
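A simulation-based calculation could look something like the sketch below. It is only an illustration under stated assumptions: differences are generated as a true mean difference plus a patient-level shift plus vertebra-level noise, each patient is reduced to one mean difference (a simple cluster-level analysis rather than a full multilevel model), and the test uses a normal quantile instead of a t quantile, which is slightly optimistic for 16 patients. The function name and the SD values are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def simulate_power(n_patients, n_vertebrae, true_diff, sd_patient, sd_within,
                   margin=1.0, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated studies in which equivalence is declared."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)  # normal approximation (assumption)
    successes = 0
    for _ in range(n_sims):
        patient_means = []
        for _ in range(n_patients):
            # Patient-level shift in the method difference (source of clustering)
            shift = rng.gauss(0.0, sd_patient)
            diffs = [true_diff + shift + rng.gauss(0.0, sd_within)
                     for _ in range(n_vertebrae)]
            patient_means.append(mean(diffs))
        centre = mean(patient_means)
        se = stdev(patient_means) / n_patients ** 0.5
        lower, upper = centre - z * se, centre + z * se
        # Equivalence is declared when the whole interval is inside the margin
        if -margin < lower and upper < margin:
            successes += 1
    return successes / n_sims


# Hypothetical inputs: no true difference, then a true difference at the margin
print(simulate_power(16, 12, true_diff=0.0, sd_patient=0.3, sd_within=0.7))
print(simulate_power(16, 12, true_diff=1.0, sd_patient=0.3, sd_within=0.7))
```

The second call is a sanity check: when the true difference sits exactly on the margin, the share of "equivalent" verdicts should be close to alpha. Reader effects could be added as another random shift in the same way.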

#### katxt

##### Well-Known Member
> I am 95% sure that the mean difference of all measurements between the methods is less than 1 mm.

Sorry. I misread your reply. The last post referred to the other alternative.
Basically what you are looking for is for a large enough sample to ensure that a 95% confidence interval for the mean difference lies entirely between -1 and 1. (If you look up TOST, you will find that you can actually use a 90% CI.) The sample size needed depends on how close to 0 the mean difference is.
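To illustrate how the required sample size depends on the true mean difference, here is a rough sketch based on the usual normal-approximation power of TOST for a mean of independent paired differences. The function names are hypothetical, the SD of 0.7 is assumed to be the SD of the paired differences, and clustering within patients is ignored.

```python
from statistics import NormalDist


def tost_power(n, true_diff, sd_diff, margin=1.0, alpha=0.05):
    """Approximate TOST power for the mean of n independent differences."""
    nd = NormalDist()
    se = sd_diff / n ** 0.5
    z = nd.inv_cdf(1 - alpha)
    power = (nd.cdf((margin - true_diff) / se - z)
             + nd.cdf((margin + true_diff) / se - z) - 1.0)
    return max(0.0, power)


def smallest_n(true_diff, sd_diff, target=0.95, margin=1.0, alpha=0.05):
    """Smallest n whose approximate power reaches the target."""
    if abs(true_diff) >= margin:
        raise ValueError("true difference must lie strictly inside the margin")
    n = 2
    while tost_power(n, true_diff, sd_diff, margin, alpha) < target:
        n += 1
    return n


# Required n grows quickly as the true difference approaches the margin
for delta in (0.0, 0.25, 0.5, 0.75):
    print(delta, smallest_n(delta, sd_diff=0.7))
```

The closer the assumed true difference is to 1 mm, the less room the interval has, so the required n climbs steeply.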
> Is it reasonable to consider each of my measurements (in my case thoracic vertebral bodies of the spine) as a sample or is that not valid?

hlsmith's comments about the dependence of the measurements within one subject are fair, but between-subject variation is virtually eliminated if you only deal with differences. You have to use some fairly fast statistical footwork to think up ways that the differences from the same subject are dependent (although I'm sure some suggestions will be forthcoming).
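One way to make this argument concrete is a simple additive model. This is an assumed model for illustration only, and the symbols are not from the thread. Let $y_{ijm}$ be the height of vertebra $j$ in patient $i$ measured with method $m \in \{A, B\}$:

$$y_{ijm} = \mu + a_i + b_{ij} + \beta_m + c_{im} + \varepsilon_{ijm}$$

Here $a_i$ is the patient effect, $b_{ij}$ the vertebra effect, $\beta_m$ the method bias, $c_{im}$ a patient-by-method interaction and $\varepsilon_{ijm}$ measurement noise. Taking the paired difference gives

$$d_{ij} = y_{ijA} - y_{ijB} = (\beta_A - \beta_B) + (c_{iA} - c_{iB}) + (\varepsilon_{ijA} - \varepsilon_{ijB})$$

The patient and vertebra effects $a_i$ and $b_{ij}$ cancel. Differences from the same patient stay correlated only through $u_i = c_{iA} - c_{iB}$, with

$$\mathrm{ICC}_d = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}, \qquad \sigma_e^2 = \operatorname{Var}(\varepsilon_{ijA} - \varepsilon_{ijB})$$

So the differences are independent only if the method bias does not vary from patient to patient ($\sigma_u^2 = 0$). Positioning or contrast effects that hit one method more than the other would show up in $u_i$.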

#### hlsmith

##### Less is more. Stay pure. Stay poor.
@katxt, my intuition is that you still use a 95% CI if you are looking at equivalency, but if you are looking at non-inferiority you can use 90% because you only care about one side.

#### katxt

##### Well-Known Member
There is a little debate, and 90% is quite a slippery idea, but my intuition is that it is just counterintuitive.
Anyway, it isn't really a 90% CI. That is just a convenient way to show the interval where your 95% non-inferiority and 95% non-superiority intervals are both true simultaneously.
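A small sketch of that point: running two one-sided tests at the 5% level and checking whether the 90% interval sits inside the margin give the same verdict. The data are made up, the function name is hypothetical, the differences are treated as independent, and a normal quantile is used where a t quantile (for example `scipy.stats.t.ppf`) would be more appropriate for small samples.

```python
from statistics import NormalDist, mean, stdev


def tost_two_ways(diffs, margin=1.0, alpha=0.05):
    """Return the verdict from two one-sided tests and from the interval."""
    nd = NormalDist()
    n = len(diffs)
    centre = mean(diffs)
    se = stdev(diffs) / n ** 0.5

    # Two one-sided tests, each at level alpha
    p_lower = 1 - nd.cdf((centre + margin) / se)  # H0: mean <= -margin
    p_upper = nd.cdf((centre - margin) / se)      # H0: mean >= +margin
    by_tests = max(p_lower, p_upper) < alpha

    # (1 - 2 * alpha) interval, i.e. 90% when alpha = 0.05
    z = nd.inv_cdf(1 - alpha)
    lower, upper = centre - z * se, centre + z * se
    by_interval = -margin < lower and upper < margin
    return by_tests, by_interval, (lower, upper)


# Made-up paired differences in mm
tight = [0.2, -0.4, 0.5, 0.1, -0.3, 0.6, 0.0, -0.1, 0.3, 0.4, -0.2, 0.1]
noisy = [1.5, -1.2, 0.9, -0.8, 2.0, -1.7]
print(tost_two_ways(tight))
print(tost_two_ways(noisy))
```

Each one-sided test keeps its own 5% error rate, which is why the interval that matches both of them is the 90% one.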


#### katxt

##### Well-Known Member
> For example, if I perform a sample size calculation, defining an alpha of 0.025 and power of 0.95 with an SD of 0.7 and equivalence limit of 1.0, I end up with a required sample size of 16 per group.

It isn't clear what you have done here, but it is probably worthwhile making sure that your calculation is for the right thing. Apologies if I have got things wrong. You may have a method for equivalence that is perfectly valid. If so, I would find a reference very useful.
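For what it is worth, one formula that does land on 16 is the textbook normal-approximation sample size for an equivalence test between two independent groups with a true difference of zero. This is only a guess at what was calculated, not something the thread confirms, and the function name and the `parallel_groups` switch are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_equivalence(sd, margin, alpha, power, true_diff=0.0, parallel_groups=True):
    """Normal-approximation sample size for an equivalence (TOST) design."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(1 - (1 - power) / 2)  # beta is split over both sides
    core = (z_alpha + z_beta) ** 2 * sd ** 2 / (margin - abs(true_diff)) ** 2
    # Two independent groups need twice the variance term per group
    return ceil(2 * core if parallel_groups else core)


# Inputs quoted in the thread: alpha 0.025, power 0.95, SD 0.7, margin 1.0
print(n_equivalence(0.7, 1.0, 0.025, 0.95))                         # 16 per group
print(n_equivalence(0.7, 1.0, 0.025, 0.95, parallel_groups=False))  # 8 pairs
```

If that is what was done, note that it assumes two independent groups, whereas the study is paired. In the paired version the SD has to be the SD of the within-vertebra differences, and neither version accounts for several vertebrae coming from the same patient.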
Showing equivalence is not simply getting a "no significant difference" on a t test between the two sets of lengths. "No significant difference" is just a face-saving way of saying "we don't know if there is a difference or not". We want to show beyond reasonable doubt that the two means differ by less than 1 mm.
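A quick sketch of the distinction, using made-up summary statistics and a normal approximation (the function name and all numbers are hypothetical):

```python
from statistics import NormalDist


def compare_conclusions(mean_diff, sd_diff, n, margin=1.0, alpha=0.05):
    """Contrast a two-sided difference test with an equivalence check."""
    nd = NormalDist()
    se = sd_diff / n ** 0.5

    # Ordinary two-sided test of "mean difference = 0"
    p_difference = 2 * (1 - nd.cdf(abs(mean_diff) / se))

    # Equivalence check: is the 90% interval inside the margin?
    z = nd.inv_cdf(1 - alpha)
    lower, upper = mean_diff - z * se, mean_diff + z * se
    return {
        "no_significant_difference": p_difference >= alpha,
        "equivalence_shown": -margin < lower and upper < margin,
        "interval": (round(lower, 2), round(upper, 2)),
    }


# Small, noisy study: no significant difference, but equivalence is not shown
print(compare_conclusions(mean_diff=0.6, sd_diff=1.5, n=8))
# Large, precise study: a significant difference, yet well inside +-1 mm
print(compare_conclusions(mean_diff=0.3, sd_diff=0.7, n=400))
```

The two questions can give opposite answers in both directions, which is why the equivalence analysis has to be planned and powered as its own test.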