PhD – Power calculation

#1
Hi all, first post, be gentle.

I'm doing a PhD (just starting out) and writing my Project Approval Form.

As part of that, my prof wrote “n=?” on a draft.

Now, out of my four research objectives, only one really has any quant in it.

I am developing a test for novice users, so one of my objectives will be to compare the novice users against the expert users and look at their levels of agreement.
This is the only place where I can see the need for “n=?”, i.e. how many novices and experts I need for the agreement comparison to have enough power to mean anything.

The thing is… I don't have a hypothesis. It's not a question with a hypothesis and a null.

I think a power calculation requires a hypothesis and a null? Do I just make the hypothesis up (i.e. there will be agreement between the groups, with the null being that there will be no agreement)?
I am lost.

Can anyone help? :cool:
 
#3
As in what type of data?

Essentially, I want to see if two groups (experts and novices) give the same scores to participants who complete a movement screen.

The scores for the screens will probably be based on criteria like 1 = fail, 2 = minor faults, 3 = no faults.

These screens are already out there in the literature. However, as I'm wanting to get end-user buy-in, part of my research is using end users to help select and/or modify the published screens. And as such, I don't know which screens I'm using yet, and thus I don't know what type of data (ordinal, scale, etc.) I will be comparing.

Does that make sense?
 

obh

Active Member
#4
Hi Burnsie,

It doesn't really matter exactly what you will check... what about the following hypotheses:
H0: no difference between experts and novices
H1: there is a difference
(or maybe a one-tailed test instead...)

So you are going to compare 2 groups of users on several tasks?
How many tasks, approximately?

And the mark on each task will be: 1 = fail, 2 = minor faults, 3 = no faults.
Is it possible to give more values, like 1 to 5?
 
#5
I would guess between 3 and 5 screens (trying to keep it quick and simple).
However, the scoring systems will be unknown until it is decided which ones to use.

But yes, two groups on several tasks, looking for agreement.

I "want" agreement. So might flip the Hyp and the Null... but I suppose you'd have to suggest there should be a difference given their levels of training, so maybe, you are correct there
 
#8
Yes, so, Novice 1 might look at 3-4 different screens and give each a score (maybe 1-3, maybe something different… that's an issue in my head).

All the novices do the same.

The experts will do the same.

All rating the same person performing the screening task (the athlete).
 

obh

Active Member
#11
You may start with the two-sample t-test (64 per group to identify a medium effect, or 26 per group to identify a large effect). When you know your way around you can be more accurate.
 

obh

Active Member
#13
Hi Burnsie,

For the power calculation of the two-sample t-test, I used an equal standard deviation assumption; you can use unequal SDs if you have a better assumption.

I would probably use at least 30 per group if possible, so you could use the t-test in most cases (CLT), but it mainly depends on the effect you want to identify.
Actually, you don't really expect novices and experts to have the same results... so maybe just create a confidence interval for the difference.
In this case, the sample size will depend on the margin of error (MOE) you want to achieve.
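If you go the confidence-interval route, here is a minimal sketch of the MOE calculation (the helper moe_n is a made-up name; it assumes equal group sizes, a common known SD, and a normal approximation):

Code:
# n per group so that a CI for the difference of two means
# has a chosen margin of error: MOE = z * sd * sqrt(2/n)
#                            => n   = 2 * (z * sd / MOE)^2
moe_n <- function(moe, sd = 1, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(2 * (z * sd / moe)^2)
}
moe_n(0.5)  # MOE of 0.5 scale points -> 31 per group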

You may use the R code below or http://www.statskingdom.com/sample_size_t_z.html (they give the same results).

Code:
library(MESS)  # power_t_test supports unequal group and SD ratios

# large effect (delta = 0.8 SD): n per group
results1=power_t_test(delta=0.8, sd=1, sig.level =0.05, power=0.8, ratio=1, sd.ratio=1, type="two.sample", alternative="two.sided", df.method="classical")
# medium effect (delta = 0.5 SD): n per group
results2=power_t_test(delta=0.5, sd=1, sig.level =0.05, power=0.8, ratio=1, sd.ratio=1, type="two.sample", alternative="two.sided", df.method="classical")


> results1

     Two-sample t test power calculation

              n = 25.52463
          delta = 0.8
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

> results2

     Two-sample t test power calculation

              n = 63.76576
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
 
#14
Hi.

Firstly, thank you for taking the time with this problem. Sorry for so many questions. I'm not sure how most posters on this forum work... whether they just want the answer and then sod off, or whether they want to know how/why the answer is what it is so they can solve it on their own in the future.

Unfortunately for you.... I'm the latter

So, I understand that:
1 - my data will probably be ordinal (although it may be nominal depending on which screens are selected during the research).
2 - my null will be that there is no difference between the groups (even though, given their experience levels, we might expect one, as you suggest).
3 - it will be a two-tailed hypothesis/test.
4 - alpha will be set at 0.05.
5 - power will be 0.8 (i.e. beta for Type II errors at 0.2); both are industry standard.
6 - ideally I should try to find some similar research where the two groups are compared using similar data, i.e. ordinal data. Until then we have to use an assumed standard deviation.
7 - CI at 95%.

I am confused about your comment in the post above regarding effect size, and how if you want to identify a smaller effect you need fewer participants?? Why is that (is it down to the calculation, such as the SEM etc.? Feel free to signpost me).

:)
 
#15
PS: I understand that I need more participants with a smaller effect size, as I do not want the effect to be undetectable due to the randomness of a smaller sample....

So, my effect size is .8 based on beta?
 

obh

Active Member
#16
Hi Burnsie,

This is a statistics forum, nobody said anything about only one question ... you should ask :)

The effect size for the two-sample t-test is the raw difference: (Mean2 − Mean1).
The standardized effect size is: (Mean2 − Mean1) / standard deviation.

If, for example, the scale is (1, 2, 3, 4, 5):
If the effect size is small, say the mean of the novices is 3.1 and the mean of the experts is 3.2, you will probably need a very large sample size to show the means are not equal; if the effect size is large, say the mean of the novices is 1.4 and the mean of the experts is 4.6, a small sample size will show the means' difference is significant.
All of this, of course, also depends on the standard deviation, which is why Cohen defined the standardized effect size; 0.8 is the value Cohen suggested for a large effect.
If you know the raw effect size you want the test to identify (and the standard deviation), you may use those directly; when you don't, just use a standardized effect size convention.
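To see how fast the required sample size grows as the effect shrinks, here is a minimal sketch using base R's power.t.test (with sd = 1, so delta is effectively the standardized effect size):

Code:
# n per group for 80% power, alpha = 0.05, two-sided test
for (d in c(0.2, 0.5, 0.8)) {
  n <- power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.8)$n
  cat(sprintf("d = %.1f -> n per group = %.0f\n", d, ceiling(n)))
}
# d = 0.2 -> n per group = 394
# d = 0.5 -> n per group = 64
# d = 0.8 -> n per group = 26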

What is the goal of your research?
To show that experts are better than novices?
To show how much better the experts are?

Since we don't really expect the means of the novices and the experts to be identical, you will most likely be able to get a significant result if you take a large enough sample size.
So you may report the test, but it is probably more interesting to look at the effect size and/or the confidence intervals.
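For example, a minimal sketch with invented scores (everything here is made up for illustration):

Code:
set.seed(1)
novice <- sample(1:5, 30, replace = TRUE)  # invented 1-5 ratings
expert <- sample(1:5, 30, replace = TRUE)

tt <- t.test(expert, novice)                 # Welch two-sample t-test
sp <- sqrt((var(expert) + var(novice)) / 2)  # pooled SD (equal n)
d  <- (mean(expert) - mean(novice)) / sp     # Cohen's d
tt$conf.int                                  # CI for the mean difference
d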

PS, You can also look at: http://www.statskingdom.com/doc_pvalue.html
 
#17
The goal of my research is to develop a screening tool that can be used in the real world by real-world end users. Most tools are developed in a lab and don't really work in the real world due to a variety of issues such as available equipment, training, time, etc.

This is why I will be using an expert panel to help me decide which screens to try in a real-world setting (as they know what goes on!), and is thus why I do not know which screens I will be taking forward at this point in time.

To do this, at one point I guess I will need to look at the levels of agreement both between the groups (i.e. novices vs experts) and within them (inter- and intra-rater reliability). Ideally, I'd like novices and experts to agree, which would mean minimal training is needed and the tool is thus more applicable for real-world use.

One of the most popular screens is the Functional Movement Screen (FMS), which has a variety of individual screens that make up a whole score. Each individual screen/movement is scored 1, 2 or 3 (so you could get 21 points in total).

As you suggested in your post above, if there are only a few possible outcomes (1-3), detecting a difference would be really hard; I would need a large group.

Now, I've looked through my reading and found the papers on the FMS that look at inter- and/or intra-rater reliability.

  • One uses only 4 raters (2 novices and 2 experts) but screens 40 participants, using kappa to analyse.
  • Another uses 6 raters (one group) with 39 participants, using Krippendorff's alpha for inter-rater reliability and the ICC for test-retest from the same rater.
  • Another used 20 raters and only 5 participants, using the ICC.
  • Another used 2 (maybe!) raters on 19 participants, using the ICC.
  • Another used 8 raters (all novices), across different days, using the ICC and MDC.
  • Finally, one uses 4 raters (one group) for 20 participants, using the ICC from a repeated-measures ANOVA.

To sum up: well and truly lost now! Is it the number of raters in each group I need to focus on, or the number of screens? For example, if I had 10 raters in each group assessing 10 athletes/participants, and each rater looked at each participant, then I'd get 100 data points per group from this one screen?! That should give me enough data points?

(head…meet wall!).
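For what it's worth, here is a minimal sketch of how I picture the agreement calculation for one expert-novice pair (in R, using the irr package; the ratings are invented for illustration):

Code:
library(irr)  # kappa2() for two raters; kappam.fleiss() for more

# one expert and one novice scoring the same 10 athletes on a 1-3 scale
expert <- c(3, 2, 2, 1, 3, 3, 2, 1, 2, 3)
novice <- c(3, 2, 1, 1, 3, 2, 2, 1, 2, 3)

# weighted kappa ("squared") respects the ordinal 1-3 scale
kappa2(cbind(expert, novice), weight = "squared")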
 

obh

Active Member
#18
Hi Burnsie,

Please try not to break the wall or the head :)

So you are trying to build the best screening tool? What result should the tool give?
If I understand you correctly, you don't really want to compare novices and experts, but rather to work out which tool is best to use?
And you may want to know which tool is best for novices and which is best for experts?
 

noetsi

Fortran must die
#19
First off, n does not apply to any qualitative analysis; you will drive the qualitative people crazy if you do. Second, there are two issues with sample size: does it give you enough statistical power, and can you reasonably generalize from it? You could give a survey to 10,000 sophomores and have lots of statistical power. Can you generalize to the population as a whole that way?

Not in my opinion. But I rarely see this issue raised, so it probably won't come up in your case. You should consider it, though.
 
#20
So you are trying to build the best screening tool? What result should the tool give?
If I understand you correctly, you don't really want to compare novices and experts, but rather to work out which tool is best to use?
And you may want to know which tool is best for novices and which is best for experts?
No, what I'm trying to do is:

1 - get a panel of those in the know to tell me which screens to try (rather than the science boys saying it should be this or that)

2 - get experts to use them and see if they get related/correct results

3 - do the same with novices

4 - compare whether expert 1 gets the same score as novice 1 when looking at participant 1 on screen 1. If they agree, it means that anyone can use the screen!

So, it's working out how many novices and experts I need, and whether the number of participants affects this; as suggested in my post above, I assume it will??