Test selection for a ‘marks out of 10’ quiz: binomial, Fisher, meta-analysis, or something else?
I have been testing a type of depth-of-anaesthesia monitor that works by listening to the EEG (i.e. it’s an electroencephalophone). It’s very much at the pilot stage. I recruited 23 volunteers, gave them a period of training, then tested each of them on 10 random 5-second samples of ‘awake’ or ‘asleep’ sound, asking them to say whether they thought the patient was awake or asleep. The null hypothesis is that my monitor is no better than guesswork. So I have 23 sets of marks out of 10. The vast majority of people (21) scored more than 5; in fact, 16 out of 23 scored 8, 9, or 10 out of 10. What’s the best way to present this data (other than a histogram)? I have looked at my basic stats book and the internet and considered the following, but each seems to have something against it:
1. Simple binomial table. Since only 2 people scored less than 5, my lookup table gives a p-value of <0.001. But can it really be that simple?
2. T-test. To do this, I assumed that there was another group of 23 subjects who all scored 5 out of 10, then did the test assuming 1 tail and 22 DF. But there isn’t another group, my data is not continuous, and the scores from the test group are markedly skewed on the histogram. So is this test valid at all?
3. Meta-analysis. I treated each subject as 1 trial in a meta-analysis. From each subject’s 2x2 table, I calculated ad/bc (the odds ratio) to get an effect size, weighted each trial using inverse variance weights, then combined all the trials to get a Z-statistic. But since some cells contain zero, to get ad/bc I had to ‘give’ half a mark out of another cell (e.g. if the subject got 10/10, this made the table 5, 0, 0, 5, which I changed to 4.5, 0.5, 0.5, 4.5). Incidentally, on no occasion did this ‘benefit’ the alternative hypothesis, as the only zeroes were in ‘wrong answer’ cells. Is this approach valid?
4. Hotelling’s test. A colleague suggested this. I don’t really understand it, but my impression is that it is similar to a t-test, but one can consider each subject’s score individually in some kind of multiple analysis.
5. Chi-square test on the combined 2x2 table from all subjects. This seems wrong because there are not 230 independent scores, but 23 subjects with 10 scores each.
6. Fisher’s exact test on each subject. But what do I do with the 23 p-values I get?
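For reference, the arithmetic behind option 1 can be sketched as below. This is only a minimal sketch: it assumes each subject is reduced to a single success/failure (scored more than 5, or not), that nobody scored exactly 5, and that a one-sided test is wanted. The function name `binomial_upper_tail` is just an illustrative label.

```python
from math import comb


def binomial_upper_tail(successes, n, p=0.5):
    """One-sided P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


n_subjects = 23      # volunteers tested
n_above_chance = 21  # subjects who scored more than 5 out of 10

# Under the null (guessing), a subject is as likely to score above 5 as below 5.
p_value = binomial_upper_tail(n_above_chance, n_subjects)
print(f"one-sided p = {p_value:.2e}")
```

The result is well below 0.001, in line with the lookup table, but it throws away how far above 5 each subject scored.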
Or is there some other way to approach this that I haven’t thought of?
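To make option 3 concrete, here is a rough sketch of the calculation as described: a log odds ratio per subject, inverse variance weights, and a pooled Z-statistic. Several things here are assumptions for illustration only: the tables are written as (a, b, c, d) with b and d... more precisely with b and c as the ‘wrong answer’ cells, the half mark is taken from the ‘right answer’ cell in the same row, a fixed-effect pooling is used, and the three example tables are made up rather than real data.

```python
import math


def shift_half_mark(table):
    """Move half a mark into any empty 'wrong answer' cell (b or c).

    Assumption: the half mark comes from the 'right answer' cell in the
    same row, so (5, 0, 0, 5) becomes (4.5, 0.5, 0.5, 4.5). Zeros in the
    'right answer' cells (a or d) are not handled.
    """
    a, b, c, d = table
    if b == 0:
        a, b = a - 0.5, 0.5
    if c == 0:
        c, d = 0.5, d - 0.5
    return a, b, c, d


def log_odds_ratio(table):
    """Return ln(ad/bc) and its approximate variance 1/a + 1/b + 1/c + 1/d."""
    a, b, c, d = shift_half_mark(table)
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pooled_z(tables):
    """Fixed-effect, inverse-variance pooled Z: sum(w * y) / sqrt(sum(w))."""
    weighted_sum = 0.0
    weight_total = 0.0
    for table in tables:
        y, var = log_odds_ratio(table)
        weight = 1 / var
        weighted_sum += weight * y
        weight_total += weight
    return weighted_sum / math.sqrt(weight_total)


# Made-up tables for three hypothetical subjects (10/10, 8/10, 6/10).
example_tables = [(5, 0, 0, 5), (4, 1, 1, 4), (3, 2, 2, 3)]
print(f"pooled Z = {pooled_z(example_tables):.2f}")
```

Note that a perfect score gets the smallest weight here, because the 0.5 cells inflate its variance.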