P-values for leave-one-out accuracy (on an imbalanced label set)

Dear board members,

I am happy to have found this very nice community :) I have a question and I hope that someone can help me.

I want to evaluate a threshold-based classifier in the leave-one-out (LOO) setting. The classifier assigns each sample to one of two classes, depending on whether or not a certain numerical feature of the sample exceeds a previously learned threshold.

Our aim is to distinguish between organisms that have a certain ability (phenotype+ class) and those that do not (phenotype- class). Our labeled data set consists of 150 organisms, split into 50 phenotype+ and 100 phenotype- organisms. The set is therefore imbalanced.

Let's say the classifier made 140 correct classifications for the 150 held-out test samples in LOO. I wanted to use a binomial test to compute a p-value for observing this result under the null hypothesis that the classifier chooses at random, i.e. with a success probability of 0.5.

I used the R function binom.test(x, n, p) as follows:

binom.test(140, 150, 0.5, alternative = "greater")

But I am not sure whether this is correct, because of the imbalance of the label set. Does the difference in size of the two classes matter? After all, a classifier that always predicts the phenotype- class would already be correct for 100 of the 150 samples (a success rate of about 0.667), well above 0.5.

An alternative would be to define the success probability as 50 / 150 = 0.333, i.e. the fraction of phenotype+ samples in the set. However, this also seems wrong, because correctly identifying a phenotype- organism counts as a success too.
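For illustration, here is a minimal R sketch of the two variants I am weighing: the coin-flip null (p = 0.5) from above, and a null based on the majority-class baseline (100/150, the accuracy of a classifier that always predicts phenotype-). The numbers are the ones from my setting; the variable names are just for this sketch.

```r
# Observed LOO result: 140 correct classifications out of 150 held-out samples
correct <- 140
n <- 150

# Variant 1: null hypothesis = coin-flip classifier (success probability 0.5)
coin_flip <- binom.test(correct, n, p = 0.5, alternative = "greater")

# Variant 2: null hypothesis = majority-class classifier, which is correct
# for the 100 phenotype- samples out of 150 (success probability ~0.667)
majority <- binom.test(correct, n, p = 100 / 150, alternative = "greater")

coin_flip$p.value  # p-value against the 0.5 baseline
majority$p.value   # larger p-value, since this null baseline is stricter
```

The second variant at least accounts for the fact that a trivial majority predictor beats 0.5 on this label set, but I am not sure it is the right null model either.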

How can I improve this approach? In case something is unclear, please ask.

Best regards