OK, sorry for not being clear. Let me try a toy example. Say we have a bag with 1 red ball and 200 white balls. The problem is to find the red ball without using vision (i.e., we can't use color to pick out the red ball). Each ball has many other characteristics (e.g., shape, size, writing on it), and there are other red and white balls outside the bag that we can train on. After we've built an algorithm using those red and white balls and their features, we rank the balls in the bag by their likelihood of being red. What do you think is the best way of showing performance?

The way we tried to demonstrate performance was to report how often the red ball lands in the top Y% of the ranking, either via cross-validation or on several new bags we haven't trained on. Each bag has exactly one red ball but can have different numbers of white balls. For example, we applied the algorithm to 10 bags, each with 1 red ball and 200 white balls. When we ranked the balls in each bag and took the top 10% (about 20 balls), the red ball was in that top slice in 8 out of 10 bags. So we say we predicted the red ball correctly 80% of the time within the top 10% of the ranking. Is there a better way to measure or describe performance? Given that there is only one true positive per bag, is the concept of a false positive rate even relevant?
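In case it helps make the metric concrete, here is a small sketch of the "hit rate within the top Y%" measure described above. The function names (`hit_at_top_frac`, `hit_rate`) and the synthetic scores are my own illustration, not part of any particular library:

```python
import numpy as np

def hit_at_top_frac(scores, red_index, frac=0.10):
    """True if the ball at red_index ranks within the top `frac` of scores."""
    k = max(1, int(round(len(scores) * frac)))
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return red_index in top_k

def hit_rate(bags, frac=0.10):
    """Fraction of bags whose red ball lands in the top `frac` of the ranking.

    `bags` is a list of (scores, red_index) pairs, one per bag.
    """
    return sum(hit_at_top_frac(s, r, frac) for s, r in bags) / len(bags)

# Toy example: 3 bags of 201 balls each; the red ball is at index 0 and
# scores 0.95, 0.5, and 0.05, while the white balls span 0.1 to 0.9.
bags = []
for red_score in (0.95, 0.5, 0.05):
    scores = np.linspace(0.1, 0.9, 201)
    scores[0] = red_score
    bags.append((scores, 0))

print(hit_rate(bags))  # 1/3: only the first bag's red ball makes the top 10%
```

This is essentially "recall at k" with k set per bag to 10% of the bag size, which is one reason a rank-based summary (e.g., the red ball's rank or percentile in each bag) may be a natural alternative when each bag has only one positive.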