method  test_set_size  correct  accuracy
A       898            188      0.209
A+      898            210      0.234
A++     898            217      0.242
B       1021           230      0.225
C       926            214      0.231
ABC     1099           267      0.243
D       1117           232      0.208
ABCD    1122           280      0.250

Above are the results from an experiment I conducted. It is about an algorithm that tries to predict which value a certain variable is going to have.

Column one lists the variants of the algorithm.

Column two lists the number of observations that were used to test the algorithm's performance.

Column three lists the number of correct predictions.

Column four lists the individual methods' accuracy (column three divided by column two).
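For reference, column four can be recomputed from columns two and three. Here is a minimal sketch in Python. The `results` list and the `accuracy` helper are names made up for illustration, and the numbers are copied from the table:

```python
# Rows copied from the table: (method, test_set_size, correct)
results = [
    ("A", 898, 188),
    ("A+", 898, 210),
    ("A++", 898, 217),
    ("B", 1021, 230),
    ("C", 926, 214),
    ("ABC", 1099, 267),
    ("D", 1117, 232),
    ("ABCD", 1122, 280),
]


def accuracy(correct, test_set_size):
    """Column four: correct predictions divided by test set size."""
    return correct / test_set_size


# Round to three decimals, as in the table
accuracies = {
    method: round(accuracy(correct, size), 3)
    for method, size, correct in results
}

for method, value in accuracies.items():
    print(f"{method:5s} {value:.3f}")
```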

I now have to compute statistical significance. (Yes, have to. I am being forced by reviewers.)

I'd like to know:

1) Is the improvement from A to A+ significant?

2) Is the improvement from A+ to A++ significant?

3) If they are not, is the improvement from A to A++ significant?

4) Are the differences between A, A++, B and C significant? (Or could it be just by chance that, e.g., C is better than B?)

5) Is the improvement from D to ABCD significant?

Now here is the first question:

Which measures do I use? t test? ANOVA?

And the second question, which might be a bit philosophical:

What is the point of this?

Let's take the case of methods A and A+ in the table above. I used the exact same test sets, and the results do increase. THEY INCREASE. PERIOD. There are some cases where A+ works better than A. I designed these methods, I know what they do, and it is impossible that A+ would perform worse than A. A+ has to perform equally well or better than A. I performed the experiment to see whether A+ really performs better (as opposed to equally well), and to see how much of an improvement we get. So what does this significance value tell me? As far as I am concerned, it is proven that A+ is better than A, because we found some cases where A+ performs better and it is impossible to find cases where A+ performs worse.
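To make the A versus A+ case concrete, here is a sketch of one possible paired comparison, an exact sign test on the cases where the two methods disagree (often called an exact McNemar test). I am not claiming this is the test the reviewers want. It assumes what I stated above: both methods ran on the same 898 observations, and A+ is never wrong where A is right. Under that assumption the disagreements are the 210 - 188 = 22 cases that only A+ gets right. The function and variable names are made up for illustration.

```python
from math import comb


def exact_mcnemar_p(b, c):
    """Two-sided exact sign test on discordant pairs.

    b: cases only the second method gets right
    c: cases only the first method gets right
    Null hypothesis: a disagreement is equally likely to go either way.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumption from the text: A+ never fails where A succeeds,
# so every extra correct prediction is a discordant pair in favour of A+.
only_a_plus_correct = 210 - 188  # 22
only_a_correct = 0

p_value = exact_mcnemar_p(only_a_plus_correct, only_a_correct)
print(f"discordant pairs: {only_a_plus_correct} vs {only_a_correct}")
print(f"two-sided p-value: {p_value:.3g}")
```

The test only looks at the cases where the methods disagree, so it uses the shared test set directly instead of treating the two accuracies as coming from independent samples.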

I'd be very grateful for any help!

Regards,

Marc