I have different implementations of several classification models (Discriminant Analysis, SVM, Neural Networks, Decision Trees, etc.), 40 implementations in total, and I need to compare them on two different sets of variables.

I run each implementation on both sets of variables: set A contains 20 variables, and set B is the same as A plus 5 additional variables. For each run the performance measure is accuracy.

In addition, I run each implementation with 3 resampling methods: 10-fold cross-validation, bootstrap, and leave-one-out.

Now I'm stuck trying to draw any conclusion, because for some implementations I get better results on set A or on set B depending on the resampling method. Even within the same model family, the results are inconsistent: for Decision Trees, for example, one implementation does better on set A while others do better on set B, again depending on the resampling method.

Is there any statistical test I could apply, perhaps comparing A and B for each model? For example, I could take the mean accuracy over all implementations of one model (e.g. the mean accuracy of Decision Trees on sets A and B) and statistically compare it with the mean accuracy of SVM.
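To make the idea concrete, here is a rough sketch of the kind of paired comparison I have in mind, using a Wilcoxon signed-rank test on the per-implementation accuracy differences between set A and set B (the accuracy values below are made up just to illustrate the shape of the data, not my actual results):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Made-up placeholder: accuracy of each of the 40 implementations
# on set A and on set B, under one fixed resampling method.
acc_A = rng.uniform(0.70, 0.90, size=40)
acc_B = acc_A + rng.normal(0.01, 0.02, size=40)

# Paired, non-parametric test: each implementation is its own pair,
# so the test looks at the per-implementation differences acc_B - acc_A.
stat, p = wilcoxon(acc_B, acc_A)
print(f"statistic={stat:.1f}, p-value={p:.4f}")
```

I picked a paired non-parametric test here only because the same implementation is run on both sets, so the accuracies come in natural pairs; I don't know whether this is the right choice given the multiple resampling methods.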