Test wanted for comparing algorithms

Study design
Suppose I want to compare algorithm runtime performance (measured in seconds) of different algorithms for the same problem type. There are several problem instances for this problem type on which the algorithms are tested and performance is measured.

Possible tests
As far as I know, the following is true (please correct me if not!):
  1. We want to compare 2 algorithms on 30 independent test instances: In this case, one could use an unpaired Student's t-test or a Mann–Whitney U test.
  2. We want to compare 3 or more algorithms on 30 independent test instances: I know that pairwise comparisons with Student's t-tests should not be done (the risk of false positives grows with the number of comparisons), so one could use ANOVA or the Friedman / Quade tests to test H0: µ1 = µ2 = µ3. But how can one test for "the fastest" algorithm in case H0 has been rejected?
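For case 1, a minimal sketch of the two unpaired tests in SciPy (the runtimes here are synthetic placeholder data, not real measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical runtimes (seconds) of two algorithms on 30 independent instances
runtime_a = rng.gamma(shape=2.0, scale=1.0, size=30)
runtime_b = rng.gamma(shape=2.0, scale=1.2, size=30)

# Unpaired Student's t-test (assumes roughly normal runtimes)
t_stat, t_p = stats.ttest_ind(runtime_a, runtime_b)

# Mann-Whitney U test (rank-based, no normality assumption)
u_stat, u_p = stats.mannwhitneyu(runtime_a, runtime_b, alternative="two-sided")

print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
```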

Now each instance has a property "size", which is a non-negative integer. Let us assume that for each size in {10,20,30}, there are 10 instances of this size (30 instances in total). In this case, I have no idea which tests could be used for:
  1. The comparison of 2 algorithms on 30 test instances, 10 instances for each size in {10,20,30}.
  2. The comparison of 3 or more algorithms on 30 test instances, 10 instances for each size in {10,20,30}.
It sounds as if the data are matched, so a paired test in case 1 (paired t-test or Wilcoxon signed-rank test) and a repeated measures test in case 2 (repeated measures anova or Friedman test) would be suitable. In case 2 you can then use paired t or Wilcoxon tests as post hoc tests with a suitably reduced critical p value (Bonferroni correction is common and easy to understand).
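A sketch of that workflow in SciPy, on synthetic placeholder runtimes: a paired test for two algorithms, then the Friedman omnibus test for three, followed by Bonferroni-corrected pairwise Wilcoxon post hoc tests.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
base = rng.gamma(2.0, 1.0, size=n)  # per-instance difficulty (hypothetical)
times = {
    "A": base * rng.normal(1.0, 0.05, n),
    "B": base * rng.normal(1.1, 0.05, n),
    "C": base * rng.normal(1.2, 0.05, n),
}

# Case 1: two algorithms, matched on the same instances
t_stat, t_p = stats.ttest_rel(times["A"], times["B"])       # paired t-test
w_stat, w_p = stats.wilcoxon(times["A"], times["B"])        # Wilcoxon signed-rank

# Case 2: three algorithms, Friedman omnibus test
f_stat, f_p = stats.friedmanchisquare(times["A"], times["B"], times["C"])

# Post hoc: pairwise Wilcoxon tests with a Bonferroni-reduced threshold
pairs = list(itertools.combinations(times, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction for 3 comparisons
for a, b in pairs:
    _, p = stats.wilcoxon(times[a], times[b])
    print(f"{a} vs {b}: p = {p:.4f} (compare against alpha = {alpha:.4f})")
```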
Is size a scale? Is the difference between size 10 and size 20 much the same as between sizes 20 and 30?
Oh, the data is indeed matched, so a paired t-test / Wilcoxon signed-rank test should be used. Thank you!

I am not sure about the term "scale" in this context, so let me give you an example. Suppose we want to find the shortest cyclic tour visiting n cities (https://en.wikipedia.org/wiki/Travelling_salesman_problem). In this case, the integer n would be the size of the instance, specifying how many cities must be visited. Instances of the same size differ in their city coordinates.

So to answer your question, I would say that the difference between sizes 10 and 20 is the same as between sizes 20 and 30. On the other hand, what we measure to compare the algorithms is still the runtime, and there doubling the instance size might result in an exponential increase in runtime.

Maybe the question is what I want to achieve.
  1. The test instances just have different sizes, but I do not want specific results for each size. In that case, I probably could group the 10 instances of each of the 3 sizes together and treat them as 30 instances as was discussed in the first part of the original question. Here I would try to find the "overall best algorithm".
  2. If I additionally want specific results for each size such as "algorithm A performs better on small instances, algorithm B better on large instances" (which is what I would like), is then the size a second independent variable? Or can I just divide the 30 instances into 3 groups (one for each size) and perform tests for each size separately?
The structure of your test looks like a two way repeated measures anova. One factor is algorithm and the other is size. If the data is suitable, then looking at the anova interaction will also let you answer questions of the type in part 2 above. Try this and see if the residuals are normal and equally spread across the predicted values.
If residuals aren't well behaved, you can try a transformation of the times and see if that works.
If not, there is no Friedman type test for the two way repeated measures anova as far as I know. You can try a permutation anova.
If the worst comes to the worst, you can do a series of smaller tests as you suggest, but you need to guard against the increased risk of false positives produced by multiple p values. (Bonferroni again.)
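A rough sketch of the permutation idea, on synthetic placeholder data: shuffle the algorithm labels within each instance (respecting the repeated measures structure) and compare the observed spread of the algorithm means against its permutation distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n_inst, n_alg = 30, 3
base = rng.gamma(2.0, 1.0, size=(n_inst, 1))  # per-instance difficulty
# Hypothetical runtimes: each row is one instance, each column one algorithm
times = base * (1.0 + 0.1 * np.arange(n_alg)) * rng.normal(1.0, 0.05, (n_inst, n_alg))

def effect_stat(x):
    # Variance of the algorithm means: large if the algorithms really differ
    return np.var(x.mean(axis=0))

observed = effect_stat(times)
n_perm = 2000
count = 0
for _ in range(n_perm):
    # Permute labels independently within each instance (row)
    permuted = np.array([rng.permutation(row) for row in times])
    if effect_stat(permuted) >= observed:
        count += 1
p_value = (count + 1) / (n_perm + 1)  # add-one correction avoids p = 0
print(f"permutation p = {p_value:.4f}")
```

This only tests the main effect of algorithm; a permutation scheme for the interaction needs more care, but the same within-instance shuffling idea applies.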
Thank you, this is of great help! Could you explain to me how your advice was influenced by my answer to your question regarding the scale property of the size? Would you consider size a scale in this context?

I will probably need some time to educate myself on the two way repeated measures anova / residuals / transformation procedures, but I will definitely try this out.
At the simplest level, variables can be either numerical (on a scale of some sort) or categorical. So, algorithm type is categorical. Size can be thought of as a category (10/small, 20/medium, or 30/large) or as a number on a scale of perhaps 1 to 50, say. It sounds like the former would be most useful in your case. If you have two categorical independent variables, algorithm and size, you do a two way (repeated measures) anova. If you have a numerical and a categorical independent variable you do a (repeated measures) ancova (analysis of covariance). They're really two versions of a more general idea called a General Linear Model.
If you have suitable software, I suggest trying a two way repeated measures anova. Calculate the means for each size/algorithm combination. Calculate the residuals. Plot the residuals against the means for each group. Check visually whether the spread of the residuals is fairly even on each side of the mean, and whether the spread is much the same for each group. A lot of software will do this for you if you ask.
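The residual check described above can be sketched with plain NumPy (synthetic placeholder runtimes; in practice you would plot the residuals, but comparing the residual spread per cell numerically shows the same thing):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [10, 20, 30]
algorithms = ["A", "B"]

# Build (size, algorithm, runtime) records: 10 instances per size,
# with hypothetical runtimes that grow with size and differ by algorithm
records = []
for size in sizes:
    for alg in algorithms:
        scale = size / 10 * (1.0 if alg == "A" else 1.2)
        for t in rng.gamma(2.0, scale, size=10):
            records.append((size, alg, t))

# Group runtimes by size/algorithm cell
cells = {}
for size, alg, t in records:
    cells.setdefault((size, alg), []).append(t)

# Cell means and residual spread (SD of runtime minus cell mean)
for (size, alg), ts in sorted(cells.items()):
    ts = np.array(ts)
    mean = ts.mean()
    resid_sd = (ts - mean).std(ddof=1)
    print(f"size={size} alg={alg}: mean={mean:6.2f}  residual SD={resid_sd:5.2f}")
# If the residual SDs grow with the cell means, the anova assumption of
# equal spread is violated and a log transform of the times may help.
```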