# Test wanted for comparing algorithms

#### eidooo

##### New Member
Study design
Suppose I want to compare algorithm runtime performance (measured in seconds) of different algorithms for the same problem type. There are several problem instances for this problem type on which the algorithms are tested and performance is measured.

Possible tests
As far as I know, the following is true (please correct me if not!):
1. We want to compare 2 algorithms on 30 independent test instances: In this case, one could use an unpaired Student's t-test or Mann–Whitney U test.
2. We want to compare 3 or more algorithms on 30 independent test instances: I know that pairwise comparison with Student's t-test should not be done, so one could use ANOVA or Friedman / Quade tests to test for H0: µ1 = µ2 = µ3. But how can one test for "the fastest" algorithm in case H0 has been rejected?
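In Python, the two-algorithm setup from point 1 might be sketched like this (the runtimes are simulated just to illustrate the calls; in practice they would be the measured seconds per instance):

```python
# Hedged sketch of the two-algorithm case with SciPy.
# runtimes_a / runtimes_b are hypothetical per-instance runtimes in seconds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
runtimes_a = rng.gamma(shape=2.0, scale=1.0, size=30)   # algorithm A on 30 instances
runtimes_b = rng.gamma(shape=2.0, scale=1.2, size=30)   # algorithm B on 30 instances

# Treating the 30 measurements per algorithm as independent samples:
t_stat, t_p = stats.ttest_ind(runtimes_a, runtimes_b)
u_stat, u_p = stats.mannwhitneyu(runtimes_a, runtimes_b)
print(f"unpaired t-test p={t_p:.3f}, Mann-Whitney U p={u_p:.3f}")
```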

Variant
Now each instance has a property "size", which is a non-negative integer. Let us assume that for each size in {10,20,30}, there are 10 instances of this size (30 instances in total). In this case, I have no idea which tests could be used for:
1. The comparison of 2 algorithms on 30 test instances, 10 instances for each size in {10,20,30}.
2. The comparison of 3 or more algorithms on 30 test instances, 10 instances for each size in {10,20,30}.

#### katxt

##### Active Member
It sounds as if the data is matched, so a paired test in case 1 (t or Wilcoxon) and a repeated measures test in case 2 (repeated measures anova or Friedman) would be suitable. In case 2 you can then use t or Wilcoxon as post hoc tests with a suitably reduced critical p value (Bonferroni is common and easy to understand).
Is size a scale? Is the difference between size 10 and size 20 much the same as between sizes 20 and 30?
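With scipy.stats, that workflow could be sketched as follows (the matched runtimes are simulated here; on real data you would load your per-instance measurements instead):

```python
# Sketch of the matched-data workflow: paired tests for 2 algorithms,
# Friedman omnibus test for 3, then Bonferroni-corrected pairwise Wilcoxon.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_instances = 30
# runtimes[i, j]: runtime of algorithm j on instance i (3 algorithms, matched instances)
base = rng.gamma(2.0, 1.0, size=n_instances)            # per-instance difficulty
runtimes = np.column_stack([base * f + rng.normal(0, 0.1, n_instances)
                            for f in (1.0, 1.1, 1.3)])

# Case 1 (two algorithms, matched instances): paired tests
t_stat, t_p = stats.ttest_rel(runtimes[:, 0], runtimes[:, 1])
w_stat, w_p = stats.wilcoxon(runtimes[:, 0], runtimes[:, 1])

# Case 2 (three algorithms): Friedman omnibus test
f_stat, f_p = stats.friedmanchisquare(*runtimes.T)

# Post hoc: pairwise Wilcoxon with a Bonferroni-reduced critical p value
pairs = list(itertools.combinations(range(runtimes.shape[1]), 2))
alpha = 0.05 / len(pairs)                               # Bonferroni correction
for a, b in pairs:
    _, p = stats.wilcoxon(runtimes[:, a], runtimes[:, b])
    print(f"algorithms {a} vs {b}: p={p:.4f}, significant at {alpha:.4f}: {p < alpha}")
```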

#### eidooo

##### New Member
Oh, the data is indeed matched, so a paired t-test / Wilcoxon signed-rank test should be used. Thank you!

I am not sure about the term "scale" in this context, so let me give you an example. Suppose we want to find the shortest cyclic tour visiting n cities (https://en.wikipedia.org/wiki/Travelling_salesman_problem). In this case, the integer n would be the size of the instance, specifying how many cities must be visited. Instances of the same size differ in terms of city coordinates.

So to answer your question, I would say that the difference between sizes 10 and 20 is the same as between sizes 20 and 30. On the other hand, what we measure to compare the algorithms is still the runtime. And here doubling the instance size might result in an exponential increase of runtime.

Maybe the question is what I want to achieve.
1. The test instances just have different sizes, but I do not want specific results for each size. In that case, I probably could group the 10 instances of each of the 3 sizes together and treat them as 30 instances as was discussed in the first part of the original question. Here I would try to find the "overall best algorithm".
2. If I additionally want specific results for each size such as "algorithm A performs better on small instances, algorithm B better on large instances" (which is what I would like), is then the size a second independent variable? Or can I just divide the 30 instances into 3 groups (one for each size) and perform tests for each size separately?
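The per-size approach from option 2 could be sketched like this (simulated data for two hypothetical algorithms A and B; the Bonferroni guard accounts for running one test per size):

```python
# Sketch: separate paired comparisons per size, with a Bonferroni
# correction across the three sizes (data simulated for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sizes = (10, 20, 30)
alpha = 0.05 / len(sizes)                               # Bonferroni across sizes
for s in sizes:
    a = rng.gamma(2.0, s / 10.0, size=10)               # algorithm A, 10 instances
    b = a * 1.2 + rng.normal(0, 0.1, size=10)           # algorithm B (matched)
    _, p = stats.wilcoxon(a, b)
    print(f"size {s}: Wilcoxon p={p:.4f}, significant at {alpha:.4f}: {p < alpha}")
```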


#### katxt

##### Active Member
The structure of your test looks like a two way repeated measures anova. One factor is algorithm and the other is size. If the data is suitable, then looking at the anova interaction will also let you answer questions of the type in part 2 above. Try this and see if the residuals are normal and equally spread across the predicted values.
If residuals aren't well behaved, you can try a transformation of the times and see if that works.
If not, there is no Friedman type test for the two way repeated measures anova as far as I know. You can try a permutation anova.
If the worst comes to the worst, you can do a series of smaller tests as you suggest, but you need to guard against the increased risk of false positives produced by multiple p values. (Bonferroni again.)
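A minimal permutation test for an overall algorithm effect could look like this (a sketch with simulated data; the statistic, variance of the algorithm means, is one of several reasonable choices):

```python
# Permutation test for an overall algorithm effect with matched instances.
# Under H0 the algorithm labels are exchangeable within each instance, so
# we permute each row independently and compare the spread of column means.
import numpy as np

rng = np.random.default_rng(3)
n_instances, n_algorithms = 30, 3
base = rng.gamma(2.0, 1.0, size=(n_instances, 1))       # per-instance difficulty
effects = np.array([1.0, 1.1, 1.4])                     # hypothetical algorithm factors
runtimes = base * effects + rng.normal(0, 0.1, (n_instances, n_algorithms))

def spread(x):
    return x.mean(axis=0).var()                         # variance of algorithm means

observed = spread(runtimes)
n_perm, count = 2000, 0
for _ in range(n_perm):
    permuted = rng.permuted(runtimes, axis=1)           # shuffle within each instance
    count += spread(permuted) >= observed
p_value = (count + 1) / (n_perm + 1)
print(f"permutation p-value: {p_value:.4f}")
```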

#### eidooo

##### New Member
Thank you, this is of great help! Could you explain how your advice was influenced by my answer to your question regarding the scale property of the size? Would you consider size a scale in this context?

I will probably need some time to educate myself on the two way repeated measures anova / residuals / transformation procedures, but I will definitely try this out.

#### katxt

##### Active Member
At the simplest level, variables can be either numerical (on a scale of some sort), or categorical. So, algorithm type is categorical. Size can be thought of as a category, (10/small, 20/medium, or 30/large) or as a number on a scale of perhaps 1 to 50, say. It sounds like the former would be most useful in your case. If you have two categorical independent variables, algorithm and size, you do a two way (repeated measures) anova. If you have a numerical and a categorical independent variable you do a (repeated measures) ancova (analysis of covariance). They're really two versions of a more general idea called a General Linear Model.
If you have suitable software, I suggest trying a two way repeated measures anova. Calculate the means for each size/algorithm combination. Calculate the residuals. Plot the residuals against the means for each group. Check visually if the spread of the residuals is fairly even on each side of the mean, and the spread is much the same for each group. A lot of software will do this for you if you ask.
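The residual check can also be done by hand with NumPy; here is a sketch on simulated data (the plotting call is indicated as a comment so the snippet runs anywhere):

```python
# Sketch of the residual check: cell means per size/algorithm combination,
# residuals against those means (data simulated for two algorithms A and B).
import numpy as np

rng = np.random.default_rng(2)
sizes = np.repeat([10, 20, 30], 10)                     # 10 instances per size
runtimes = {alg: np.exp(sizes / 15.0) * rng.lognormal(0, 0.2, sizes.size) * f
            for alg, f in (("A", 1.0), ("B", 1.2))}

means, residuals = [], []
for alg, y in runtimes.items():
    for s in (10, 20, 30):
        cell = y[sizes == s]                            # one size/algorithm cell
        m = cell.mean()
        means.append(np.full(cell.size, m))
        residuals.append(cell - m)
means = np.concatenate(means)
residuals = np.concatenate(residuals)

# Visual check: spread of residuals should be roughly even around each cell mean.
# import matplotlib.pyplot as plt; plt.scatter(means, residuals); plt.show()
print("cells:", len(set(means)), "observations:", residuals.size)
```

If the spread of the residuals grows with the cell mean, that is the cue for the transformation (e.g. log of the runtimes) mentioned above.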

#### eidooo

##### New Member
Ok, that makes sense. Using suitable software should not be an issue, I just need to understand the study design properly.

Question 1: Nature of IV "size"
If I am not mistaken, a two way repeated measures anova requires two within-subjects IVs. I have a question about the within-subjects nature of the IV "size". For this purpose, we can put the algorithm type aside for a moment.

"Size" is a property of the instance, hence each instance has exactly one size. Once two instances have different sizes, they are different instances. Because I cannot test the same instances for multiple sizes, my understanding would be that "size" is a between-subjects IV (and algorithm type is still a within-subjects IV). In that case, would a two way mixed anova be more appropriate?

Question 2: "Solution quality" as DV
Suppose we want to measure the solution quality (also on an interval scale) instead of the runtime in seconds. In that case, one sets a fixed time limit and measures the solution quality after the time limit has been reached.

The time to reach reasonably good solutions varies a lot with the instance size (e.g. after 1 hour all instances of sizes 10 and 15 may be solved optimally by all algorithms (hence yielding the same solution quality), but you only get terrible solutions for instances of size 20). Therefore, it seems to make sense to adjust the time limit to the instance size and use a separate time limit for each group of instances of the same size. Can one still apply anova in this case or are conditions between sizes (i.e. the time limits) not allowed to vary in that way?


#### katxt

##### Active Member
Q1. That's right. Size is between subjects. Another way of looking at it is a nested anova with instance nested in size.
Q2. This needs thinking about. One problem I can see is that the responses are unlikely to be normal because of so many 100% quality responses.

#### eidooo

##### New Member
Q1: Thank you for the confirmation. In this case, I will go with a mixed anova.

Q2: Exactly, normality would most probably be an issue. Would it be appropriate to set suitable time limits for each size? If one sets the time limit low enough for each size, one can prevent all responses having 100% quality and I would expect the data to be normally distributed for each pair (size, algorithm). I am just unsure if one is allowed to set different conditions for different sizes, but since size is a between-subjects IV, I think this would be ok. This should be similar to different sizes leading to massively different running times when comparing running times instead of solution quality (instances of small sizes are just solved faster / yield better solutions in a fixed amount of time).