Advice for analyzing a small sample

I am currently working on my doctoral thesis in computer science and I have data from an experiment that I want to analyze, but I don't want to run the risk of over-analyzing my results while I stumble around trying to figure out which tests are appropriate. Hence, I want to develop a solid, respectable protocol for what tests I'm going to employ BEFORE opening up the data files, and then stick to that protocol.

In a perfect world, this plan would have been worked out BEFORE collecting the data, but the opportunity to access the particular population I studied came up quickly and there wasn't time. Due to the limited window of opportunity, and the time required to collect each sample, I have a small population here. Just 10 people. And since the window has closed, I can't go back and raise my sample size to 30, either. I'm stuck with what I've got. What I'm now looking for is some creative ways to tease meaningful insights from this limited data, and to do so in a defensible, rigorous manner.

The experiment involved showing each user a series of 24 images, and then asking them to score each one on several different subjective criteria, such as quality of the image's composition, level of emotional response to the image, artistic merit of the image's color palette, etc. They're entirely subjective assessments, and each attribute was scored for each image by each user, on a 5-point scale.

The images in the trial were collected using two different algorithms, and the respondents evaluated them in a mingled collection, presented in random order. Users had no way of knowing which images were chosen by which algorithm. In fact, users don't even know the point of the experiment. They were just asked to evaluate for the subjective criteria.

What I want to do is test for correlation between the scores given for the different attributes and the algorithm used to collect the images. Is it true that one algorithm generates a collection that scores better than the other when judged by humans? But since my sample is so small (10), I can't use Pearson's chi-squared test, because the expected count in each of the 5 score categories under chance is only 2, and the chi-squared approximation calls for expected counts of at least 5.
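To make the expected-count problem concrete, here's a minimal sketch using SciPy's `chi2_contingency`, which returns the expected cell counts for a table. The counts below are entirely made up for illustration (2 rows for the algorithms, 5 columns for the score categories, 10 judges per row); the point is only that every expected count lands well below the rule-of-thumb threshold of 5.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x5 table: rows = algorithms A/B, columns = scores 1..5,
# cells = number of judges giving that score (10 judges per row).
table = np.array([
    [1, 2, 3, 2, 2],   # Algorithm A
    [2, 3, 2, 2, 1],   # Algorithm B
])

chi2, p, dof, expected = chi2_contingency(table)

# Rule of thumb: the chi-squared approximation is unreliable when
# expected cell counts fall below ~5.  Here every cell violates it.
print(expected)
print((expected < 5).all())
```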

Summarizing the data: I have 10 participants who each scored 24 images (drawn from two equal-sized classes) for 5 subjective attributes on a 5-point scale.

Here are my specific questions:

1) What's the second-best way to explore correlation between the attribute scores and the two population classes? (I realize the BEST way would be to increase my sample size, but the conditions under which the data was collected cannot be reproduced, so I'm stuck with finding another method.) Should I be collapsing my Likert scale into a nominal scale? Is there a different correlation test that should be used for small sample sizes?

2) Showing that one particular attribute correlates well with the two algorithm classes would be nice, but it would be even better if I could show that the attributes correlate collectively as well. I presume that the best way to do this is to aggregate the 5 attribute scores using some kind of "combined score function" and then to test correlation between that aggregate score and the two classes, but is there a more accepted way of doing this?

3) What ideas do people have for how to visualize these results? One method I've explored is to show two histograms of the distribution of responses for a given attribute, one graph for each class, and drawing attention to any changes in basic shape of the graph, such as clustering, skew shift, modality changes, etc.

4) What have I missed? I'm a fish out of water when it comes to stats, and I fully expect that there are some standard evaluations or analyses that should be applied to data like mine that I'm not even aware of.
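As a rough sketch of the per-class histogram idea in question 3: the binned counts can be computed with NumPy and then handed to any plotting library as side-by-side bar heights. The scores below are randomly generated placeholders (120 ratings per algorithm, i.e. 10 judges times 12 images, is an assumption about how the data would be pooled), not real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for one attribute: 120 ratings per algorithm
# (10 judges x 12 images), on the 1-5 scale.
scores_a = rng.integers(1, 6, size=120)
scores_b = rng.integers(1, 6, size=120)

bins = np.arange(0.5, 6.5)  # one bin per Likert category, centred on 1..5
counts_a, _ = np.histogram(scores_a, bins=bins)
counts_b, _ = np.histogram(scores_b, bins=bins)

# counts_a and counts_b are the two histograms to draw and compare
# for shape changes (clustering, skew shift, modality).
print(counts_a, counts_b)
```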

I hope this doesn't constitute looking for someone to do my homework for me. I'm just looking for a guide through the wild and woolly forest of statistical testing theory. References to appropriate papers to read, tests to apply, or questions to investigate are all I'm hoping for here, and will be very much appreciated.

If I understand correctly, you have a series of 2x5 contingency tables to analyse? Two rows for the two algorithms, and 5 columns for the scores? I would suggest Fisher's Exact Test. The website below will let you perform exact tests on up to 6x6 contingency tables...
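For what it's worth, SciPy's `fisher_exact` only handles 2x2 tables, so one option (my assumption, not something specified above) is to collapse the 5-point scale into "low" (1-2) vs "high" (4-5) counts; for the full 2x5 table you'd want an exact RxC implementation such as the website mentioned, or a permutation approach. A minimal sketch with made-up counts:

```python
from scipy.stats import fisher_exact

# Hypothetical counts after collapsing the 5-point scale into
# "low" (scores 1-2) vs "high" (scores 4-5) for each algorithm.
#          low  high
table = [[  4,   6],   # Algorithm A
         [  8,   2]]   # Algorithm B

# Exact test of independence between algorithm and low/high score.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```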

You could also try a nonparametric analogue of the t-test, such as the Mann-Whitney U test, to compare the score distributions between the two algorithms.
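A sketch of that Mann-Whitney comparison, with randomly generated placeholder scores (the group sizes of 12 per algorithm are illustrative; pooling over judges would give larger groups):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Hypothetical scores for one attribute, 12 images per algorithm.
alg_a = rng.integers(1, 6, size=12)
alg_b = rng.integers(1, 6, size=12)

# Null hypothesis: the two score distributions are equal.
stat, p = mannwhitneyu(alg_a, alg_b, alternative="two-sided")
print(stat, p)
```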

Hope that helps.
That was exactly what I was looking for. Thank you.

But now, having sniffed along that path a little further, I have a trickier question. As I understand them, these significance tests all rely on comparing the likelihood of the observed distribution against a chance-caused distribution, and they all have an in-built assumption that the chance distribution is uniform. But what if it isn't?

In my situation, I'm dealing with subjective scores assigned to art objects by human judges, and I suspect that due simply to central tendency bias, the null-hypothesis scores are more likely to be normally distributed than uniformly, and I haven't got a clue how to adapt to this difference. Is there a different test to employ for that?

But worse, this raises a subtle issue. Suppose there IS a test with a Gaussian distribution assumption. What are the chances that my scenario has an exactly Gaussian distribution? Pretty small, I'd bet. So even a test with an assumed Gaussian distribution for the null-hypothesis is likely to be wrong, and there's no way of telling how wrong. This suggests to me that I've framed the data incorrectly, as it's too sensitive to knowing the proper distribution function.

So maybe I shouldn't actually be trying to assess the likelihood of the particular score distributions I've collected, but rather, I should be trying to measure the likelihood of the DIFFERENCES I've measured between the two populations. (The hypothesis I'm studying states that one algorithm will result in a higher set of scores than the other.)

My null-hypothesis, then, is that there is no difference between the two sets of scores (Alg A vs Alg B) and so any measured differences must be from chance.

This suggests that my data should be assessed as a table of score differentials, like this:

   | C1  | C2  | C3  | C4  | C5
J1 | D11 | D12 | D13 | ...
J2 | D21 | D22 | ...
J3 | D31 | D32 | ...

D11 is the total score given by judge 1, for criterion 1, to all images selected by Algorithm A, minus the total for all images selected by Algorithm B. This gives me an integer matrix with each cell having a maximum possible value of +48, a minimum of -48, and a chance-given expectation of 0. An overall bias toward positive values would mean that Alg A creates higher scores than B, negative values would suggest that B creates the higher scores, and a tendency toward 0 would suggest that the algorithms make no difference.
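For a framing like this, one conventional choice (my suggestion, not something established in the thread) is the Wilcoxon signed-rank test: it takes the per-judge differentials for one criterion and tests the null hypothesis that they are symmetric about zero, without assuming any particular distribution shape. A sketch with made-up differentials:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical D values: one differential per judge for one criterion,
# each in [-48, +48] (total A-score minus total B-score over 12 images).
diffs = np.array([5, -2, 7, 3, 1, 4, -1, 6, 2, 3])

# Null hypothesis: the differentials are symmetric about zero,
# i.e. neither algorithm tends to score higher.
stat, p = wilcoxon(diffs)
print(stat, p)
```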

So does my tortured logic make sense? Is this a sane way to frame my data? And if so, what's the correct significance test for THIS setup?
Actually, thinking about this some more... maybe things aren't as complicated as they first seem. Do you think it would be valid to combine (or perhaps average) the scores across all criteria? I would suggest organising your data into a 240-row array such as...

U:user {1-10} ; P=photo {1-24}; A=algorithm {1/2} ; C1-C5=judging criteria {1-5} ; T=total score

U | P | A | C1 | C2 | C3 | C4 | C5 | T
1 | 1 | 2 | 03 | 02 | 04 | 02 | 05 | 16
1 | 2 | 1 | 01 | 03 | 03 | 01 | 03 | 11
1 | 3 | 2 | 03 | 04 | 05 | 05 | 02 | 19
1 | 24 | ...
2 | 1

Then you could run a Mann-Whitney U test with the null hypothesis that the distributions of "total score" are equal between algorithms 1 and 2. (You could also run MW-U tests for each of the 5 criteria individually.) And you could run a Kruskal-Wallis test to see whether total scores differ across users or across photos.
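The long-format analysis above can be sketched as follows. The data here are randomly generated placeholders in the 240-row layout described (the alternating algorithm assignment is purely for illustration; in the real experiment the ordering was randomised):

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(2)

# Hypothetical long-format data: 10 users x 24 photos = 240 rows.
users = np.repeat(np.arange(1, 11), 24)
algorithm = np.tile(np.array([1, 2] * 12), 10)   # 12 photos per algorithm
totals = rng.integers(5, 26, size=240)           # T = sum of five 1-5 scores

# Mann-Whitney U: are total-score distributions equal across algorithms?
u_stat, u_p = mannwhitneyu(totals[algorithm == 1], totals[algorithm == 2])

# Kruskal-Wallis: do total scores differ across the 10 users?
k_stat, k_p = kruskal(*[totals[users == u] for u in np.unique(users)])

print(u_p, k_p)
```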

That's how I'd think about starting, anyway.
That looks reasonable. Being non-parametric, MW-U avoids the issues I was concerned about in my previous post regarding the appropriateness of the assumed distribution. Although I'm still curious whether there's merit or precedent for my -48 to +48 matrix approach; it would give me a measure of how good or bad my instincts are for all this stats stuff. :)

As for aggregating the criteria scores into an overall score, that will be part of my research analysis: to examine the data and determine whether the criteria correlate strongly enough to justify merging them into a single metric. But at the very least, a criterion-by-criterion analysis will give me some idea of how each of the criteria behaves across the two search algorithms.

Thanks for your thoughtful input. Now I just have to go off and write up my plan.