I have 3 different texts, and for each of them I have the frequencies of 25 letters appearing in that text (Text A, letter A, freq = 5013; Text A, letter B, freq = 1462; ...). The frequency distributions differ between the texts, and none of them is normal.
I would like to know whether there is any significant difference between the texts in terms of how the letters are ranked by frequency. For example: letter A is the most frequent letter in texts A and B, but only the 5th most frequent in text C; letter E is the 2nd most frequent in text A, the 3rd most frequent in text B (almost no difference), and only the 9th most frequent in text C; etc.
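To make the ranking concrete, here is a minimal sketch in Python. Only the two Text A values for letters A and B come from my data; every other number is made up for illustration, the example uses 5 letters instead of 25, and `rank_letters` is just a hypothetical helper name. Ties are broken alphabetically, which is an arbitrary choice.

```python
def rank_letters(freq):
    """Return {letter: rank}, where rank 1 is the most frequent letter.

    Ties are broken alphabetically (an arbitrary assumption).
    """
    ordered = sorted(freq, key=lambda letter: (-freq[letter], letter))
    return {letter: position for position, letter in enumerate(ordered, start=1)}


# Made-up frequencies, except A=5013 and B=1462 in text A.
texts = {
    "A": {"A": 5013, "B": 1462, "E": 4700, "O": 3100, "T": 3900},
    "B": {"A": 4890, "B": 1500, "E": 3600, "O": 3000, "T": 4100},
    "C": {"A": 1900, "B": 2200, "E": 2400, "O": 2300, "T": 2100},
}

ranks = {name: rank_letters(freq) for name, freq in texts.items()}

for name, letter_ranks in ranks.items():
    print(name, letter_ranks)
```

With these invented numbers, letter A is ranked first in texts A and B but last in text C, which is the kind of difference I want to test for.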
I performed a Kruskal-Wallis test by ranks, and you can see the results of the test on some random texts below. Two things concern me here:
- The Kruskal-Wallis test calculates the p-value from the mean ranks of each group, so which item (in this case, which letter) holds which rank does not affect the result,
- One of the assumptions of the test is that the distributions have the same shape (variability), which is not the case here.
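The first concern can be sketched in a few lines of Python. This is a hand-rolled H statistic without the tie correction, not the output of any particular statistics package, and the frequencies are the same kind of made-up numbers (only 5013 and 1462 come from my data). The function names are hypothetical. The point is that reassigning the frequencies of one text to different letters leaves H unchanged.

```python
import random


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Frequencies listed in letter order A, B, E, O, T (made-up values).
text_a = [5013, 1462, 4700, 3100, 3900]
text_b = [4890, 1500, 3600, 3000, 4100]
text_c = [1900, 2200, 2400, 2300, 2100]

# Reassign text C's frequencies to different letters.
shuffled_c = text_c[:]
random.Random(0).shuffle(shuffled_c)

h_original = kruskal_wallis_h([text_a, text_b, text_c])
h_shuffled = kruskal_wallis_h([text_a, text_b, shuffled_c])
print(h_original, h_shuffled)
```

Both calls give the same H, because the test only sees each text as an unordered sample of frequencies and never looks at which letter a frequency belongs to.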
Either way, the results on the sample data showed no significant difference, even though texts A and B were taken from a book, text C was computer-generated, and the ranks clearly differed (for example, for letter A the ranks and frequencies are almost the same in texts A and B, but completely different in text C).
Am I even using the right test for my hypothesis? If not, which test should I use?