Comparing frequencies

I have two columns of word frequencies. Column A has 49 items (each with its relative frequency value), and column B has 40 items. The values look like 12.83, 11.02, etc. They are calculated as frequency relative to text length (so the "frequencies" are not raw counts so much as normalized values).

Within each column the items are sorted in descending order: the first item has the highest value, the second item the second highest, and so on.
Plotted against rank, the values in each column therefore form a curve.
The first curve is much steeper than the second curve ... but how do I show that the steepness of the first curve is statistically significantly different from the steepness of the second curve?

Many thanks for any help!


TS Contributor
You have 89 cases (items), each has 3 characteristics: which group it belongs to (A vs. B), its frequency value, and its rank within its group.
You could perform a linear regression, regressing frequency value on rank, group and the rank*group interaction. The interaction will tell you whether the steepness of the regression line is different between groups A and B.
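
A minimal sketch of that idea in plain Python, in case it helps. Fitting one straight line per group gives the same slopes as the single model with a rank*group interaction, and the interaction coefficient is simply the difference between the two slopes. The function names (`fit_line`, `slope_difference`) and the sample numbers are made up for illustration. The code returns the t statistic and its degrees of freedom only; the p-value would come from a t table or a statistics package.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x.

    Returns (intercept, slope, residual sum of squares, Sxx).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, sse, sxx

def slope_difference(values_a, values_b):
    """Compare the slopes of value-on-rank lines for two groups.

    Assumes each list is already sorted from highest to lowest value,
    so rank is just the position (1, 2, 3, ...).
    Returns (slope_a - slope_b, standard error, t statistic, df).
    """
    ranks_a = list(range(1, len(values_a) + 1))
    ranks_b = list(range(1, len(values_b) + 1))
    _, b_a, sse_a, sxx_a = fit_line(ranks_a, values_a)
    _, b_b, sse_b, sxx_b = fit_line(ranks_b, values_b)
    # Pooled residual variance, as in the single interaction model
    # (4 estimated parameters: two intercepts, two slopes).
    df = len(values_a) + len(values_b) - 4
    s2 = (sse_a + sse_b) / df
    se = math.sqrt(s2 * (1.0 / sxx_a + 1.0 / sxx_b))
    diff = b_a - b_b
    t = diff / se if se > 0 else math.copysign(math.inf, diff)
    return diff, se, t, df

# Made-up illustration data, not the poster's values.
column_a = [12.0, 9.0, 7.5, 5.0, 3.0]
column_b = [6.0, 5.5, 5.0, 4.0, 3.5]
diff, se, t, df = slope_difference(column_a, column_b)
print(f"slope difference = {diff:.3f}, SE = {se:.3f}, t = {t:.2f}, df = {df}")
```

A negative difference here means the line for column A falls faster with rank than the line for column B.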

With kind regards

Thanks for the reply ... my issue though is that it's clearly a curve ... doesn't that mean that a linear regression is not an option?


TS Contributor
So you could search for a function which resembles the shape you observed. You could use something like rank^2 or log(rank) as predictors. Or maybe it calls for a nonlinear regression, with a term like e^rank.
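
One rough way to try out candidate shapes, sketched in plain Python: regress the values on each transformed version of rank and compare R^2. The names `r_squared` and `compare_shapes` are hypothetical, each transform is used here as a single predictor (an assumption made for simplicity), and the example values are synthetic.

```python
import math

def r_squared(x, y):
    """R^2 of a simple least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def compare_shapes(values):
    """Fit value on rank, rank^2 and log(rank); return R^2 for each.

    Assumes `values` is sorted from highest to lowest, so rank is
    the position in the list starting at 1.
    """
    ranks = range(1, len(values) + 1)
    candidates = {
        "rank": [float(r) for r in ranks],
        "rank^2": [float(r * r) for r in ranks],
        "log(rank)": [math.log(r) for r in ranks],
    }
    return {name: r_squared(x, values) for name, x in candidates.items()}

# Synthetic values that follow a log(rank) shape exactly.
example = [10.0 - 3.0 * math.log(r) for r in range(1, 11)]
for name, score in compare_shapes(example).items():
    print(f"{name:10s} R^2 = {score:.4f}")
```

Once a transform fits reasonably well in both columns, the rank*group interaction can be tested on the transformed predictor in the same way as for raw rank.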

Keep in mind that all this is little more than playing around for fun: your hypothesis and your model are being set up post hoc and tailored to the peculiarities of your sample data, so the p-values will not be seriously interpretable.

With kind regards

Agreed - it's post hoc. But it would be a method I would then be able to use for future analyses.

So, let's say I have a function that resembles my shape. Indeed, I would have two of them. This brings me back to the question: how do I compare them?

As of now, this is what I do.
I compare the highest value for column A with the highest value of column B - and note the "winner."
I repeat this for each point, noting the winner each time.
I then use Fisher's Exact to assess the number of winners for Column X (given the number of chances Column X had to be a winner).
And this works just fine when there are the same number of items in each column.
Yes, this approach doesn't take into account the size of the difference between points, but in my data sets, the first point difference will always be extremely large, the second much less so, and after about five points, the differences are very small indeed. So if I do take into account the "size of the difference" then I'm giving too much power to the first point. For this reason, I feel the Fisher's Exact approach has some merit. Some "conservative" merit (and obviously some weaknesses).
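
For concreteness, here is roughly what that procedure looks like in plain Python. Several things are assumptions on my part: the 2x2 table is laid out as wins vs. non-wins for each column, ties count as non-wins for both, and when the columns differ in length only the ranks present in both are compared. The function names and sample numbers are made up for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def pointwise_winners(values_a, values_b):
    """Compare the two sorted columns rank by rank and test the win counts.

    Returns (wins_a, wins_b, p_value). Only the first min(len) ranks
    are compared (an assumption for columns of unequal length).
    """
    n = min(len(values_a), len(values_b))
    wins_a = sum(1 for i in range(n) if values_a[i] > values_b[i])
    wins_b = sum(1 for i in range(n) if values_b[i] > values_a[i])
    p = fisher_exact_two_sided(wins_a, n - wins_a, wins_b, n - wins_b)
    return wins_a, wins_b, p

# Made-up illustration data, not my real values.
column_a = [12.0, 9.0, 7.0]
column_b = [6.0, 5.5, 5.0]
print(pointwise_winners(column_a, column_b))
```

Truncating to the shorter column is only one way to handle unequal lengths, and it discards the tail of the longer column.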

I'd appreciate your continued thoughts ...

And I really do appreciate your input.