Messy methodology design

Hello everybody,

Let me try to sum up the problems I'm facing in a study I'm working on; I'm quite new to statistics and things are getting a bit messy for me. Hopefully you can help me out.

I want to study the specificity and sensitivity of a test that aims to classify a list of words. The test is based on two dictionaries (one of specialized terms, the other of general language). There are four possible categories for a given word, depending on whether the word has an entry in each dictionary:

A) Specialized term
B) Non-specialized term
C) Semi-specialized term
D) "Special term"

In the literature there are many tests that work similarly to this analysis, but as far as I know there is no gold standard test. However, expert-based classifications can be considered a gold standard. Given this, I have designed a methodology to compare the dictionary-based classification with a "good gold standard" based on the opinion of two experts (I say "good" as opposed to a classification based on a single expert's opinion, which would be a "bad gold standard"). It goes as follows:

1. I get an initial list of 500 words.
2. I apply the "dictionary-based test" to the list.
3. Because classifying all 500 words in the following steps is hardly manageable, I pick 15 words from each of the four categories (I do this to avoid obtaining very asymmetrical distributions). This gives me a 60-word list.
4. I use the 60-word list as the basis for the "expert-based classification". Two independent experts classify each word into one of the four categories mentioned above (with specific instructions for each category).
5. I compute Cohen's kappa to assess inter-rater reliability (agreement is good). Then the two raters jointly assign a final category to the terms where discrepancies were found. This yields the desired "good gold standard" list.
6. I compare the dictionary-based classification with the gold standard to estimate the test's sensitivity and specificity.
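If it helps to make steps 3 and 5 concrete, here is a minimal Python sketch of what I mean (the helper names, the fixed random seed, and the placeholder data are just illustrative, not from any particular library):

```python
import random
from collections import Counter

CATEGORIES = ["A", "B", "C", "D"]  # the four classification categories

def stratified_sample(words_by_category, per_category=15, seed=0):
    """Step 3: randomly pick `per_category` words from each
    dictionary-based category so all categories are equally represented."""
    rng = random.Random(seed)
    picked = []
    for cat in CATEGORIES:
        picked += rng.sample(words_by_category[cat], per_category)
    return picked

def cohens_kappa(ratings1, ratings2):
    """Step 5: Cohen's kappa for two raters classifying the same items.
    kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings1)
    observed = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    counts1, counts2 = Counter(ratings1), Counter(ratings2)
    chance = sum(counts1[c] * counts2[c] for c in CATEGORIES) / n ** 2
    return (observed - chance) / (1 - chance)
```

Identical ratings give kappa = 1, and ratings that agree only as often as chance would predict give kappa = 0.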

As you can see, this is a bit of a mess (at least to me), and I have several concerns at this point:

1. The most important one: is this technically correct? Or is there some statistical/research rule I'm unaware of that makes this invalid?

2. I chose 15 words from each category because I saw a similar study that used the same number of words. However, is there any way to establish the exact number of words I should pick for statistical consistency?

3. Is it correct to use the 60-word list obtained by taking 15 words per category from the dictionary-based analysis? As I mentioned, I did this rather than randomly drawing 60 words from the 500-word list in order to get a balanced number of words from each category (i.e., to avoid getting, say, 30 words from the "specialized term" category and 2 from the "special term" category). My concern is that it seems a bit odd that the sample is selected using the very test I want to evaluate.
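On question 2, one rough way I have seen to reason about the number of words per category is the precision of the resulting sensitivity/specificity estimates: with n items, an approximate (Wald) 95% confidence interval for an estimated proportion p is p ± 1.96·sqrt(p(1−p)/n). A small sketch, where the function names are just illustrative:

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Approximate 95% CI half-width (Wald interval) for a proportion
    p estimated from n items, e.g. a per-category sensitivity."""
    return z * math.sqrt(p * (1 - p) / n)

def n_for_halfwidth(p, halfwidth, z=1.96):
    """Smallest n whose Wald CI half-width is at most `halfwidth`."""
    return math.ceil(z ** 2 * p * (1 - p) / halfwidth ** 2)
```

For example, with 15 words per category and a true proportion around 0.8, the half-width is about ±0.20, so the estimate is quite imprecise; shrinking it to ±0.10 would need roughly 62 words per category (or up to 97 in the worst case, p = 0.5). I don't know if this is the right way to think about it, though.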

Assuming that the previous steps were correct:

4. What would be the best method to study the specificity and sensitivity of my test? In other words, can I build four different cross tables (one for each category) so that I get a sensitivity/specificity value for each of them? Or should I rather build one single table to study the overall sensitivity and specificity of the test as a whole?
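To show what I mean by the four tables: one approach I have seen for multi-class tests is a one-vs-rest 2×2 table per category, where that category counts as "positive" and the other three as "negative". A sketch, assuming the gold-standard and test labels are parallel lists and every category appears at least once on each side (the function names are mine, not from a library):

```python
def sensitivity_specificity(gold, test, category):
    """One-vs-rest 2x2 table for one category: items of `category`
    are 'positive', all other items are 'negative'."""
    tp = sum(g == category and t == category for g, t in zip(gold, test))
    fn = sum(g == category and t != category for g, t in zip(gold, test))
    fp = sum(g != category and t == category for g, t in zip(gold, test))
    tn = sum(g != category and t != category for g, t in zip(gold, test))
    sensitivity = tp / (tp + fn)  # of true members, how many were found
    specificity = tn / (tn + fp)  # of non-members, how many were excluded
    return sensitivity, specificity

def accuracy(gold, test):
    """One overall number for the whole 4x4 table: the proportion of
    words the test classifies the same way as the gold standard."""
    return sum(g == t for g, t in zip(gold, test)) / len(gold)
```

My (possibly naive) understanding is that sensitivity/specificity are inherently per-category in a four-class setting, so the "single table" version would reduce to something like overall accuracy.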

I hope I have explained myself properly, since English is not my first language and statistics is not a field I'm familiar with. If you need further information, please let me know. Thank you very much in advance.


TS Contributor
I would give the word list to the experts at least twice (after a suitable period, so that they forget their original decisions) to get an idea of the consistency of each expert's decisions. There is no point in trying to get agreement between the experts if they are inconsistent in the first place.

Thanks for your help, rogojel. I hadn't thought about the need to test intra-observer consistency, so I will do as you suggest. In your opinion, is the rest of the design acceptable?