Comparing a test vs. a gold standard in a 4x4 matrix

Hello everybody,

I am trying to figure out how to measure the specificity and recall (sensitivity) of a test I'm working on. The test classifies each item into 1 of 4 categories. Three of them can be ordered (let's say V1<V2<V3), but the 4th is totally independent of the others; it's a kind of marginal category. Therefore I assume that the four categories cannot be sorted as a whole, and I treat the outcome as a "qualitative nominal variable" (= categorical variable).

More specifically, the test is meant to classify words from a given chemistry text into one of these categories:

-"technical chemistry word" (V1),
-"semi-technical chemistry word" (V2),
-"non-technical/no chemistry word" (V3), and
-"special word" (V4).

The last category would include words that are ambiguous and thus require further research to find out which of the other 3 categories they belong to (for example, acronyms, which may be a "technical chemistry word", or a "no chemistry word" if they belong to another area such as medicine). As you can see, it is possible to arrange V1, V2 and V3 according to their degree of specialization, but this is not possible with V4 because it is kind of "marginal". Here comes my first question:

Q1: Is there a better design (e.g. arranging my test in terms of ordinal variables), or should I keep the design described above?

As I mentioned, my plan is to study the specificity and sensitivity of the test (machine-based test). To do so I want to compare it with a gold standard test (human-based test). For that purpose I picked 15 words from each category (V1-V4, according to the machine-based test)*** and compiled a 60-word list that was later classified according to the human-based test. Therefore, if I'm not mistaken, I should be able to calculate the specificity and sensitivity of the machine-based test.

*** Does it matter if I compile the 60-word list from the machine-based classification? Or should I rather compile it from the human-based classification?

Here come my next questions:

Q2: How can I measure the specificity and sensitivity/recall of the machine-based test correctly? I figure I should build a 4x4 matrix (human-based class vs machine-based class) and, for each category, compare the true positives (TP) against TP + FN, but I haven't found anything similar on the Internet (only for 2x2 matrices).
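
To make Q2 concrete, here is a minimal sketch of what I think the calculation would look like. It assumes the usual one-vs-rest reduction: each category in turn is treated as "positive" and the other three are pooled as "negative", which gives one 2x2 table per category. The counts in the matrix are made up for illustration (they are not my real results), and the function name is my own.

```python
# Rows = human-based (gold standard) class, columns = machine-based class.
# HYPOTHETICAL counts for illustration only. Each column sums to 15 because
# 15 words were picked per machine-assigned category (60 words in total).
LABELS = ["V1", "V2", "V3", "V4"]
confusion = [
    [13, 4, 0, 3],   # human says V1
    [2, 8, 1, 4],    # human says V2
    [0, 2, 14, 3],   # human says V3
    [0, 1, 0, 5],    # human says V4
]


def one_vs_rest_rates(matrix, k):
    """Return (sensitivity, specificity) for class index k vs all other classes."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                        # gold is k, machine said something else
    fp = sum(matrix[i][k] for i in range(n)) - tp   # gold is not k, machine said k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


for k, name in enumerate(LABELS):
    sens, spec = one_vs_rest_rates(confusion, k)
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

One thing I am unsure about: since the 60 words were sampled by machine-assigned category, the row totals here do not reflect how common each category really is, so I don't know whether these per-category values can be taken at face value.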

Q3: Is there any way to weight the error according to how "far" the wrong category is? (For example, if the human-based test classified a word as V1, the machine-based test classifying it as V3 should count as a worse error than classifying it as V2, which is closer in terms of degree of specialization.)
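
What I imagine for Q3 is something like a penalty table applied to the 4x4 matrix. The sketch below is only my guess at how that could work: the penalties (ordinal distance between V1, V2 and V3, and a flat 1.0 whenever V4 is involved) are an arbitrary assumption of mine, and the counts are again invented for illustration.

```python
# Rows = human-based (gold standard) class, columns = machine-based class.
# HYPOTHETICAL counts for illustration only.
confusion = [
    [13, 4, 0, 3],   # human says V1
    [2, 8, 1, 4],    # human says V2
    [0, 2, 14, 3],   # human says V3
    [0, 1, 0, 5],    # human says V4
]

V4_INDEX = 3  # V4 has no position on the V1 < V2 < V3 scale


def cost(gold, pred):
    """ASSUMED penalty: 0 if correct, ordinal distance within V1..V3,
    and a flat 1.0 if either side is V4."""
    if gold == pred:
        return 0.0
    if V4_INDEX in (gold, pred):
        return 1.0
    return float(abs(gold - pred))


def mean_weighted_error(matrix):
    """Average penalty per classified word (0.0 means perfect agreement)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    penalty = sum(matrix[g][p] * cost(g, p) for g in range(n) for p in range(n))
    return penalty / total


print(f"mean weighted error: {mean_weighted_error(confusion):.3f}")
```

With this scheme a V1 word labelled V3 costs twice as much as a V1 word labelled V2. I don't know whether this kind of ad hoc weighting is acceptable or whether there is a standard statistic that does the same job.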

Q4: After reviewing the results from the 60-word list by eye, I can see that the machine-based test seems more specific for some categories than for others. If this is confirmed using some formula, would it be correct to conclude that the test is very specific when classifying words as "technical words" and "non-technical/no chemistry words", but is not specific for the other two categories? Or should I rather say that the categories "technical" and "non-technical" are very specific, unlike the other 2 categories? (Or do both sentences mean the same thing, so that both are equally correct?)

Q5: I picked 15 words from each category based on a similar study that I found in the literature, but is there a formula to decide how many words I should include in the whole test and in each category to get reliable results?
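
The only formula I am aware of for Q5 is the normal-approximation sample size for estimating a single proportion with a given margin of error, n = z^2 * p * (1 - p) / d^2. I am not sure it is the right tool here, so take the sketch below as an assumption on my part. The function names are mine, and `expected_p` would be a guess at the sensitivity or specificity I expect to find.

```python
import math


def n_for_proportion(expected_p, margin, z=1.96):
    """Words needed to estimate a proportion within +/- margin.

    Normal approximation; z=1.96 corresponds to roughly 95% confidence.
    """
    return math.ceil(z ** 2 * expected_p * (1 - expected_p) / margin ** 2)


def margin_for_n(expected_p, n, z=1.96):
    """Approximate half-width of the confidence interval for a given n."""
    return z * math.sqrt(expected_p * (1 - expected_p) / n)


# Worst case (p = 0.5) with a +/- 10 percentage point margin.
print(n_for_proportion(0.5, 0.10))

# Approximate margin I get with 15 words in a category, worst case.
print(round(margin_for_n(0.5, 15), 2))
```

If I understand it correctly, for sensitivity the n in this formula would be the number of words that the human-based test puts in a category, not the number the machine-based test puts there, which is another reason I am unsure about how I built the list.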

I know my explanation may be confusing, but I'm quite lost at the moment and I would really appreciate your help. Thank you very much in advance.