This might seem quite trivial to a stats whizz but I just can't wrap my head around it.

So right now, I'm working on a project to validate the accuracy of Optical Character Recognition. We have 50 samples that we've taken from the recognized material.

We do not have the ground truth, that is, the original text as a computer string so we can't compare the two by a computerized Levenshtein-distance calculation. So we've compared the scan and the recognition manually.

We do have an estimation of how many characters there are in the original.

Now we have the problem of deciding which universe to measure against. The thing with OCR is that it tends to add and delete characters. Therefore, the universe size of the original isn't the same as the universe size of the recognized text.

So if the word jam in the original has been recognized as ../ja*n. we've got the inserted ../, the m being recognized as *n (which we count as an insertion and a substitution), and an inserted dot ('.') at the end. If we set the universe to the recognized text (out of the 8 recognized characters, 6 were wrong), we'd have an error rate of 6/8. But if we'd rather set the universe to the original, the number of errors would still be 6 but the universe would only be 3. So then we'd have an error rate of 6/3, which is a lot worse.
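To make the arithmetic concrete, here is a rough Python sketch of the same example. It assumes the manual error count corresponds to a plain Levenshtein distance (insertions, deletions and substitutions each cost 1); the function and variable names are just my own illustration, and in our real project the counts come from manual comparison rather than from strings.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


original = "jam"          # what the scan actually says
recognized = "../ja*n."   # what the OCR produced

errors = levenshtein(original, recognized)
rate_vs_original = errors / len(original)      # universe = original text
rate_vs_recognized = errors / len(recognized)  # universe = recognized text

print(errors, rate_vs_original, rate_vs_recognized)
```

The error count is the same in both cases; only the denominator changes, which is exactly the choice I'm unsure about. Note that dividing by the original length can give a rate above 1 when the OCR inserts a lot of characters.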

Is there any way of coping with this problem? Or is it only a matter of choosing the comparison that best suits our needs?

Is this a machine learning problem? I personally was lost in your description. I think you need to lay out the problem with a very basic description, without jargon.

