
Thread: Universe size varies between the original and the material to compare

  1. #1

    This might seem quite trivial to a stats whizz but I just can't wrap my head around it.

    Right now I'm working on a project to validate the accuracy of Optical Character Recognition (OCR). We have 50 samples that we've taken from the recognized material.

    We do not have the ground truth, that is, the original text as a computer string, so we can't compare the two with a computerized Levenshtein-distance calculation. Instead, we've compared the scan and the recognized text manually.

    We do have an estimate of how many characters there are in the original.

    Now we have the problem of deciding which universe (that is, which denominator) to use. The thing with OCR is that it tends to add and delete characters. Therefore, the universe size of the original isn't the same as the universe size of the recognized text.

    So if the word "jam" in the original has been recognized as "../ja*n.", we've got the inserted "../", the "m" being recognized as "*n" (which we count as one insertion and one substitution), and an inserted dot (".") at the end: 6 errors in total. If we set the universe to the recognized text (out of the 8 recognized characters, 6 were wrong), we'd have an error rate of 6/8. But if we instead set the universe to the original, the number of errors would still be 6 while the universe would only be 3. So then we'd have an error rate of 6/3, which is a lot worse.

    Is there any way of coping with this problem? Or is it only a matter of choosing the comparison that best suits our needs?
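
To make the arithmetic in the "jam" example concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the original string is available (which is not the case in the project described above); the helper names `levenshtein` and `error_rates` are hypothetical and not taken from any library. With manually counted errors, only `error_rates` would be needed.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn ref (original) into hyp (recognized)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the original
                curr[j - 1] + 1,     # insert h into the recognized text
                prev[j - 1] + cost,  # match, or substitute r with h
            ))
        prev = curr
    return prev[-1]


def error_rates(errors: int, n_original: int, n_recognized: int) -> dict:
    """Express the same error count against the two candidate universes."""
    return {
        "per_original_char": errors / n_original,
        "per_recognized_char": errors / n_recognized,
    }


original = "jam"
recognized = "../ja*n."

errors = levenshtein(original, recognized)
rates = error_rates(errors, len(original), len(recognized))

print(errors)  # number of edit operations
print(rates)   # the same count under both denominators
```

Dividing by the length of the original is the convention usually called the character error rate; because insertions count as errors but do not enlarge the denominator, that figure can exceed 1. Whichever denominator is chosen, the error count itself stays the same, so reporting the count alongside both lengths keeps the result unambiguous.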

  2. #2
    hlsmith

    Re: Universe size varies between the original and the material to compare


    Is this a machine learning problem? Personally, I was lost in your description. I think you need to lay out the problem in very basic terms, without jargon.
