Calculate OCR accuracy

Question

I need to calculate OCR character accuracy.

Sample ground-truth value:

Non sinking ship is friendship

Sample OCR output:

non singing ship is finedship

Areas of concern are:

missed characters
extra characters
misplaced characters

Character accuracy is defined as the number of ground-truth characters that appear in the OCR output at the correct position, divided by the total number of ground-truth characters.

I need a Python script to compute this accuracy. My initial implementation is as follows:

import re

ground_value = "Non sinking ship is friendship"
ocr_value = "non singing ship is finedship"

# Remove all whitespace from the ground-truth and OCR strings
ground_value_characters = re.sub(r'\s+', '', ground_value)
ocr_value_characters = re.sub(r'\s+', '', ocr_value)

total_characters = float(len(ground_value_characters))

def find_matching_characters(ground, ocr):
    # Counts ground characters found anywhere in ocr; position is ignored
    total = 0
    for char in ground:
        if char in ocr:
            total = total + 1
            ocr = ocr.replace(char, '', 1)  # consume the matched character
    return total

found_characters = find_matching_characters(ground_value_characters,
                                            ocr_value_characters)

accuracy = found_characters / total_characters

I couldn't get what I was hoping for. Any help would be appreciated.

This has nothing to do with floating-accuracy. – Scott Hunter Aug 22 '20 at 02:22

score 4 · Accepted Answer · answered Aug 22 '20 at 03:13

If you're not married to that precise definition (or if you are and want to delve into the details of python-Levenshtein), then this is how I would solve this:

pip install python-Levenshtein

from Levenshtein import distance

ground_value = "Non sinking ship is friendship"
ocr_value = "non singing ship is finedship"

print(distance(ground_value, ocr_value))

The same library will give you Hamming distance, opcodes, and similar functions in a relatively high-performance way.

None of this will be useful if, e.g., this is a homework assignment or your goal is to learn how to implement string algorithms, but if you just need a good metric, this is what I would use.
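If you want an accuracy figure rather than a raw edit count, one common convention (an assumption on my part, not the definition given in the question) is 1 - distance / len(ground_value), floored at 0. The sketch below uses a small pure-Python stand-in for the library's distance so it runs without installing anything; the names levenshtein and character_accuracy are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1.

    Illustrative stand-in for Levenshtein.distance, not the library itself.
    """
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground: str, ocr: str) -> float:
    """Assumed convention: 1 - edits / len(ground), never below 0."""
    if not ground:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(ground, ocr) / len(ground))


ground_value = "Non sinking ship is friendship"
ocr_value = "non singing ship is finedship"

print(character_accuracy(ground_value, ocr_value))
```

Lowercase both strings first if case differences such as N vs n should not count as errors.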

score 0 · Answer 2 · answered Jan 18 '22 at 23:16

You can use SequenceMatcher from the standard-library difflib module. It gives what you want:

from difflib import SequenceMatcher

ground_value = "Non sinking ship is friendship"
ocr_value = "non singing ship is finedship"

# Opcodes describe how to turn ocr_value (a) into ground_value (b)
sm = SequenceMatcher(None, ocr_value, ground_value)
true_positive_char_num = 0
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'equal':
        # j1:j2 is the matching span in ground_value
        true_positive_char_num += j2 - j1

print(f'accuracy = {true_positive_char_num/len(ground_value)}')

accuracy = 0.8666666666666667

Here we first create a SequenceMatcher object and call its get_opcodes() method, which describes how to turn the prediction into the ground-truth value. To count the correctly recognized characters, we only add up the spans tagged 'equal'.

See https://docs.python.org/3/library/difflib.html#sequencematcher-objects for more details.
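The same opcodes can also give a rough breakdown of the areas of concern listed in the question. This is only a sketch under an assumed reading of the tags: 'insert' spans are counted as missed characters, 'delete' spans as extra characters, and 'replace' spans as wrong characters. ocr_error_breakdown is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def ocr_error_breakdown(ground: str, ocr: str) -> dict:
    """Count matched, missed, extra and replaced characters (assumed mapping)."""
    counts = {"equal": 0, "missed": 0, "extra": 0, "replaced": 0}
    # a = OCR output, b = ground truth, so opcodes turn ocr into ground
    sm = SequenceMatcher(None, ocr, ground, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            counts["equal"] += j2 - j1
        elif tag == "insert":    # present in ground, absent from ocr
            counts["missed"] += j2 - j1
        elif tag == "delete":    # present in ocr, absent from ground
            counts["extra"] += i2 - i1
        elif tag == "replace":   # counted on the ground-truth side
            counts["replaced"] += j2 - j1
    return counts


ground_value = "Non sinking ship is friendship"
ocr_value = "non singing ship is finedship"

breakdown = ocr_error_breakdown(ground_value, ocr_value)
print(breakdown)
print(breakdown["equal"] / len(ground_value))
```

Because 'replace' spans are counted on the ground-truth side, the equal, missed and replaced counts always add up to len(ground_value).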
