
I have run EasyOCR in Python over a large number of black and white images of the text on soldered components, with the goal of collecting the writing on each of them. The results are mostly good, but there are some inconsistent results that I would like to filter out.

I have used multiple pictures of the same component and they are all labeled, so my DataFrame looks like this.

ID OCR Guesses
component 1 [RNGSE, BN65E, 8NGse, BN65E, BN65E]
component 2 [DFEAW, DFEAW, DF3AW, DFEAW]
component 3 [1002, 1002, l002, 1002]

As you can see, most of the letters are identified correctly, but sometimes one of the letters is identified as a number or vice versa. Is there an easy method to "take the average" of these strings to find the most likely correct OCR result? The result I am aiming for would look like the following:

ID OCR Guesses Correct
component 1 [RNGSE, BN65E, 8NGse, BN65E, BN65E] BNGSE
component 2 [DFEAW, DFEAW, DF3AW, DFEAW] DFEAW
component 3 [1002, 1002, l002, 1002] 1002

It would be great if there was a module that takes into account common confusing characters such as 1 and l, 6 and G, B and R etc.

Any help is appreciated. Thanks!

bonjoery
  • Welcome to SO. 2 Qs: 1. in col `OCR Guesses`, are your values just strings, e.g. `[RNGSE, BN65E, 8NGse, BN65E, BN65E]`? or are they actual lists *with* strings, e.g. `['RNGSE', 'BN65E', '8NGse', 'BN65E', 'BN65E']`? 2. How do you determine whether you are looking for an alphabetical, numerical, or alphanumerical sequence (string). E.g. with the first "list", `BN65E` is more likely than `BNGSE` and with your last example, one could imagine a scenario in which `looz` would be the most likely guess, rather than `1002`. – ouroboros1 Aug 20 '22 at 15:45
  • Thank you for your response! I have stored the OCR guesses in a list for convenience. – bonjoery Aug 20 '22 at 16:56
  • The data is a complete mix of alphabetical, numerical or alphanumeric strings, which makes it challenging. However, I am not aiming for fully automated perfection, as there are some strings that were quite badly misinterpreted. I mainly want to correct the small mistakes, and approximate the more indecisive ones to correct later. – bonjoery Aug 20 '22 at 17:04

2 Answers


You can compute the Levenshtein distance (or edit distance) for each pair of guesses, and then select the guess that is closest to all the others.

There are many libraries implementing Levenshtein distance; for this example I'll use editdistance (there may be better implementations with more parameters to tune, this is just one I found).

import numpy as np
import editdistance

guesses = ['foo', 'foo 2', 'Foo 2']
pair_distances = np.zeros((len(guesses), len(guesses)))

# Fill the matrix of pairwise edit distances.
for i, gi in enumerate(guesses):
    for j, gj in enumerate(guesses):
        pair_distances[i, j] = editdistance.eval(gi, gj)

# The best guess is the one with the smallest total distance to the others.
sum_distances = np.sum(pair_distances, axis=0)
idx_min = np.argmin(sum_distances)
best_guess = guesses[idx_min]

Note that np.argmin breaks ties by keeping the first match, and the code above may produce situations where multiple candidates share the best distance. You can apply some other rule to break ties, such as comparing guesses case-insensitively (i.e. the same code, but converting all guesses to lower case before computing the distances). However, this may also lead to ties.

That said, this code snippet should work, but it is not very efficient: every distance is computed twice, since d(i, j) == d(j, i), and d(i, i) is always 0, so it does not need to be computed at all. I think it is clear enough to explain my point, though.
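For completeness, here is a sketch of the more efficient variant that only walks the upper triangle of the distance matrix. To keep it runnable without extra packages it uses a small pure-Python edit distance as a stand-in for editdistance.eval (same result, just slower than the C implementation):

```python
import itertools

def levenshtein(a, b):
    # Plain dynamic-programming edit distance; a stdlib-only stand-in
    # for editdistance.eval so the snippet runs without extra packages.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def best_guess(guesses):
    n = len(guesses)
    sums = [0] * n
    # Upper triangle only: d(i, j) == d(j, i) and d(i, i) == 0.
    for i, j in itertools.combinations(range(n), 2):
        d = levenshtein(guesses[i], guesses[j])
        sums[i] += d
        sums[j] += d
    return guesses[sums.index(min(sums))]
```

Ties are still broken by keeping the first guess with the minimal total distance, just like np.argmin.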

Rodrigo Laguna

One simple way would be to count the occurrences of each character at every position and take the most frequent character each time.

For example:

pred_list = ["DFEAW", "DFEAW", "DF3AW", "DFEAW"]
avg_string = ""

for i in range(len(pred_list[0])):
    character_count = {}
    
    for pred in pred_list:
        if pred[i] not in character_count:
            character_count[pred[i]] = 1
        else: 
            character_count[pred[i]] += 1
    
    avg_string += max(character_count, key=character_count.get)

print(avg_string)

Result: "DFEAW"

Note that this approach doesn't take into account the frequently confused characters.
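A sketch of how confused characters could be handled: pool the per-position votes over confusion groups, then pick the most frequent literal character inside the winning group. The groups below are assumptions for illustration (tune them to the confusions you actually observe), and the code assumes all guesses have the same length:

```python
from collections import Counter

# Hypothetical confusion groups: characters OCR commonly swaps.
CONFUSION_GROUPS = ["1lI", "0Oo", "6G", "5Ss", "8B", "2Z"]
CANON = {ch: group[0] for group in CONFUSION_GROUPS for ch in group}

def canon(ch):
    # Fold case, then map through the confusion table.
    return CANON.get(ch, CANON.get(ch.upper(), ch.upper()))

def consensus(guesses):
    # Assumes aligned, equal-length guesses.
    result = []
    for chars in zip(*guesses):
        # Vote on equivalence classes, so e.g. '6' and 'G' pool their votes.
        classes = Counter(canon(c) for c in chars)
        best_class = classes.most_common(1)[0][0]
        # Within the winning class, keep the most frequent literal character.
        literals = Counter(c for c in chars if canon(c) == best_class)
        result.append(literals.most_common(1)[0][0])
    return "".join(result)
```

With your first component this yields "BN65E" rather than "BNGSE", because '6' outvotes 'G' within their shared group; whether that is preferable depends on your data, as discussed in the comments.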

If there's a possibility of misalignment between the OCR results (e.g. the OCR predicted two characters instead of one, there is an extra space...), you would need to first align the different strings with each other (see: Multiple Sequence Alignment).

The python-Levenshtein module can be useful in that case:

import Levenshtein 
Levenshtein.median(["  DFEA W", "DFEAW", "DF3AW", "DFEAVV"])

Result: "DFEAW"
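If the extra dependency is undesirable, a rough stdlib-only stand-in (my own sketch, not part of python-Levenshtein) is to pick the existing guess with the highest average difflib.SequenceMatcher ratio. This is a "set median": unlike Levenshtein.median it can never synthesize a brand-new string, only choose among the guesses:

```python
from difflib import SequenceMatcher

def set_median(guesses):
    # Return the existing guess most similar on average to the whole list.
    # Comparing each guess against the full list (including itself) is
    # fine here: duplicates rightly boost their own score.
    return max(guesses, key=lambda g: sum(
        SequenceMatcher(None, g, other).ratio() for other in guesses))
```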

Waltharnack