I have run EasyOCR in Python over a large number of black-and-white images of the text printed on soldered components, with the goal of collecting the writing on each of them. Most of the results are good, but some are inconsistent, and I would like to filter those out.
I took multiple pictures of each component and all of them are labeled, so my DataFrame looks like this:
ID | OCR Guesses |
---|---|
component 1 | [RNGSE, BN65E, 8NGse, BN65E, BN65E] |
component 2 | [DFEAW, DFEAW, DF3AW, DFEAW] |
component 3 | [1002, 1002, l002, 1002] |
As you can see, most of the characters are identified correctly, but sometimes a letter is read as a digit or vice versa. Is there an easy method to "take the average" of these strings to find the most likely correct OCR result? The result I am aiming for would look like the following:
ID | OCR Guesses | Correct |
---|---|---|
component 1 | [RNGSE, BN65E, 8NGse, BN65E, BN65E] | BNGSE |
component 2 | [DFEAW, DFEAW, DF3AW, DFEAW] | DFEAW |
component 3 | [1002, 1002, l002, 1002] | 1002 |
It would be great if there were a module that takes into account commonly confused characters such as 1 and l, 6 and G, B and R, etc.
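To make the idea of "averaging" concrete, here is a minimal sketch of a naive baseline: a per-position majority vote over the guesses. The function name `consensus` is made up for illustration, and the sketch assumes the guesses for one component mostly have the same length, so that character positions line up. It treats every character as unrelated to every other, so it knows nothing about pairs like 1 and l. Note that for component 1 it yields BN65E, so on its own it does not reproduce the BNGSE shown in the table above.

```python
from collections import Counter


def consensus(guesses):
    """Naive per-position majority vote (hypothetical helper, not a library API)."""
    if not guesses:
        return ""
    # Keep only the guesses with the most common length so positions line up.
    common_len = Counter(len(g) for g in guesses).most_common(1)[0][0]
    aligned = [g for g in guesses if len(g) == common_len]
    # For each character position, keep the most frequent character.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*aligned)
    )


print(consensus(["DFEAW", "DFEAW", "DF3AW", "DFEAW"]))
print(consensus(["1002", "1002", "l002", "1002"]))
print(consensus(["RNGSE", "BN65E", "8NGse", "BN65E", "BN65E"]))
```

On a DataFrame, something like `df["OCR Guesses"].apply(consensus)` could fill the new column, assuming each cell holds a Python list of strings.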
Any help is appreciated. Thanks!