How to check if a scanned document contains an address

Question

I need to scan documents and check whether they contain specific data. To put it simply, assume I need to find out whether a scanned invoice contains a specific address.

The address to search for may be written differently from how it appears in the document, e.g.:

address to search for (Italian address): "Piazza Santa Rita 43, 10390, Torino(TO)"

address in the scanned document could be like: "Torino, P.zza S.Rita 43, 10390, Torino" or "Pizza S.Rita 43, 10390, Torino" and so on

I'm looking for a way to measure the "similarity" between the address to search for and the scanned text, so that if I find text that is, let's say, at least 80% similar, I consider the document valid.

Apart from how the address is typed, another problem is that the scanned document may be (and most of the time will be) of poor quality, so the OCR engine can misread some characters and give bad results (e.g. a 'c' becomes an 'o', a '3' becomes a 'B', etc.), so I want to take this into account too.

e.g. the scanned document could yield "Plzza S.Rita 4B, 1O390, Tcrinc"

Any advice about how to solve this problem?

I'm currently developing this on Android, using OpenCV to deskew the document picture and Google Firebase ML Kit to scan the document on-device (I can't rely on external services; everything must run on-device). So I need to solve this in Java, working from the text returned by the ML Kit OCR, but advice implemented in other languages or platforms is fine as a reference.

You can easily perform that kind of string matching by using Regular Expressions. — Phantômaxx, Aug 31 '18 at 11:24
I think you underestimate the problem: this is not pure string matching but string similarity, and besides word order you also have to consider wrongly scanned characters that lead to false negatives. I think I should use some mix of algorithms like https://en.wikipedia.org/wiki/Levenshtein_distance – Not Important, Aug 31 '18 at 15:13
Why do you think that Regular Expressions wouldn't work for you? They are commonly used to solve this exact kind of problem. — Phantômaxx, Aug 31 '18 at 15:49

score 0 · Answer 1 · answered Aug 31 '18 at 22:27

This is indeed a fairly hard problem. I believe your best bet is fuzzy string matching.
There are some Java libraries that should be helpful to you, e.g. JavaWuzzy.

Functions like extractX and sortX should come in handy:

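import java.util.Arrays;
import me.xdrop.fuzzywuzzy.FuzzySearch;

// Pick the closest candidate from a list of choices.
FuzzySearch.extractOne("cowboys", Arrays.asList("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"));
(string: Dallas Cowboys, score: 90, index: 3)

// Compare two strings while ignoring word order; returns an int score from 0 to 100.
FuzzySearch.tokenSortPartialRatio("order words out of", "  words out of order");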
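If pulling in a library is not an option, the same idea can be sketched in plain Java. The block below is only an illustration, not JavaWuzzy's implementation: the class name AddressMatcher, its methods, and the small table of OCR look-alikes (0/c/o, 1/l/i, 8/b/3) are all assumptions made for this example. It folds look-alike characters into one representative, then computes the smallest edit distance between the address and any substring of the OCR text (approximate substring matching, a Levenshtein variant), and turns that into a score between 0 and 1 that can be compared with the 80% threshold from the question.

```java
import java.util.Locale;

class AddressMatcher {

    /** Hypothetical look-alike table: maps characters OCR tends to confuse to one representative. */
    public static char fold(char c) {
        switch (c) {
            case '0': case 'c': return 'o';
            case '1': case 'l': return 'i';
            case 'b': case '8': return '3';
            default: return c;
        }
    }

    /** Lower-cases, folds look-alikes, and collapses punctuation and whitespace to single spaces. */
    public static String normalize(String s) {
        StringBuilder sb = new StringBuilder();
        boolean pendingSpace = false;
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            char f = fold(c);
            if (Character.isLetterOrDigit(f)) {
                if (pendingSpace && sb.length() > 0) sb.append(' ');
                pendingSpace = false;
                sb.append(f);
            } else {
                pendingSpace = true;
            }
        }
        return sb.toString();
    }

    /** Smallest edit distance between the pattern and any substring of the text. */
    public static int bestSubstringDistance(String pattern, String text) {
        int m = pattern.length();
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        for (int i = 0; i <= m; i++) prev[i] = i;   // matching against an empty substring
        int best = prev[m];
        for (int j = 1; j <= text.length(); j++) {
            cur[0] = 0;                              // a match may start at any text position
            for (int i = 1; i <= m; i++) {
                int cost = pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                cur[i] = Math.min(Math.min(prev[i] + 1, cur[i - 1] + 1), prev[i - 1] + cost);
            }
            best = Math.min(best, cur[m]);           // a match may end at any text position
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return best;
    }

    /** Score in [0, 1]; 1.0 means the normalized address occurs verbatim in the normalized text. */
    public static double similarity(String address, String ocrText) {
        String p = normalize(address);
        if (p.isEmpty()) return 0.0;
        int d = bestSubstringDistance(p, normalize(ocrText));
        return 1.0 - (double) d / p.length();
    }

    public static void main(String[] args) {
        String address = "Piazza Santa Rita 43, 10390, Torino";
        String ocrText = "Plzza S.Rita 4B, 1O390, Tcrinc";
        double score = similarity(address, ocrText);
        System.out.println("score = " + score + ", valid = " + (score >= 0.8));
    }
}
```

Folding look-alikes before comparing means OCR confusions such as "4B" for "43" cost nothing, so the remaining distance mostly reflects abbreviations like "S." for "Santa". This sketch does not handle reordered address parts; for that, a token-sort comparison like the one above is a better fit.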

wp78de

I'll be back on the code next week; I'll try it and let you know whether this library is useful. Thanks. – Not Important, Sep 08 '18 at 22:13
