I need to scan documents and check whether they contain specific data. To put it "simply", assume I need to find out whether a scanned invoice contains a specific address.
The given address to search could be written in different ways compared to how it's written in the document, e.g.:
address to search (Italian address): "Piazza Santa Rita 43, 10390, Torino(TO)"
address in the scanned document could be like: "Torino, P.zza S.Rita 43, 10390, Torino" or "Pizza S.Rita 43, 10390, Torino" and so on
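One way to cope with the different spellings and orderings shown above is to expand known abbreviations and then compare sets of tokens rather than whole strings. The following is only a minimal sketch under stated assumptions: the class `AddressTokens`, its methods, and the tiny abbreviation table (`P.zza` to `Piazza`, `S.` to `Santa`) are hypothetical and chosen just to fit this example. A real table would need many more entries, and `S.` can also stand for `San` or `Santo`.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class AddressTokens {
    private AddressTokens() {}

    /** Lowercases, expands a few (assumed) abbreviations, and splits into tokens. */
    static Set<String> tokens(String address) {
        String s = address.toLowerCase(Locale.ROOT)
                .replaceAll("\\bp\\.?zza\\b", "piazza") // assumed abbreviation
                .replaceAll("\\bs\\.", "santa ");       // assumed abbreviation
        Set<String> out = new LinkedHashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String t : s.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Fraction of the query's tokens that also occur in the document text. */
    static double coverage(String query, String documentText) {
        Set<String> q = tokens(query);
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> d = tokens(documentText);
        int hits = 0;
        for (String t : q) {
            if (d.contains(t)) {
                hits++;
            }
        }
        return (double) hits / q.size();
    }
}
```

With the two strings from the example, `AddressTokens.coverage("Piazza Santa Rita 43, 10390, Torino(TO)", "Torino, P.zza S.Rita 43, 10390, Torino")` matches 6 of the 7 query tokens (only `to` is missing). Token order is ignored, so the leading "Torino" does not hurt the score. Exact token matching does not tolerate OCR errors on its own.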
I'm looking for a way to measure a kind of "similarity" between the data I'm searching for and the text in the document, so that if I find text that matches at, let's say, 80% or more, I consider the document valid.
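A common way to turn "closeness" into a percentage is a normalized Levenshtein (edit) distance: count the single-character insertions, deletions, and substitutions needed to turn one string into the other, then divide by the longer length. The sketch below is one possible illustration, not a definitive implementation. The class `AddressSimilarity` and its method names are hypothetical, and the normalization step (lowercasing, collapsing punctuation to spaces) is an assumption.

```java
import java.util.Locale;

final class AddressSimilarity {
    private AddressSimilarity() {}

    /** Classic Levenshtein distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Lowercases and collapses every run of non-alphanumerics to one space. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
    }

    /** Similarity in [0, 1]; 1.0 means identical after normalization. */
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int max = Math.max(x.length(), y.length());
        if (max == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / max;
    }
}
```

For example, `AddressSimilarity.similarity("Piazza S.Rita 43", "Plzza S.Rita 43")` needs two edits over 16 normalized characters, giving 0.875, which would pass an 80% threshold. Plain edit distance penalizes reordered parts heavily, so it is probably better applied per token or per candidate line of OCR output than to the whole page.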
Apart from how the address is typed, another problem is that the scanned document could be (and most of the time will be) of poor quality, so the OCR engine can misinterpret some characters and give bad results (for example, a 'c' becomes an 'o', a '3' becomes a 'B', etc.), so I want to take this into account too.
e.g. the scanned document could lead to "Plzza S.Rita 4B, 1O390, Tcrinc"
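One possible way to account for such OCR confusions is to "fold" visually similar characters onto a single representative in both the search string and the OCR text before comparing them. This is a minimal sketch only: the class `OcrFolding` is hypothetical, and the confusion groups (`l`, `1`, `|` with `i`; `0`, `c` with `o`; `b` with `3`) are assumptions taken from the examples here, not a table provided by any OCR library.

```java
import java.util.Locale;

final class OcrFolding {
    private OcrFolding() {}

    /** Maps a lowercase character to the representative of its (assumed) confusion group. */
    static char foldChar(char ch) {
        switch (ch) {
            case 'l':
            case '1':
            case '|':
                return 'i';
            case '0':
            case 'c':
                return 'o';
            case 'b':
                return '3';
            default:
                return ch;
        }
    }

    /** Lowercases the text and folds every character. */
    static String fold(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            sb.append(foldChar(lower.charAt(i)));
        }
        return sb.toString();
    }
}
```

With this table, `OcrFolding.fold("Plzza S.Rita 4B, 1O390, Tcrinc")` and `OcrFolding.fold("Pizza S.Rita 43, 10390, Torino")` produce the same string. The trade-off is that folding also makes genuinely different strings collide, so it raises recall at the cost of precision. A gentler alternative would be an edit distance that charges a reduced cost for substitutions inside a confusion group.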
Any advice about how to solve this problem?
Currently I'm developing this on Android, using OpenCV to deskew the document picture and Google Firebase ML Kit to run OCR on the document on-device (I can't rely on external services, so everything must run on-device). I therefore need to solve this in Java, working from the text returned by the ML Kit OCR, but advice that implements this in other languages/platforms is fine as a reference.