Best OCR approach on documents with different formats to find one specific piece of information

Question

Unfortunately, because of confidential data, I can't give a more specific explanation.

The Problem

So I've got a few documents that in general contain the same information but have different formats. In most cases, the value I am looking for is near a keyword on the document. The OCR itself is taken care of by the Google Cloud Vision API but what is the best approach to handle the different formats?

My idea

... was to train a classifier that detects which format I am dealing with and then picks the matching extraction routine, each of which I would have to implement by hand beforehand. This is neither handy nor scalable. So I am looking for an algorithm that I can tell, for example, where the target value is and what it looks like.

What is the best ML-approach for this problem or what are your ideas?

As an example of the type of data: let's say I have receipts from 20 different supermarkets and I want to find the total cost, with the problem that every company's receipt looks different.

score 1 · Accepted Answer · answered Jun 19 '19 at 14:50

Recently I had to deal with a similar situation using Tesseract. Apart from the OCR tool itself, I didn't use any ML approach because, like you said, it wouldn't be scalable.

I don't think a classifier would pay off unless you have a huge number of different layouts, and then you'd still have to decide how to extract the data for each and every layout...

It depends a lot on the type of data you need to extract, but using your example, if you had to extract the total cost from all the different layouts, you could extract as many numbers as you can from each receipt, and score them based on some factors, like:

If it's a cost ($ or other currency symbols)
The distance to some common keywords like "Total", "Final", "Sum", etc.
If it's the highest value for that receipt
Other factors you might think of; it all depends on the data you need to extract

Then, for each receipt, you can take the value that scored the highest as the total cost.
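To make the scoring idea a bit more concrete, here is a minimal sketch in Python. It assumes the OCR output is already available as a list of text lines. The keyword list, the weights, the amount regex and all function names are made up for illustration, not something I tuned on real receipts. It also measures keyword distance in lines only; with word bounding boxes from the OCR tool you could use real pixel distances instead.

```python
import re
from dataclasses import dataclass

# Assumed keyword list and amount pattern; adjust both to your documents.
KEYWORD_RE = re.compile(r"\b(total|sum|final)\b", re.IGNORECASE)
# Optional currency symbol, then a number with two decimals.
# Thousands separators (e.g. 1,234.56) are NOT handled in this sketch.
AMOUNT_RE = re.compile(r"([$€£])?\s*(\d+[.,]\d{2})\b")


@dataclass
class Candidate:
    value: float
    line_no: int
    has_currency: bool
    score: float = 0.0


def find_candidates(lines):
    """Collect every number that looks like an amount, with its line index."""
    candidates = []
    for i, line in enumerate(lines):
        for m in AMOUNT_RE.finditer(line):
            value = float(m.group(2).replace(",", "."))
            candidates.append(Candidate(value, i, m.group(1) is not None))
    return candidates


def score_candidates(lines, candidates):
    """Score each candidate with the factors listed above (weights are guesses)."""
    keyword_lines = [i for i, line in enumerate(lines) if KEYWORD_RE.search(line)]
    max_value = max((c.value for c in candidates), default=0.0)
    for c in candidates:
        c.score = 0.0
        if c.has_currency:
            c.score += 1.0  # looks like a cost
        if keyword_lines:
            dist = min(abs(c.line_no - k) for k in keyword_lines)
            c.score += 3.0 / (1 + dist)  # closer to a keyword scores higher
        if c.value == max_value:
            c.score += 1.0  # highest value on the receipt
    return candidates


def guess_total(lines):
    """Return the value of the best-scoring candidate, or None if there is none."""
    candidates = score_candidates(lines, find_candidates(lines))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.score).value


receipt = [
    "SUPERMARKET XYZ",
    "Milk        $1.20",
    "Bread       $2.50",
    "Subtotal    $3.70",
    "Tax         $0.30",
    "TOTAL       $4.00",
    "Cash        $5.00",
    "Change      $1.00",
]
print(guess_total(receipt))  # 4.0 for this sample
```

Note that in this sample the highest value (the cash amount) is not the total, which is why the keyword distance gets the largest weight here. You would probably have to adjust the weights after looking at where the sketch goes wrong on your own documents.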

Sounds like that could work out. Thanks for your thoughts, I'll try that out. — Patrick_Weber, Jun 19 '19 at 16:27
