My task is to extract text from a scanned document or JPG and then pull out only the six values listed below, so that I can auto-fill a form on my next screen/activity.

I used the Google Cloud Vision API in my Android app on the Blaze (paid) plan, and I get the result back as a text block, but I want to extract only some of the information from it. How can I achieve that?

Bills and receipts can differ every time, but I want the same six things out of every invoice's text block, for example:

  1. Vendor
  2. Account
  3. Description
  4. Due Date
  5. Invoice Number
  6. Amount

Is there any tool or third-party library that I can use in my Android development?

Note - I don't think a sample receipt or bill image is needed for this, because it can be any type of bill or invoice; we just need to extract the six things mentioned above from the extracted text.

Bajrang Hudda
  • https://github.com/tesseract-ocr/tesseract this one is very good. If there is no interface for Android, you could make a PHP script which receives your image via a simple POST (from your application), processes it, and sends back to your app a valid JSON with the requested data (vendor, account, etc). – besciualex Dec 04 '19 at 10:59
  • If you already receive text, you can 'filter' it. For example, you establish a start delimiter and an end delimiter for Vendor. Then you parse the received text and extract what you need. Example of input data: `Vendor name: Baj Bussiness Phone: +4546464446546`. The data for field `vendor` is between `Vendor name:` and `Phone`. You extract that data using these `start` and `end` delimiters. – besciualex Dec 04 '19 at 11:00
  • @besciualex, it's a good approach if I am using just a single receipt format, but in my case there can be any number of bill/receipt formats, so that kind of logic is surely going to fail. – Bajrang Hudda Dec 04 '19 at 11:10
  • Is the number of bill formats finite, or completely unknown? Two weeks ago I did something similar for a friend, and he had about 10 unique PDF templates (with different values in the form fields). We made 10 templates, each one with its own delimiters. Each PDF had something unique, which could be used to identify the correct delimiter template. You could implement the same logic if you have a finite number of bill formats. – besciualex Dec 04 '19 at 11:25
  • We had an application like yours in our mall. You would scan your bill and then receive points. If the application failed to identify the info from a bill, it would tell the user that the image would be processed manually by someone. The person who processes the image is also in charge of creating a new delimiter template. As far as I know, what you want (100% automatic processing of an unlimited number of formats) requires Artificial Intelligence. – besciualex Dec 04 '19 at 11:34
  • Let's say I have a finite number of formats; what would the logic be for that? Could you help me with this? – Bajrang Hudda Dec 04 '19 at 11:50
  • Yes, check my answer. – besciualex Dec 05 '19 at 07:27
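
The two ideas from the comments above (extract a value between a start and an end delimiter, and recognise a bill format by something unique to it) can be sketched roughly as follows. This is only an illustration under assumptions: `extractBetween`, `detectFormat`, `FORMAT_MARKERS` and the `"Supplier:"` marker are hypothetical names and values that do not come from any library; only the sample string is taken from the comment.

```javascript
// Hypothetical helper: returns the text between two delimiters,
// or null when the start delimiter does not occur in the text.
function extractBetween(text, startDelimiter, endDelimiter) {
  const start = text.indexOf(startDelimiter);
  if (start === -1) return null;
  const valueStart = start + startDelimiter.length;
  // Look for the end delimiter only after the start delimiter.
  const end = endDelimiter ? text.indexOf(endDelimiter, valueStart) : -1;
  const valueEnd = end === -1 ? text.length : end;
  return text.substring(valueStart, valueEnd).trim();
}

// Hypothetical marker table: each known format is recognised by a string
// that is assumed to appear only on that kind of bill.
const FORMAT_MARKERS = {
  format_a: "Vendor name:",
  format_b: "Supplier:",
};

// Returns the name of the first format whose marker occurs in the text.
function detectFormat(text) {
  for (const [name, marker] of Object.entries(FORMAT_MARKERS)) {
    if (text.includes(marker)) return name;
  }
  return null; // unknown format: hand the image over to manual processing
}

// Sample input taken from the comment above.
const sample = "Vendor name: Baj Bussiness Phone: +4546464446546";
console.log(detectFormat(sample));                            // format_a
console.log(extractBetween(sample, "Vendor name:", "Phone")); // Baj Bussiness
```

Searching for the end delimiter only after the start delimiter avoids cutting the value short when the end keyword also happens to appear earlier in the text.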

1 Answer

In the following scenario I create two fictitious bill formats, then write an algorithm to parse them. I will write only the algorithm, because I don't know Java.

[Image: the two sample bills next to the text obtained from each by OCR]

In the first column we have pictures of the two bills. In the second column we have the text obtained from the OCR software. It is like a plain text file with no structure, but we know certain keywords that give it meaning. Below is the algorithm that turns this unstructured text into a well-structured JSON object.

// Text obtained from BILL format 1
// (backtick strings are used because they may span several lines in JavaScript)
var TEXT_FROM_OCR = `Invoice no 12 Amount 55$
Vendor name BusinessTest 1 Account No 1213113
Due date 2019-12-07  
Description Lorem ipsum dolor est`;


// Text obtained from BILL format 2
// (use one sample at a time: this assignment replaces the one above)
var TEXT_FROM_OCR = `    BusinessTest22        
Invoice no    19    Amount    12$
Account    4564544    Due date    2019-12-15
Description            
Lorem ipsum dolor est            
Another description line            
Last description line`;




// This is a JSON-like object which describes the logic behind the text
// (the comments are only for explanation; strict JSON does not allow them)
var TEMPLATES = {


    "bill_template_1": {
        "vendor": {
            "line_no_start": null,                // null means unknown; our text parser ignores it
            "line_no_end": null,                  // null means unknown; our text parser ignores it
            "start_delimiter": "Vendor name",     // Searched value starts immediately after this start delimiter
            "end_delimiter": "Account",           // Searched value ends just before this end delimiter
            "value_found": null                   // Save here the value we found
        },
        "account": {
            "line_no_start": null,                // null means unknown; our text parser ignores it
            "line_no_end": null,                  // null means unknown; our text parser ignores it
            "start_delimiter": "Account No",      // Searched value starts immediately after this start delimiter
            "end_delimiter": null,                // Extract everything until the end of the current line
            "value_found": null                   // Save here the value we found
        },
        "description": {
            // apply same logic as above
        },
        "due_date": {
            // apply same logic as above
        },
        "invoice_number": {
            // apply same logic as above
        },
        "amount": {
            // apply same logic as above
        },
    },


    "bill_template_2": {
        "vendor": {
            "line_no_start": 0,                   // Extract data from line zero
            "line_no_end": 0,                     // Extract data until line zero
            "start_delimiter": null,              // Ignore this, because our delimiter is a complete line
            "end_delimiter": null,                // Ignore this, because our delimiter is a complete line
            "value_found": null                   // Save here the value we found
        },
        "account": {
            "line_no_start": null,                // null means unknown; our text parser ignores it
            "line_no_end": null,                  // null means unknown; our text parser ignores it
            "start_delimiter": "Account",         // Searched value starts immediately after this start delimiter
            "end_delimiter": "Due date",          // Searched value ends just before this end delimiter
            "value_found": null                   // Save here the value we found
        },
        "description": {
            "line_no_start": 4,                   // Extract data from line 4 (0-based), right after the "Description" label on line 3
            "line_no_end": 99999,                 // Extract data until line 99999 (a very big number which means EOF)
            "start_delimiter": null,              // Ignore this, because our delimiter is a complete line
            "end_delimiter": null,                // Ignore this, because our delimiter is a complete line
            "value_found": null                   // Save here the value we found
        },
        "due_date": {
            // apply same logic as above
        },
        "invoice_number": {
            // apply same logic as above
        },
        "amount": {
            // apply same logic as above
        },
    }
};


// ALGORITHM

// 1. Split the TEXT_FROM_OCR variable into an array (each index is one line of the file).
// In JavaScript we would do something like this:

var LINES = TEXT_FROM_OCR.split(/\r?\n/); // accepts both "\n" and "\r\n" line endings


var MAXIMUM_SCORE = 6; // we are looking to extract 6 values, out of 6


// 2. Try every template and count how many fields it manages to extract
for (var [TEMPLATE_TO_PARSE, PARSE_METADATA] of Object.entries(TEMPLATES)) {

    var SCORE = 0; // for each field we find, we increment the score


    for (var [SEARCHED_FIELD_NAME, DELIMITERS_METADATA] of Object.entries(PARSE_METADATA)) {

        // Search by line numbers first ("!= null" also skips fields that are not filled in yet)
        if (DELIMITERS_METADATA['line_no_start'] != null && DELIMITERS_METADATA['line_no_end'] != null) {

            // Never read past the last line (line_no_end may be 99999, which means EOF)
            var LAST_LINE = Math.min(DELIMITERS_METADATA['line_no_end'], LINES.length - 1);

            // Collect the lines defined by the template, one by one
            var FOUND_LINES = [];
            for (var LINE_NO = DELIMITERS_METADATA['line_no_start']; LINE_NO <= LAST_LINE; LINE_NO++) {
                FOUND_LINES.push(LINES[LINE_NO].trim());
            }

            // Join the collected lines into a single value
            DELIMITERS_METADATA['value_found'] = FOUND_LINES.join(' ').trim();

            // Count the field only if those lines actually contained text
            if (DELIMITERS_METADATA['value_found'] !== '') {
                SCORE++;
            }
            continue;
        }


        // Search by text delimiters
        if (DELIMITERS_METADATA['start_delimiter'] != null) {

            // Search for the start delimiter inside each line of the file
            for (var LINE_CONTENT of LINES) {

                var DELIMITER_OFFSET = LINE_CONTENT.indexOf(DELIMITERS_METADATA['start_delimiter']);

                // If we found start_delimiter on this line, then let's parse it
                if (DELIMITER_OFFSET > -1) {

                    // The searched value starts at the offset we found + the length of the start delimiter
                    var START_POSITION = DELIMITER_OFFSET + DELIMITERS_METADATA['start_delimiter'].length;

                    // By default we extract everything from START_POSITION until the end of the current line
                    var END_POSITION = LINE_CONTENT.length;

                    // However, if an end delimiter is defined and occurs after the start, we stop there
                    if (DELIMITERS_METADATA['end_delimiter'] != null) {
                        var END_OFFSET = LINE_CONTENT.indexOf(DELIMITERS_METADATA['end_delimiter'], START_POSITION);
                        if (END_OFFSET > -1) {
                            END_POSITION = END_OFFSET;
                        }
                    }

                    // Extract the value (substring takes start and end positions, unlike substr)
                    DELIMITERS_METADATA['value_found'] = LINE_CONTENT.substring(START_POSITION, END_POSITION).trim();

                    // We found a value, so increment the score
                    SCORE++;

                    // Stop scanning lines and move on to the next field
                    break;
                }
            }
        }
    }


    console.log(TEMPLATE_TO_PARSE + ' obtained a score of ' + SCORE + ' out of ' + MAXIMUM_SCORE);
}
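
A minimal, self-contained sketch of the delimiter search applied to bill format 1 might look like this. The delimiters for `description`, `due_date`, `invoice_number` and `amount` are my assumptions, read off the sample text above; `parseWithTemplate`, `billTemplate1` and the compact `[start, end]` layout are hypothetical names used only for illustration.

```javascript
// OCR text of bill format 1, as in the sample above.
const ocrText = [
  "Invoice no 12 Amount 55$",
  "Vendor name BusinessTest 1 Account No 1213113",
  "Due date 2019-12-07  ",
  "Description Lorem ipsum dolor est",
].join("\n");

// Assumed delimiters for all six fields: [start_delimiter, end_delimiter].
// A null end delimiter means "until the end of the line".
const billTemplate1 = {
  vendor: ["Vendor name", "Account"],
  account: ["Account No", null],
  description: ["Description", null],
  due_date: ["Due date", null],
  invoice_number: ["Invoice no", "Amount"],
  amount: ["Amount", null],
};

// Returns an object with one entry per field; a field that was not found is null.
function parseWithTemplate(text, template) {
  const lines = text.split(/\r?\n/);
  const result = {};
  for (const [field, [startDelimiter, endDelimiter]] of Object.entries(template)) {
    result[field] = null;
    for (const line of lines) {
      const start = line.indexOf(startDelimiter);
      if (start === -1) continue; // delimiter not on this line, try the next one
      const valueStart = start + startDelimiter.length;
      const end = endDelimiter === null ? -1 : line.indexOf(endDelimiter, valueStart);
      const valueEnd = end === -1 ? line.length : end;
      result[field] = line.substring(valueStart, valueEnd).trim();
      break; // first matching line wins
    }
  }
  return result;
}

const parsed = parseWithTemplate(ocrText, billTemplate1);
console.log(JSON.stringify(parsed, null, 2));
```

The score used by the algorithm above would then simply be the number of fields in `parsed` that are not `null`.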

At the end you will know which template extracted the most data and, based on that, which one to use for that bill. Feel free to ask anything in the comments. Having spent 45 minutes writing this answer, I'll surely answer your comments as well. :)
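
Choosing the winning template from the scores could be sketched as below. `pickBestTemplate`, the `minimumScore` threshold and the example scores are hypothetical; returning `null` for a low score corresponds to the manual-processing fallback mentioned in the comments under the question.

```javascript
// Hypothetical helper: given an object of { templateName: score }, returns the
// name of the highest-scoring template, or null when even the best score is
// below minimumScore (no template can be trusted for this bill).
function pickBestTemplate(scores, minimumScore) {
  let bestName = null;
  let bestScore = -1;
  for (const [name, score] of Object.entries(scores)) {
    if (score > bestScore) {
      bestName = name;
      bestScore = score;
    }
  }
  return bestScore >= minimumScore ? bestName : null;
}

// Illustrative scores only, not measured results.
const exampleScores = { bill_template_1: 6, bill_template_2: 3 };
console.log(pickBestTemplate(exampleScores, 5));          // bill_template_1
console.log(pickBestTemplate({ bill_template_1: 2 }, 5)); // null
```

When two templates tie, this sketch keeps the one listed first; a stricter rule would need some extra signal, such as a marker unique to each format.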

besciualex
  • Is there any third-party tool or library that can extract the six things mentioned, or do we need to go for machine learning, which I know nothing about? – Bajrang Hudda Dec 11 '19 at 05:30
  • Learning AI takes time. That's why I recommend the algorithm above. The library used in it (the one that translates from IMAGE => BLOCK OF TEXT) is https://github.com/tesseract-ocr/tesseract – besciualex Dec 11 '19 at 08:10