
I am using the attached image to scan and retrieve a table with Google Cloud Document AI, but I get the attached text as output. The source code file is attached as well. The output is not what I expected: the table contents are not extracted properly.

The source code is here: https://drive.google.com/file/d/1W7dcolWE8Ie7YDubE6Yt0CCNDJ_lhg7X/view?usp=sharing

The image file used for scanning is here: https://drive.google.com/file/d/1nBKMR5wvai8zzMV8b5JsW8ZgGPCD1Jwv/view?usp=sharing

The generated output is here: https://drive.google.com/file/d/1EbrWAQsvTjuH8fOAvikwF4sz18kURYCh/view?usp=sharing

vishal

1 Answer


Can you please verify that there is no PII in that sample document provided?

Can you also specify what you would expect the output to be like?

The table output starting with "Table with 8 columns and 28 rows:" appears to be accurate.

Note: You may need to adjust the code shown in the samples to print or store it exactly as you want. There may also be some post-processing required to remove unwanted characters.

I'd also recommend referring to the Java code samples on this page for handling the processing response to see if the output from the processor is more in line with what you are expecting.

You can try doing some basic post processing to separate the Qty and Total fields by applying a regex or pattern matching to separate the two different data elements in that column.
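As a rough illustration of that post-processing, here is a minimal Python sketch that splits a fused cell such as `1.00441.00` (the value reported in the comments below) into Qty and Total. It assumes both values are always printed with exactly two decimal places and no thousands separators; that assumption comes from the one sample row, not from anything Document AI guarantees. The helper name `split_qty_total` is made up for this example.

```python
import re

# Assumption: each fused value has exactly two decimal places, e.g. "1.00441.00".
FUSED_PAIR = re.compile(r"^(\d+\.\d{2})(\d+\.\d{2})$")


def split_qty_total(cell: str):
    """Return (qty, total) if the cell looks like two fused decimals, else None."""
    # Drop the "[", "|" and "]" decoration that the sample code prints around cells.
    cleaned = cell.strip().strip("[]|").strip()
    match = FUSED_PAIR.match(cleaned)
    if match is None:
        return None  # a single value, or a format this sketch does not cover
    return match.group(1), match.group(2)


print(split_qty_total("[1.00441.00|]"))
print(split_qty_total("[0.00|]"))
```

A cell holding a single value, like `0.00`, does not match the pattern and comes back as `None`, so it can be passed through unchanged. If your quantities can have a different number of decimals, the pattern needs adjusting.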



You can also try using a specific version of Form Parser such as:

  • pretrained-form-parser-v1.0-2020-09-23
  • pretrained-form-parser-v2.0-2022-11-10
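To target one of these versions, the process request is sent to the processor version's resource name rather than to the processor itself. The sketch below only builds that name as a string; the layout shown is my understanding of the Document AI resource naming, and the project, location and processor IDs are placeholders. I believe the Python client library also offers a `processor_version_path` helper on `DocumentProcessorServiceClient` that produces the same string, but check that against your installed version.

```python
def processor_version_name(project_id: str, location: str,
                           processor_id: str, version_id: str) -> str:
    """Build the resource name for a specific processor version.

    Assumed layout:
    projects/{project}/locations/{location}/processors/{processor}/processorVersions/{version}
    """
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


# Placeholder IDs for illustration only; substitute your own.
name = processor_version_name(
    "my-project", "us", "my-processor-id",
    "pretrained-form-parser-v2.0-2022-11-10",
)
print(name)
```

The resulting string would go in the `name` field of the process request in place of the plain processor name.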

I'd also recommend trying out the Invoice Parser processor since this document type is similar to an invoice. This processor will give you named entities instead of a generic table layout.

You can also use Uptraining to add in custom fields for pretrained processors and improve accuracy by creating labeled training data.



You can watch the videos on this page to get more information about these features. https://cloud.google.com/document-ai/docs/videos

Holt Skinner
  • "Can you please verify that there is no PII in that sample document provided?" Yes, the PII has been stripped. As suggested, I tried version "pretrained-form-parser-v2.0-2022-11-10" and the output was a little better, but it still had a problem. With the same image, the first row comes out as: Row data: [VENFLON PRO SAFETY 22G X 25MM 393222 BD|] | [30/11/2022 14:14|] | [2140832/31/05/25/BD|] | [1.00441.00|] | [0.00|] | [441.00|] The problem is that 1.00441.00 is treated as a single column, while it is actually two different values. – vishal May 24 '23 at 04:55
  • I was able to get comparatively better output from AWS Textract. – vishal May 26 '23 at 05:06
  • OK, that description makes sense. I have updated my answer with more specific info. – Holt Skinner May 30 '23 at 19:47