
I have a PDF file that includes a table, and I want to convert it into structured tabular data.

My PDF contains a fairly complex table, which makes most tools insufficient. For example, I tried the following tools and none of them extracted it well: AWS Textract, Google Document AI, Google Vision, Microsoft Text Recognition. Google Document AI came closest, getting about 70% correct, but that is not good enough.

So I looked for a way to train a custom model so that this table would be extracted properly. I tried Power Apps AI Builder and Google AutoML Entity Extraction, but neither helped. (BTW, I wasn't sure what AutoML's purpose is: is it only for prediction, or can it also be used to customize table extraction?)

I would like to know which tools are good for my use case, and whether there is any (AI) tool that I can train on this kind of table so that the text extraction improves.


  • Please **re-read** [What topics can I ask about here?](https://stackoverflow.com/help/on-topic), as it would seem that you missed some crucial points the first time you read it, namely that questions asking us to *recommend or find a book, tool, software library, tutorial or other off-site resource* are off-topic for SO. – desertnaut Oct 13 '21 at 12:30

1 Answer


Most text extractors should preserve that structure if the page is rendered crisply enough, but layout can be a fickle mistress.

Here it correctly picked up the misspelling "reaar" but failed on the first line, at 05.05.1983.

enter image description here

On an identical second pass, the failures are different:

 3      29.06.1983      Part of Ground Floor of       05.05.1983      GM315727
        2 (part of)     Conavon Court                25 years from
                                                     1.3.1983
 4      31.01.1984      Part of Third Floor Conavon   30.12.1983      GM335793
        4 (part of)     Court                        25 years from
                                                     12.8.1983
 5      19.04.1984      I?art of Basement Floor of     23.01.1984      GM342693
        l (part of), 2  Conavon C:ourt                25 years from
         (part of), 3                                 20.01.1984
         (part Of ) , 4
         (part of)
        NOTE: The Lease also grants a right of way for the purpose only of
        loading and unloading and reserves a right of way in case of emergency
        only from the  boiler house adjacent hereto
 6      14.06.1984      Part of Third Floor Conavon   31.10.1983      GM347623
        3 (part of)     Court                        25 years from
                                                     31.10.1983
 7      14.06.1984      Part of the Third Floor       31.10.1983      GM347623
        3 (part: of}, 4  Conavon Court                25 years from
         (part of)                                    31.10.1983
 8      01.10.1984      "The Italian Stallion''       17.08.1984      GM357142
        4 (part of)     Conavon Court (Basement)      25 years from
                                                     20.1.1984
        NOTE: The Lease also grants a right of way for the purpose only of
        loading and unloading and a right of access through the security door
        at the reaar of the building
 9      06.07.2016      3rd floor 14-16 Blackfriars   28.06.2016
        4 (part of}, 5  Streec                       5 years from
         (part of)                                    25/06/2016

That's the beauty of OCR: every run can have a different pass rate per character, so experience says to use a best-of-three estimate. Run the extraction three different ways, compare the outputs character by character, and keep the characters on which the runs agree.
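As a rough illustration of that best-of-three idea, here is a minimal Python sketch of per-character majority voting. It assumes the three runs of a line are already aligned position by position, which real OCR output often is not (an inserted or dropped character shifts everything after it), so an alignment step would be needed in practice. The function name `vote_chars`, the `?` placeholder, and the sample strings are made up for the example; only the "Streec" misread is taken from the output above.

```python
from collections import Counter
from itertools import zip_longest


def vote_chars(runs, unknown="?"):
    """Majority-vote each character position across OCR runs of one line.

    Assumes the runs are aligned position by position (an assumption,
    not something OCR engines guarantee).
    """
    voted = []
    # Pad shorter runs with "" so every position is compared in all runs.
    for chars in zip_longest(*runs, fillvalue=""):
        best, count = Counter(chars).most_common(1)[0]
        # Keep a character only when more than half of the runs agree on it;
        # otherwise mark the position for manual review.
        voted.append(best if count * 2 > len(runs) else unknown)
    return "".join(voted)


# Hypothetical results of three OCR passes over the same text.
passes = [
    "Blackfriars Streec",
    "Blackfriars Street",
    "Blackfriaxs Street",
]
print(vote_chars(passes))  # prints "Blackfriars Street"
```

Positions where no two runs agree come back as `?`, which makes it easy to see which characters still need a human check.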

K J
  • @K J Thank you. I am not sure what you mean by the last words of the sentence, "best of three quotations". Could you please explain? – Kristina Hammer Oct 13 '21 at 07:23