
I have a PDF file (just one for testing for now, but I would like to find a generic solution, as I'll have more files in the near future). It contains a couple of tables with different formats, and the language of the tables is RTL (not LTR).

Another challenge I'm facing is that the tables' structure is not consistent, neither within a single table nor between tables. For example, I have a table that looks like this:

[image: first table]

As you can see, some rows are merged together, the first row has two headers, and so on. Another table in the same file is this:

[image: second table]

The first two lines contain the headers and some data, then for a couple of lines the header is on the right-hand side of the table, and then there is another row with a header followed by data. You can see that the number of columns changes within the table itself, and table 1 doesn't look like table 2.

I need to read this PDF and extract the tables in a way that keeps the information in a form that makes sense, and the approach has to be generic enough to handle more than this one PDF.
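To make "a form that makes sense" a bit more concrete, here is a minimal sketch of the kind of output I have in mind. It assumes the rows have already been extracted as lists of cell strings in visual left-to-right order, which is an assumption on my part and not something any of the libraries guarantees. The `normalize_rows` helper and the sample data are hypothetical, for illustration only.

```python
from typing import List, Optional


def normalize_rows(
    rows: List[List[str]], rtl: bool = True
) -> List[List[Optional[str]]]:
    """Pad ragged rows to a common width.

    If rtl is True, each row is reversed first, so that the rightmost cell
    (the one read first in an RTL table) ends up at index 0.
    Missing cells are filled with None.
    """
    width = max((len(row) for row in rows), default=0)
    grid = []
    for row in rows:
        cells = list(reversed(row)) if rtl else list(row)
        cells += [None] * (width - len(cells))
        grid.append(cells)
    return grid


# Hypothetical sample: rows with a varying number of columns,
# listed in visual left-to-right order.
raw = [
    ["B", "A"],
    ["3", "2", "1"],
    ["x"],
]
grid = normalize_rows(raw)
for line in grid:
    print(line)
```

This only covers the varying column count and the reading direction. It does not deal with merged cells or with headers that sit on the right-hand side of the data, which is the part I'm really stuck on.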

Libraries such as tabula-py, img2table, and camelot let me read the tables, but the quality of the extraction is bad at best.

I haven't (yet) found any solution for reading and extracting this kind of table from a PDF, but I'm sure I'm not the first one to tackle this.

Any thoughts, methods, or suggestions on how to solve this would be great.

Alon
