How can I extract tabular data from a PDF using a Python or Java program? The PDF file contains the borderless table shown below.
Viewed 3,092 times
3 Answers
2
This table might be a difficult one for tabula. How about using `guess=False, stream=True`?
Update: As of tabula-py 1.0.3, `guess` and `stream` should work together. There is no need to set `guess=False` to use the `stream` or `lattice` option.

chezou
- Hi @chezou, thank you for your comment. I tried your answer with the code below: `tabula.convert_into("/Downloads/Test_Invoices/Invoice4.pdf", "/Downloads/Test_Invoices/Invoice4.csv", output_format="csv", spreadsheets=True, guess=False, stream=True)` But no table was extracted. – Richie Aug 08 '18 at 12:12
- Hi @chezou, do you know any other Python/Java libraries for this? – Richie Aug 08 '18 at 12:14
- I recommend setting the `pages` option. By default, tabula-py extracts only page 1. – chezou Aug 08 '18 at 23:18
- Hi @chezou, how would I do that? I'm not quite familiar with specifying these parameter values. – Richie Aug 09 '18 at 04:52
- **Here's my code:** `df = tabula.read_pdf("/Downloads/Invoice1.pdf", guess=False, stream=True)` followed by `print(df)` – Richie Aug 09 '18 at 04:53
- Set `pages="all"` or `pages=2` for `read_pdf()` or `convert_into()`. For further details, it would be good to read the manual https://github.com/chezou/tabula-py/blob/master/README.md or check the test code https://github.com/chezou/tabula-py/blob/master/tests/test_read_pdf_table.py – chezou Aug 09 '18 at 04:59
- Thanks for your update. I updated my code and went through the docs as you mentioned, but had no luck extracting the table. It's alright, I'm working with Python's **invoice2Data** code on GitHub and it's helping a bit. I just need to automate this process. Thanks a lot for your application. It's great! – Richie Aug 09 '18 at 05:56
0
I solved this problem with tabula-py. Install it:

    conda install tabula-py

and then run:

    >>> import tabula
    >>> area = [70, 30, 750, 570]  # Seems to have to be done manually
    >>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
    ...     stream=True, multiple_tables=False, area=area, pages="all",
    ... )  # `tabula` doc explains params very well
    >>> page2

and I got this result:
> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [       ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS          Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
This is an iterable object, so you can process it with `for row in page2:`.
Hope it helps.
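
In the output above, the rows with `NaN` in `ShortTitle` (3, 69, 71) look like wrapped lines of the previous row's `Text` cell. A minimal sketch of one way to fold them back, assuming each row is a sequence of cells and that an empty first cell marks a continuation. The helper names `is_blank` and `merge_wrapped_rows` are made up for this illustration and are not part of tabula-py:

```python
import math


def is_blank(cell):
    """True for None, empty strings and float NaN (how pandas marks empty cells)."""
    if cell is None or cell == "":
        return True
    return isinstance(cell, float) and math.isnan(cell)


def merge_wrapped_rows(rows, key_index=0):
    """Fold rows whose key cell is blank into the row that precedes them."""
    merged = []
    for row in rows:
        if is_blank(row[key_index]) and merged:
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_blank(cell):
                    continue
                # Append the wrapped text to the matching cell of the previous row.
                previous[i] = cell if is_blank(previous[i]) else f"{previous[i]} {cell}"
        else:
            merged.append(list(row))
    return merged


# Sample rows copied from the output above.
rows = [
    ("b2bTools", "Predictions for protein biophysical features and",
     "https://bio2byte.be/b2btools/"),
    (float("nan"), "their conservation", float("nan")),
    ("ModFOLD8", "Quality estimates for 3D protein models",
     "https://www.reading.ac.uk/bioinf/ModFOLD/"),
]
merged = merge_wrapped_rows(rows)
for row in merged:
    print(row)
```

With a pandas DataFrame, `df.itertuples(index=False)` yields rows in the shape this helper expects.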

zhangjq
0
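Borderless table extraction with tabula-py:
tabula-py has a `stream` option which, when set to True, detects tables based on the gaps (whitespace) between columns.

    from tabula import convert_into

    src_pdf = r"src_path"  # path to the source PDF
    des_csv = r"des_path"  # path of the CSV file to write
    # stream=True uses whitespace instead of ruling lines; pages="all" covers every page
    convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")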
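
To illustrate what detecting a table from gaps means, here is a toy sketch on plain text lines. It is not tabula's actual algorithm (tabula works on the positions of text in the PDF, not on characters), and the function names and the `min_gap` parameter are assumptions made up for this example. The idea: a character position that is blank in every line is treated as a column separator.

```python
def find_column_gaps(lines, min_gap=2):
    """Return (start, end) spans of character positions that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(line[i] == " " for line in padded) for i in range(width)]

    gaps = []
    start = None
    # The trailing False acts as a sentinel that closes a gap reaching the end.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_gap:  # ignore single spaces inside a cell
                gaps.append((start, i))
            start = None
    return gaps


def split_on_gaps(line, gaps):
    """Cut one line into cells at the given gap spans."""
    cells = []
    pos = 0
    for start, end in gaps:
        cells.append(line[pos:start].strip())
        pos = end
    cells.append(line[pos:].strip())
    return cells


# A small borderless table, laid out with fixed-width formatting.
data = [("Item", "Qty", "Price"), ("Blue widget", "2", "9.50"), ("Nut", "10", "0.25")]
lines = [f"{name:<11}   {qty:>3}   {price:>5}" for name, qty, price in data]

gaps = find_column_gaps(lines)
table = [split_on_gaps(line, gaps) for line in lines]
for row in table:
    print(row)
```

The single space inside "Blue widget" is blank in every line too, but it is narrower than `min_gap`, so it is not mistaken for a column boundary.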

dataninsight