Borderless PDF table extraction to JSON is not working properly with the Python Camelot library

Question

Can anyone help with this? After extracting a borderless table from a PDF to JSON with the Python Camelot library, the output does not match the original: some content is missing.

https://www.dropbox.com/s/vernt20ntt1z8rt/essart_wochenpla_zwei%20Scheibenhaus%20%281%29.pdf?dl=0 — Goutam Ghosh, Sep 24 '20 at 14:02

Stefano Fiorucci - anakin87 · Answer 1 · 2020-09-25T07:05:42.377


I tried the following code:

import camelot

pdf_path = '/YOUR/FILEPATH.pdf'
tables = camelot.read_pdf(pdf_path, flavor='stream')

There are two problems:

1. The header font is not read properly, so you get strange characters like (cid:71)...
2. With flavor='lattice', the table isn't detected. With flavor='stream', the table is detected, but the cells aren't detected properly.
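To gauge how badly the first problem affects an extracted table, one option is to count how many cells contain a (cid:NN) placeholder. The following is only a minimal sketch: the function name `cid_ratio` and the sample cells are made up for illustration and are not part of Camelot.

```python
import re

# Matches "(cid:NN)" placeholders, which show up when a glyph
# could not be mapped to a Unicode character during extraction.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(cells):
    """Return the fraction of non-empty cells containing a (cid:NN) marker."""
    texts = [str(c) for c in cells if str(c).strip()]
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if CID_PATTERN.search(t))
    return hits / len(texts)


# Hypothetical cell values, standing in for real extraction output.
sample_cells = ["(cid:71)(cid:72)", "Montag", "", "12:30"]
print(cid_ratio(sample_cells))
```

With Camelot, the cells could be taken from a table's DataFrame, for example `tables[0].df.values.ravel()`. A high ratio suggests the font encoding, not the table detection, is the main obstacle.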

At the moment, I think that Camelot can't properly extract this table. The maintainers are working on fixing the second problem (see this and this).
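If you want to check how both flavors behave on your own file, a rough sketch like the one below may help. The helper names `compare_flavors` and `best_flavor` are hypothetical. Only `camelot.read_pdf` and each table's `parsing_report` come from Camelot, and the reported accuracy is just a rough indicator, not proof that the cells are correct.

```python
def best_flavor(reports):
    """Return the flavor whose best table reports the highest accuracy, or None."""
    scored = [
        (max(r["accuracy"] for r in reps), flavor)
        for flavor, reps in reports.items()
        if reps  # skip flavors that detected no table at all
    ]
    return max(scored)[1] if scored else None


def compare_flavors(pdf_path, pages="1"):
    """Run Camelot with both flavors and collect each table's parsing report."""
    import camelot  # imported lazily so best_flavor works without Camelot installed

    reports = {}
    for flavor in ("lattice", "stream"):
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        reports[flavor] = [t.parsing_report for t in tables]
    return reports
```

Calling `best_flavor(compare_flavors('/YOUR/FILEPATH.pdf'))` returns 'lattice', 'stream', or None if neither flavor finds a table. If one of them looks usable, `tables[0].to_json('table.json')` or `tables[0].df.to_json(orient='records')` should produce the JSON output.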

edited Sep 25 '20 at 07:05

answered Sep 24 '20 at 13:40


The same problem happened to me too. – Goutam Ghosh Sep 24 '20 at 13:50
I am sorry that this problem can't be solved using Camelot. If my answer is useful, please mark it as accepted and upvote it. – Stefano Fiorucci - anakin87 Sep 24 '20 at 22:45
Is there any other library we can use to solve this? – Goutam Ghosh Sep 30 '20 at 05:02
But extracttable.com converts images to other formats. We need PDF to JSON. – Goutam Ghosh Sep 30 '20 at 11:49
