Pdfplumber misses first column and last row for all tables within a schematic

Question

I am new to pdfplumber, and I am amazed at how well it extracts text from tables.

It is easy to use on tables that fill a whole page, but in my case I am working with topological schematics that have some tables inside them.

It fails to extract the first column and the last row of every table in the document. I have tried tweaking several parameters in the table_settings variable, but I have not been able to get a better result (when I use "text" instead of "lines", the rest of the text in the schematic is treated as a table).

Any help with this? I am using Python 3.9.8, and the PDF for testing can be found here: schematic.pdf

Here is the source code:

import pdfplumber

pdf_file = "Schematic.pdf"
with pdfplumber.open(pdf_file) as pdf:
    # Extract every table found on the first page, using the default settings
    tbl = pdf.pages[0].extract_tables()
    print(tbl)

For those stumbling here, this question has already been answered at https://github.com/jsvine/pdfplumber/discussions/544#discussioncomment-1681858 — Samkit Jain, Nov 30 '21 at 15:53
@Samkit Jain You may want to post your answer here on SO as well, so that people have a response available. Most people don't even read the comments if the question has not been answered. — Pablo, Dec 02 '21 at 18:35

score 1 · Accepted Answer · answered Dec 03 '21 at 11:24

Some of the edges in the PDF look like lines but are not stored as what pdfplumber treats as lines. In such cases, all the curves and edges can be passed explicitly as lines. The following table settings worked for this case:

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": page.curves+page.edges,
    "explicit_horizontal_lines": page.curves+page.edges,
    "intersection_tolerance": 15,
}
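For completeness, here is a minimal sketch of how these settings might be wired into the script from the question. The helper names build_table_settings and extract_schematic_tables are hypothetical (they are not part of pdfplumber), and the sketch assumes the tables are on the first page, as in the question.

```python
def build_table_settings(page, intersection_tolerance=15):
    """Build the explicit-line settings shown above for one page.

    `page` is expected to be a pdfplumber page (anything exposing
    `.curves` and `.edges` lists). This helper name is hypothetical.
    """
    # Treat every curve and edge on the page as a candidate table border.
    candidate_lines = page.curves + page.edges
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": candidate_lines,
        "explicit_horizontal_lines": candidate_lines,
        "intersection_tolerance": intersection_tolerance,
    }


def extract_schematic_tables(pdf_path, page_index=0):
    """Open the PDF and extract the tables of one page with those settings."""
    import pdfplumber  # imported here so the settings helper has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=build_table_settings(page))


# Example usage (assumes Schematic.pdf is in the working directory):
# for table in extract_schematic_tables("Schematic.pdf"):
#     for row in table:
#         print(row)
```

The tolerance of 15 is the value that worked for this particular PDF; other documents may need a different value.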

['(cid:47)(cid:44)(cid:54)(cid:55)(cid:36)(cid:3)(cid:39)(cid:40)(cid:3)(cid:39)(cid:40)(cid:54)(cid:57)(cid:203)(cid:50)(cid:54)', None, None, None, None, None]
['(cid:49)(cid:158)', 'PK', 'VEL.', '(cid:49)(cid:158)', 'PK', 'VEL.']
['A64', '3+100', '100 Km/h', 'A66', '3+365', '100 Km/h']
['A65', '3+189', '100 Km/h', 'S2MSU2', '5+884', '100 Km/h']
['A67', '3+363', '100 Km/h', 'S4MSU1', '6+052', '100 Km/h']
['', '', '', '', '', '']

['(cid:54)(cid:40)(cid:102)(cid:36)(cid:47)(cid:40)(cid:54)', None, None, None]
['NOMBRE', 'PK', 'NOMBRE', 'PK']
['E3', '3+720', 'EMSUF2', '5+766']
['E4', '3+784', 'EMSUF1', '5+766']
['B004F2', '4+295', 'SMSUM2', '6+185']
['B004F1', '4+295', 'SMSUM1', '6+188']
['', '', '', '']
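To check whether another PDF has the same problem, it may help to count how the page's drawing objects are stored. The following is only a rough diagnostic sketch; the function name count_line_like_objects is hypothetical, while lines, rects, curves and edges are page properties that pdfplumber provides.

```python
def count_line_like_objects(page):
    """Count the drawing objects on a page that could form table borders.

    `page` is expected to be a pdfplumber page. Many curves and few lines
    suggests that the table borders are not stored as plain lines.
    """
    return {
        "lines": len(page.lines),
        "rects": len(page.rects),
        "curves": len(page.curves),
        "edges": len(page.edges),
    }


# Example usage (assumes Schematic.pdf is in the working directory):
# import pdfplumber
# with pdfplumber.open("Schematic.pdf") as pdf:
#     print(count_line_like_objects(pdf.pages[0]))
```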

Thank you and congratulations for this great library — Pablo, Dec 06 '21 at 02:14
