When extracting a table using camelot, the text of two columns that are close together is merged into one cell, even though all grid lines are detected correctly. I am using the lattice flavor, as the table in the PDF has ruling lines. I set split_text=True, but it has no effect.
I have since got it to work correctly, but I don't know why it didn't work before.
Here is the code example that doesn't work (example file: test.pdf):
# -*- coding: utf-8 -*-
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text
import camelot
file = "test.pdf"
laparams = LAParams(
    line_overlap=0.5,
    char_margin=0.5,  # tried decreasing this parameter, default: 5
    word_margin=0.1,
    line_margin=0.0,
    boxes_flow=0.5,
    detect_vertical=False,
    all_texts=False,
)
# extract the table
tables = camelot.read_pdf(
    file,
    flavor='lattice',
    pages="1",
    process_background=False,
    line_tol=2,
    joint_tol=2,
    line_scale=30,  # increased from 15 to detect smaller lines
    layout_params=laparams,
    split_text=True,
)
# the grid is extracted correctly
camelot.plot(tables[0], kind='grid').show()
# the text is not split at the grid lines where it should be;
# specifically, the texts 'Requirement/Function/Configuration' and 'GxP' are merged together
camelot.plot(tables[0], kind='text').show()
# notice that when using pdfminer, the char_margin parameter makes a difference
# but in camelot.read_pdf it doesn't seem to affect the text extraction
texts = extract_text(file, page_numbers=[0], maxpages=1, laparams=laparams)
texts = texts.split('\n')
print(texts)
I have attached the text and grid plots. As you can see, the columns are detected correctly, but the text spans two columns. I marked the affected cell and the place where the text should be split.
Here is the code that works. I just pass the arguments to read_pdf() as a dictionary. I don't know why this makes a difference.
import camelot
file = "test.pdf"
laparams = {
    'line_overlap': 0.5,
    'char_margin': 0.5,
    'word_margin': 0.1,
    'line_margin': 0.0,
    'boxes_flow': 0.5,
    'detect_vertical': False,
    'all_texts': False,
}
camelotArgs = {
    'flavor': 'lattice',
    'process_background': False,
    'line_tol': 2,
    'joint_tol': 2,
    'line_scale': 30,  # increased from 15 to detect smaller lines
    'split_text': True,
    'layout_kwargs': laparams,
}
# extract the table
tables = camelot.read_pdf(
    file,
    pages="1",
    **camelotArgs
)
# show results
camelot.plot(tables[0], kind='grid').show()
camelot.plot(tables[0], kind='text').show()
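One possible explanation, stated as an assumption rather than confirmed against the camelot source: the two snippets differ in more than the use of a dictionary. The first passes the pdfminer settings under the keyword layout_params (as an LAParams object), while the second passes them under layout_kwargs (as a plain dict). If read_pdf() collects unrecognised keywords in **kwargs without raising an error, the settings in the first snippet would be silently dropped, which would be consistent with char_margin having no visible effect there. The sketch below illustrates that general Python behaviour with a stand-in function; read_pdf_standin and KNOWN_OPTIONS are hypothetical names made up for illustration and are not camelot's API.

```python
# Hypothetical allow-list for illustration only; not taken from camelot's source.
KNOWN_OPTIONS = {"process_background", "line_tol", "joint_tol", "line_scale", "split_text"}


def read_pdf_standin(filepath, flavor="lattice", layout_kwargs=None, **kwargs):
    """Stand-in reader: returns the layout settings it would actually use,
    plus the names of any keywords it did not recognise."""
    effective_layout = dict(layout_kwargs or {})
    # Misspelled keywords land in **kwargs and never reach the layout settings.
    unrecognised = sorted(set(kwargs) - KNOWN_OPTIONS)
    return effective_layout, unrecognised


laparams = {"char_margin": 0.5, "word_margin": 0.1}

# Keyword spelled as in the first snippet: the settings are not picked up.
layout_a, unrecognised_a = read_pdf_standin(
    "test.pdf", layout_params=laparams, split_text=True
)

# Keyword spelled as in the second snippet: the settings reach the layout step.
layout_b, unrecognised_b = read_pdf_standin(
    "test.pdf", layout_kwargs=laparams, split_text=True
)

print(layout_a, unrecognised_a)
print(layout_b, unrecognised_b)
```

If this is what happens, changing only the keyword name in the first snippet (and passing a dict instead of an LAParams object) should be enough to reproduce the working result, without unpacking a dictionary of arguments.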
Python version: 3.7.11
camelot-py version: 0.10.1
pdfminer.six version: 20211012