When I extract a table with camelot, the text of two adjacent columns that sit close together is merged into one cell, even though all grid lines are detected correctly. I am using the lattice flavor because the table in the PDF has ruling lines. I set split_text=True, but it has no effect.
I have since got it to work, but I don't understand why it failed before.
Here is the code that doesn't work (example file: test.pdf):

# -*- coding: utf-8 -*-

from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text
import camelot

file = "test.pdf"

laparams = LAParams(
    line_overlap=0.5,
    char_margin=0.5,        # tried decreasing this parameter, default: 5
    word_margin=0.1,
    line_margin=0.0,
    boxes_flow=0.5,
    detect_vertical=False,
    all_texts=False
)

# extract the table
tables = camelot.read_pdf(
    file,
    flavor='lattice',
    pages="1",
    process_background=False,
    line_tol=2,
    joint_tol=2,
    line_scale=30,          # increased from 15 to detect smaller lines
    layout_params=laparams,
    split_text=True
)

# the grid is extracted correctly
camelot.plot(tables[0], kind='grid').show()
# the text is not split at the grid lines where it should be;
# specifically, 'Requirement/Function/Configuration' and 'GxP' are merged into one cell
camelot.plot(tables[0], kind='text').show()


# notice that when using pdfminer, the char_margin parameter makes a difference
# but in camelot.read_pdf it doesn't seem to affect the text extraction
texts = extract_text(file, page_numbers=[0], maxpages=1, laparams=laparams) 
texts = texts.split('\n')
print(texts)

I have attached the text and grid plots. As you can see, the columns are detected correctly, but the text spans two columns. I marked the affected cell and the position where the text should be split.

Here is the code that works. I just pass the arguments to read_pdf() as a dictionary. I don't know why this makes a difference.

import camelot

file = "test.pdf"

laparams = {
    'line_overlap': 0.5,
    'char_margin': 0.5,
    'word_margin': 0.1,
    'line_margin': 0.0,
    'boxes_flow': 0.5,
    'detect_vertical': False,
    'all_texts': False
}

camelotArgs = {
    'flavor': 'lattice',
    'process_background': False,
    'line_tol': 2,
    'joint_tol': 2,
    'line_scale': 30,       # increased from 15 to detect smaller lines
    'split_text': True,
    'layout_kwargs': laparams
}

# extract the table
tables = camelot.read_pdf(
    file,
    pages="1",
    **camelotArgs
)

# show results
camelot.plot(tables[0], kind='grid').show()
camelot.plot(tables[0], kind='text').show()
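
Apart from the dictionary unpacking, the two snippets differ in one more place: the first passes a LAParams object under the keyword layout_params, while the second passes a plain dict under layout_kwargs. If read_pdf() only forwards layout_kwargs to pdfminer and collects every other keyword in **kwargs without validating it (an assumption, not checked against the camelot 0.10.1 source), a misspelled keyword would be dropped without any error. The sketch below shows that effect with a hypothetical stand-in, read_pdf_like; neither the function nor KNOWN_PARSER_OPTIONS is camelot code.

```python
# Hypothetical list of options the stand-in "understands"; not taken from camelot.
KNOWN_PARSER_OPTIONS = {
    "split_text", "line_scale", "line_tol", "joint_tol", "process_background",
}


def read_pdf_like(filepath, pages="1", flavor="lattice", layout_kwargs=None, **kwargs):
    """Stand-in for illustration only, NOT camelot's implementation.

    Assumes layout settings are read solely from `layout_kwargs` and that
    unrecognised keywords in **kwargs are accepted but never used.
    """
    unknown = sorted(set(kwargs) - KNOWN_PARSER_OPTIONS)
    return {
        "layout_used": dict(layout_kwargs or {}),
        "silently_ignored": unknown,
    }


layout = {"char_margin": 0.5, "word_margin": 0.1}

# Spelling from the first snippet: the settings end up in **kwargs.
failing = read_pdf_like("test.pdf", layout_params=layout, split_text=True)

# Spelling from the second snippet: the settings bind to the named parameter.
working = read_pdf_like("test.pdf", layout_kwargs=layout, split_text=True)

print(failing)
print(working)
```

Under that assumption, the first call would run with default layout settings, which would be consistent with char_margin having no visible effect in the first snippet.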

Python version: 3.7.11
camelot-py version: 0.10.1
pdfminer.six version: 20211012
[Figures: table grid plot; table text plot]
