Camelot PDF extraction fails to parse text at the table border

Question

I'm having a problem with the Camelot library.

I'm extracting data from a PDF. My code runs fine for the previous 23 pages, but on this page it fails to parse where the text of one table cell ends and the next cell begins.

I suppose the problem is that the string is so long that it reaches the table border.

I also tried "stream", but got worse results.

PDF Source Data

PDF Output LAYOUT

My parsed output looks like this:

"ALT4945\n24 V"
"70\/140 A   ALT5860\n12 V\n90 A"

The desired output is:

"ALT4945\n24 V 70\/140 A"
"ALT5860\n12 V\n90 A"

My first code, which works correctly for the previous pages, is:

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice")

From the Camelot documentation (https://camelot-py.readthedocs.io/en/master/api.html) I found these possible configuration parameters for the PDF parser:

"""PARAMS for lattice
line_scale            (default: 15)
copy_text             (default: None)
shift_text            (default: ['l', 't'])
line_tol              (default: 2)
joint_tol             (default: 2)
threshold_blocksize   (default: 15)
threshold_constant    (default: -2)
iterations            (default: 0)
resolution            (default: 300)
"""

That is when I hit the problem. I tried to solve it by "playing" with more params, but didn't find a winning combination:

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=3, joint_tol=3, threshold_blocksize=15)

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=1, joint_tol=1, threshold_blocksize=3)

Can I get some advice about which params would avoid this?

Thanks

edit1: PDF source : https://www.siom.it/images/catalogo-motorini-alter.pdf (Page 24)

Can you attach the file, or only this page, so that we can run some tests? — Stefano Fiorucci - anakin87, Nov 13 '19 at 13:33
It seems really difficult to obtain the result which you want. Maybe for such cases, you can think of some content postprocessing... — Stefano Fiorucci - anakin87, Nov 13 '19 at 14:05
@Anakin87 thanks for taking the time. I'm already doing post-processing, but it can't work if the parser splits the text wrongly. I expected that some param could help avoid that problem. — Wonka, Nov 13 '19 at 14:31
Your approach of expecting the library to reliably split where you want the splits is wrong. This just is not how PDF works. — fpbhb, Nov 17 '19 at 16:39
@fpbhb so what do you recommend for extracting PDF info? I expected the Camelot library to work as it should, and it does work for almost all pages. I "resolved" the problem by post-processing the text and discarding errors. — Wonka, Nov 18 '19 at 09:09
The library’s heuristics for properly joining/separating text runs are literally just that: heuristics. PDF generators produce all kinds of weird text runs because of kerning, spacing etc. You’ll always have to check for errors and post-process based on content, or optimize heuristics yourself for your specific case based on more detail from the PDF (page coordinates of text runs, distances, font metrics ...). PDF is a hi-fi display format, not a data container. — fpbhb, Nov 18 '19 at 16:20

score 5 · Answer 1 · edited May 13 '21 at 21:56

Tested Solution

import camelot

tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages='24',
                          flavor='stream', columns=['300'], split_text=True)

The output of tables[0].df is the following:

                                         0                                                  1
0                CATALOGO SIOM ALTERNATORI                   BOSCH  \nBOSCH  \nBOSCH  \nBOSCH
1                       ALT4800\n12 V\n65A                                ALT4830\n12 V\n70 A
2   IMPIANTO : BOSCH\nCOD.OEM : 0120489186             IMPIANTO : BOSCH\nCOD.OEM : 0120488172
3           APPLICAZIONI :\n OPEL VAUXHALL                     APPLICAZIONI :\n OPEL VAUXHALL
4                      ALT4840\n12 V\n70 A                                ALT4890\n12 V\n90 A
5   IMPIANTO : BOSCH\nCOD.OEM : 0120488186             IMPIANTO : BOSCH\nCOD.OEM : 0123315500
6           APPLICAZIONI :\n OPEL VAUXHALL                             APPLICAZIONI :\n IVECO
7                      ALT4900\n12 V\n90 A                            ALT4940\n24 V\n70/140 A
8   IMPIANTO : BOSCH\nCOD.OEM : 0123320009             IMPIANTO : BOSCH\nCOD.OEM : 0120689535
9           APPLICAZIONI :\n AUDI SKODA VW  APPLICAZIONI :\n DROGMOLLER KASSBOHRER MERCEDE...
10                 ALT4945\n24 V\n70/140 A                                ALT5860\n12 V\n90 A
11  IMPIANTO : BOSCH\nCOD.OEM : 0120689541             IMPIANTO : BOSCH\nCOD.OEM : 0120450011
12      APPLICAZIONI :\n MAN MERCEDES BENZ                          APPLICAZIONI :\n CHRYSLER
13                     ALT6600\n12 V\n90 A                                ALT6610\n24 V\n80 A
14  IMPIANTO : BOSCH\nCOD.OEM : 0124325058             IMPIANTO : BOSCH\nCOD.OEM : 0124555001
15            APPLICAZIONI :\n FIAT LANCIA                     APPLICAZIONI :\n MERCEDES BENZ
16                                     Pag                                                .24

Explanation

From the docs, the stream parser seems to fit the shared document better than lattice:

Stream can be used to parse tables that have whitespaces between cells to simulate a table structure.

When the stream parser finds incorrect column separators, you can specify them by hand, as x-coordinates, in the columns argument. The split_text option then tells Camelot to split text that spans those column boundaries :)
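
To make the idea of a column separator concrete, here is a minimal sketch in plain Python. It is not Camelot's implementation, which works on the PDF's text objects and is more involved; the function name split_by_columns and the x positions are hypothetical, chosen only to show how a single separator at x=300 can cut one long text line into two cells.

```python
def split_by_columns(words, separators):
    """Group (x, text) words into columns bounded by x-coordinate separators."""
    bounds = sorted(separators)
    columns = [[] for _ in range(len(bounds) + 1)]
    for x, text in sorted(words):
        # The column index is the number of separators at or left of x.
        index = sum(1 for b in bounds if x >= b)
        columns[index].append(text)
    return [" ".join(col) for col in columns]


# Hypothetical x positions for one text line that runs across both cells.
line = [(60.0, "24"), (75.0, "V"), (250.0, "70/140"), (285.0, "A"), (320.0, "ALT5860")]
print(split_by_columns(line, [300.0]))
```

With the separator placed between the two cells, the amperage stays with the left cell instead of leaking into the right one, which is the effect the question was after.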

Discussions

Although fpbhb criticized scraping PDFs in the comments, I would be rather optimistic in your specific case. The document you shared is well structured, so I would definitely try to parse it. But fpbhb's point still stands: the extraction is heuristic, so additional precautions are required.

I suggest using regular expressions to test what you get from Camelot.

You can use the code below as a starting point:

import re
import logging

def test_tables(tables):
    # headers
    HEADER_L = re.compile(r'^CATALOGO SIOM ALTERNATORI$')
    HEADER_R = re.compile(r'^BOSCH  \nBOSCH  \nBOSCH  \nBOSCH$')

    # main cell rows (raw strings keep regex escapes such as \d intact)
    CELL_ROWS = [
        re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$'),
        re.compile(r'^IMPIANTO : .*?\nCOD\.OEM : [\dA]{9,10}$'),
        re.compile(r'^APPLICAZIONI :(\n[A-Z \.-]*)?$')
    ]

    # bottom line should be Pag.##
    PAGE = re.compile(r'^Pag\.\d{1,3}$')

    for ti, table in enumerate(tables):
        rows = table.df.to_numpy()
        # test headers
        if not HEADER_L.match(rows[0, 0]):
            logging.warning('tables[{}].df.iloc[0][0]: HEADER_L != {}'.format(ti, rows[0, 0]))
        if not HEADER_R.match(rows[0, 1]):
            logging.warning('tables[{}].df.iloc[0][1]: HEADER_R != {}'.format(ti, rows[0, 1]))

        # test bottom line
        page_str = ''.join(rows[-1])
        if not PAGE.match(page_str):
            logging.warning('tables[{}].df.iloc[-1]: PAGE != {}'.format(ti, page_str))

        # test cells
        for idx, row in enumerate(rows[1:-1]):
            row_idx = idx % 3
            pattern = CELL_ROWS[row_idx]
            if not pattern.match(row[0]):
                logging.warning('tables[{}].df.iloc[{}][0]: ROW {} != {}'.format(ti, idx+1, row_idx, row[0]))
            if not pattern.match(row[1]):
                logging.warning('tables[{}].df.iloc[{}][1]: ROW {} != {}'.format(ti, idx+1, row_idx, row[1]))
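
As a quick sanity check, the first pattern in CELL_ROWS can be run against the strings from the question. This is only a sketch: the names CELL_HEAD and is_valid_head are made up for the illustration, and the sample strings are copied from the lattice output in the question and from the stream output shown earlier.

```python
import re

# Same expression as the first entry of CELL_ROWS, written as a raw string.
CELL_HEAD = re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$')


def is_valid_head(cell):
    """Return True if the cell looks like 'ALTnnnn / voltage / amperage'."""
    return CELL_HEAD.match(cell) is not None


samples = [
    "ALT4945\n24 V",                    # lattice output: amperage missing
    "70/140 A   ALT5860\n12 V\n90 A",   # lattice output: leftover amperage in front
    "ALT4945\n24 V\n70/140 A",          # stream output with columns=['300']
    "ALT5860\n12 V\n90 A",              # stream output with columns=['300']
]
for cell in samples:
    print(repr(cell), is_valid_head(cell))
```

The two mis-split cells from the lattice run are rejected, so a check like this would have flagged page 24 automatically.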

Test first 24 pages

pages_till_24 = ','.join([str(i) for i in range(1,25)])
tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages=pages_till_24, 
                          flavor='stream', columns=['300'], split_text=True) 
test_tables(tables)

It gives only one insignificant warning (extra whitespace)

WARNING:root:tables[8].df.iloc[7][1]: ROW 0 != ALT122300
12 V 
45 A
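
If warnings like this one are just noise, one option is to normalize each cell before matching. The sketch below assumes that trailing whitespace on a line never carries meaning in this catalogue; normalize_cell and CELL_HEAD are hypothetical names used only for the illustration.

```python
import re

# Same expression as the first entry of CELL_ROWS, written as a raw string.
CELL_HEAD = re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$')


def normalize_cell(text):
    """Strip trailing whitespace from every line of a cell."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


raw = "ALT122300\n12 V \n45 A"  # the cell reported in the warning
print(CELL_HEAD.match(raw) is not None)
print(CELL_HEAD.match(normalize_cell(raw)) is not None)
```

Calling normalize_cell on row[0] and row[1] inside test_tables before pattern.match would silence this class of warning while still reporting real mis-splits.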

Conclusion

Well, it looks like you can be happy: it seems to work, and you now have code to test the other pages. Good luck :)

Wow, thanks for your answer and your time. I will check it tomorrow. I started my project with flavor="stream", but it sometimes changes the order of strings. I will try your params over all pages. My project is finished, though: I post-process the text with a bit of "auto-learning" and discard errors to /dev/null, if you know what I mean. — Wonka, Nov 21 '19 at 19:42
Hey @MrPisarik, I tested your code and it seems to run OK! By "auto-learning" I mean taking the manufacturer string, detected by regex, as a key in a dict, and using the keys to parse future input strings. — Wonka, Nov 22 '19 at 13:48