I'm trying to extract information from a table inside a PDF file. The table (found on page 32 of https://iservice.lombardini.it/documents/ProdCateg/1411/ED0053029340_MO_KDW_702_1003_1404.pdf) looks like this: [image]

I've been using tabula-py and python-tabulate to extract the tables, and I can extract information from cells that contain characters. For example, if I run the following code sample:

```python
import requests
import tabula
import tabulate
from io import BytesIO

response = requests.get("https://iservice.lombardini.it/documents/ProdCateg/1411/ED0053029340_MO_KDW_702_1003_1404.pdf")
pdf_data = BytesIO(response.content)

# Recent tabula-py versions return a list of DataFrames, one per detected table
tables = tabula.read_pdf(pdf_data, pages=32)
for df in tables:
    print(tabulate.tabulate(df, tablefmt="psql"))
```

I get this as a result: [image]. My issue is that the information I need from this table comes from the shaded cells, and since those cells do not contain any characters, tabula-py isn't able to extract anything from them. Based on this answer: https://stackoverflow.com/a/72293700/16344960, I think this happens because tabula-py decides what is part of the table from its horizontal and vertical lines, but I'm not sure of much beyond that. I've also tried to extract these cells using PyPDF2 and pdfminer (to read the text) and have had no luck with them either.
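Since the shaded cells hold no characters, a text extractor has nothing to return for them. One possible workaround, sketched here under assumptions rather than tested against this PDF, is to render the page to a grayscale image and compare the average brightness inside each cell with a threshold. The function name `cell_is_shaded`, the `threshold` and `inset` values, and the pixel boxes are all hypothetical. Rendering the page and locating the cell boxes are assumed to happen elsewhere.

```python
from statistics import mean


def cell_is_shaded(gray, box, threshold=230, inset=2):
    """Guess whether a cell is shaded from its average brightness.

    gray      -- 2D sequence of grayscale rows, values 0 (black) to 255 (white)
    box       -- (left, top, right, bottom) of the cell in pixel coordinates
    threshold -- assumed cutoff: a mean below this counts as shaded
    inset     -- pixels trimmed from each edge so ruling lines are not sampled
    """
    left, top, right, bottom = box
    rows = gray[top + inset:bottom - inset]
    pixels = [p for row in rows for p in row[left + inset:right - inset]]
    if not pixels:
        # Box too small to sample once the inset is applied
        return False
    return mean(pixels) < threshold


# Synthetic 20x40 "page": left half white, right half light gray
page = [[255] * 20 + [180] * 20 for _ in range(20)]
left_cell = (0, 0, 20, 20)
right_cell = (20, 0, 40, 20)
print(cell_is_shaded(page, left_cell), cell_is_shaded(page, right_cell))
```

The threshold would need tuning per document, and a shaded cell that also contains dark text or symbols could skew the mean, so treat this as a starting point only.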

  • What information are you seeking to extract? It appears that the shaded boxes are empty. – Chris Happy Jun 02 '22 at 20:58
  • Sorry, I should have been more specific: I'm looking to get information about whether or not the cell is shaded – Max Bromet Jun 02 '22 at 21:00
  • Speaking entirely from my own personal experience with similar requirements: while *possible*, it's extremely unlikely that the PDF file itself contains the appropriate information for *any* non-OCR solution to correctly/reliably extract this shading data. Without getting too far into the technical weeds, most PDF drivers used to export these types of documents use methodologies which decontextualize this formatting data from the broader tabular layout. (1/2) – esqew Jun 02 '22 at 21:36
  • You're better off evaluating more OCR-based libraries to assist in doing this job, but even then OCR's primary capability is to *read textual* data, not make decisions on what table cells are which color. (2/2) – esqew Jun 02 '22 at 21:37
  • @esqew I'll give OCR-based libraries a try! As far as OCR-based libraries go, would you say that pytesseract is a good starting place? – Max Bromet Jun 02 '22 at 21:41
  • @MaxBromet As your use case is particularly unique in this regard, I would say the underlying Tesseract OCR engine is not a good fit for this use case - as I mentioned above, it (like most OCR libraries or providers) specializes in *text* extraction - I would go so far as to say you may need to build an entirely custom machine learning-driven OCR model for your own purposes. For my clients evaluating these types of use cases, it generally tends to be much easier to obtain the underlying data that was used to build these graphics instead of trying to shoehorn something else. – esqew Jun 02 '22 at 21:48
  • I found a way to get cell color using pdfplumber. See my post: https://stackoverflow.com/questions/72291875/how-can-i-extract-the-background-color-of-a-table-cell-within-a-pdf-file-using-p/73759921#73759921 – toshi Sep 18 '22 at 03:32
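
The pdfplumber idea in the last comment can be sketched roughly as follows. This is not the linked post's code, only a minimal illustration under assumptions: that the shading is drawn as filled rectangles (which pdfplumber exposes as `page.rects`, each a dict with `x0`, `top`, `x1`, `bottom` and `non_stroking_color`), and that `page.find_tables()` detects the grid. The helper names and the `white_cutoff` and `min_cover` values are hypothetical. If this PDF draws its shading as curves or images instead, the approach would need adapting.

```python
def is_filled(color, white_cutoff=0.95):
    """True if a fill colour looks visibly darker than white (assumed cutoff)."""
    if color is None:
        return False
    values = list(color) if isinstance(color, (list, tuple)) else [color]
    if not values or not all(isinstance(v, (int, float)) for v in values):
        return False  # pattern or otherwise non-numeric colour: not handled here
    if len(values) == 4:
        # Assumed CMYK: any ink at all means the fill is not white
        return max(values) > 1 - white_cutoff
    return min(values) < white_cutoff  # gray or RGB


def overlap_fraction(cell, rect):
    """Fraction of the cell's area covered by the rectangle."""
    x0, top, x1, bottom = cell
    w = min(x1, rect["x1"]) - max(x0, rect["x0"])
    h = min(bottom, rect["bottom"]) - max(top, rect["top"])
    area = (x1 - x0) * (bottom - top)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def shaded_cells(cells, rects, min_cover=0.5):
    """For each cell bbox, report whether a filled rect covers most of it."""
    # "fill" is treated as optional; rects without the key are still considered
    fills = [r for r in rects
             if r.get("fill", True) and is_filled(r.get("non_stroking_color"))]
    return [any(overlap_fraction(c, r) >= min_cover for r in fills) for c in cells]


def load_cells_and_rects(path, page_number):
    """Read the first table's cell boxes and the page's rects with pdfplumber."""
    import pdfplumber  # third-party; imported lazily so the helpers above run without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number - 1]  # pages are 0-indexed in the list
        tables = page.find_tables()
        cells = tables[0].cells if tables else []
        return cells, page.rects
```

With a local copy of the PDF, usage might look like `cells, rects = load_cells_and_rects("manual.pdf", 32)` followed by `shaded_cells(cells, rects)`, where `manual.pdf` is a placeholder file name.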

0 Answers