I'm trying to extract information from a table inside a PDF file. The table (found on page 32 of https://iservice.lombardini.it/documents/ProdCateg/1411/ED0053029340_MO_KDW_702_1003_1404.pdf) looks like this:
I've been using tabula-py and python-tabulate to extract the tables, and I am able to extract information from cells that contain characters. For example, if I run the following code sample: `
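import requests
import tabula
import tabulate
from io import BytesIO
# Download the PDF into memory so tabula-py can read it as a file-like object
response = requests.get("https://iservice.lombardini.it/documents/ProdCateg/1411/ED0053029340_MO_KDW_702_1003_1404.pdf")
pdf_data = BytesIO(response.content)
# In tabula-py 2.x, read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf(pdf_data, pages=32)
for df in tables:
    print(tabulate.tabulate(df, tablefmt="psql"))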
` I get this result: [image]. My issue is that the information I need from this table comes from the shaded cells. Because the shaded cells do not contain any characters, tabula-py isn't able to extract anything from them. Based on this answer: https://stackoverflow.com/a/72293700/16344960, I think this happens because tabula-py decides what is part of the table from the horizontal and vertical lines it detects, but I'm not sure of much beyond that. I've also tried to extract these cells using PyPDF2 and pdfminer (to read the text) and have had no luck with them either.
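A possible way to test that hypothesis, as a minimal sketch rather than a confirmed solution: if the shading is drawn as filled rectangles, a library that exposes drawing objects (pdfplumber is assumed here; it is not used anywhere above) could list them, and each one could then be matched to a table cell by position. The helper names `load_filled_rects` and `shaded_cells`, and the `row_edges` / `col_edges` grid coordinates, are hypothetical. It is also an untested assumption that this particular PDF stores the shading as rectangles rather than as curves or an embedded image.

```python
def load_filled_rects(pdf_file, page_number):
    """Return the filled rectangles pdfplumber reports on a 1-based page number."""
    import pdfplumber  # third-party; assumed installed, imported lazily

    with pdfplumber.open(pdf_file) as pdf:
        page = pdf.pages[page_number - 1]
        # Each rect is a dict with x0, x1, top, bottom and a boolean "fill" flag
        return [r for r in page.rects if r.get("fill")]


def shaded_cells(rects, row_edges, col_edges):
    """Return (row, col) indices of grid cells whose centre lies inside a filled rect.

    row_edges and col_edges are hypothetical sorted boundary coordinates of the
    table grid, in the same top-left-based units as the rect dicts.
    """
    filled = [r for r in rects if r.get("fill")]
    cells = []
    for row in range(len(row_edges) - 1):
        cy = (row_edges[row] + row_edges[row + 1]) / 2
        for col in range(len(col_edges) - 1):
            cx = (col_edges[col] + col_edges[col + 1]) / 2
            if any(
                r["x0"] <= cx <= r["x1"] and r["top"] <= cy <= r["bottom"]
                for r in filled
            ):
                cells.append((row, col))
    return cells
```

With the `pdf_data` buffer from the code above, this would be called as `shaded_cells(load_filled_rects(pdf_data, 32), row_edges, col_edges)`, where the edge lists have to be measured from the page first. White background fills would also count as shaded here, so a filter on each rect's `non_stroking_color` might be needed.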