I am using the tabula-py package for Python to extract tables from multiple PDF files. This works beautifully for tables with several rows, but some of the PDF files contain tables with only a single row. When I try to convert these PDFs, tabula returns an empty list. It makes sense that these files are problematic, since a single-row table is essentially just another line of text.
However, it is important that these PDFs are also converted into DataFrames, since they appear fairly frequently in my dataset. Unfortunately, the PDF files are proprietary, so I can't show them here. I hope a solution can still be found despite this limitation. Below is the line of code that does the conversion.
import tabula
df = tabula.read_pdf(DIRECTORY + file_name, pages='all', pandas_options={'header': None}, encoding='utf-8')
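To make the failing cases easier to isolate, here is a minimal sketch that runs the same call over a batch of files and collects the names for which the result is an empty list. The helper names `find_empty_results` and `tabula_reader` are hypothetical, and the reader is passed in as an argument only so the loop can be tried without tabula installed.

```python
import os


def tabula_reader(path):
    """Same call as in the question; tabula is imported lazily on purpose."""
    import tabula
    return tabula.read_pdf(path, pages="all",
                           pandas_options={"header": None}, encoding="utf-8")


def find_empty_results(directory, file_names, reader=tabula_reader):
    """Return the file names for which the reader found no tables at all."""
    empty = []
    for name in file_names:
        tables = reader(os.path.join(directory, name))
        if len(tables) == 0:  # an empty list means no table was detected
            empty.append(name)
    return empty
```

Calling `find_empty_results(DIRECTORY, list_of_names)` would then give the list of single-row PDFs to investigate separately.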
I've attempted to solve this problem in a few ways. First, I tried inserting an extra row in the original PDF files at the source, but this turned out to be impossible. Then I tried the tips on the tabula-py website (https://tabula-py.readthedocs.io/en/latest/faq.html#i-got-a-empty-dataframe-how-can-i-resolve-it):
- Set a specific area for accurate table detection.
- Try lattice = True option for the table having explicit line.
- Try stream = True option
Following the first tip, I specified an area using measurements taken in Adobe. This still returned an empty result. The second and third tips (lattice = True and stream = True) also returned an empty list.
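For reference, a minimal sketch of how the first and third tips can be combined in one call. It rests on several assumptions: that the Adobe measurements are in inches and measured from the top-left corner of the page, whereas tabula-py's `area` expects PDF points (72 per inch) in the order top, left, bottom, right; and that setting `guess=False` together with explicit `columns` boundaries is appropriate for the file. The helper names are hypothetical, and this may well not match the layout of my actual files.

```python
def inches_to_points(values):
    """Convert measurements in inches to PDF points (72 points per inch)."""
    return [round(v * 72, 2) for v in values]


def build_read_options(area_in_inches, columns_in_inches=None):
    """Build keyword arguments for tabula.read_pdf from inch measurements.

    area_in_inches: (top, left, bottom, right), assumed measured from the
    top-left corner of the page.
    columns_in_inches: optional x positions of the column boundaries.
    """
    options = {
        "pages": "all",
        "area": inches_to_points(area_in_inches),
        "stream": True,   # whitespace-based parsing, no ruling lines needed
        "guess": False,   # do not let tabula guess the table region
        "pandas_options": {"header": None},
    }
    if columns_in_inches:
        options["columns"] = inches_to_points(columns_in_inches)
    return options


def read_single_row_table(pdf_path, area_in_inches, columns_in_inches=None):
    """Run tabula on a fixed region; returns a list of DataFrames."""
    import tabula  # imported lazily so the helpers above work without it
    return tabula.read_pdf(pdf_path,
                           **build_read_options(area_in_inches, columns_in_inches))
```

If the measurements were taken from the bottom of the page instead, the top and bottom values would need to be subtracted from the page height before the conversion.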
So the question I have is: "Is there a way to let the tabula-py package identify tables with only a single row from a pdf?"
I'm hoping that someone knows how to solve this problem. Thanks in advance for the effort.