Is it possible to extract a specific table, with its format, from a PDF?

Question

I am trying to extract a specific table from a PDF. The PDF looks like the image below.

I tried several Python libraries:

With tabula-py

from tabula import read_pdf

# In tabula-py 2.x, read_pdf returns a list of DataFrames, one per detected table
dfs = read_pdf("./tmp/pdf/Food Calories List.pdf")
dfs[0]

With PyPDF2

import pandas as pd
import PyPDF2

# Note: PyPDF2 3.x renamed these calls to PdfReader, len(reader.pages), reader.pages[0] and extract_text()
with open("./tmp/pdf/Food Calories List.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = reader.getNumPages()
    page = reader.getPage(0)
    page_content = page.extractText()

# One DataFrame row per line of extracted text, split on ';'
df = pd.DataFrame([x.split(';') for x in page_content.split('\n')])

I also tried textract and Beautiful Soup. In every case the problem is that the output format is a mess. Is there any way to extract this table in a cleaner format?

score 3 · Answer 1 · answered Jul 22 '20 at 21:46

I suspect the issues stem from the fact that the table has merged cells (on the left). Reading data from a table only works well when the rows and cells are consistent, not when some are merged and some are not.

I'd skip over the first two columns and then recreate and populate them on the left-hand side once you have the table loaded (as a pandas DataFrame, for example).

Then you can have one label per row and work with the data consistently, otherwise your cells per column will be inconsistently numbered.
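One way to repopulate the label columns after loading is to forward-fill them, so that each row carries its own label. The sketch below assumes the extraction leaves a merged cell empty (NaN) on every row except the first of its group; the column names and values are made up for illustration and are not taken from the PDF in the question.

```python
import pandas as pd

# Hypothetical extraction result: the merged "Category" cell only
# appears on the first row of each group; the rest come back empty.
raw = pd.DataFrame(
    {
        "Category": ["Fruit", None, None, "Dairy", None],
        "Food": ["Apple", "Banana", "Grapes", "Milk", "Cheese"],
        "Calories": [95, 105, 62, 103, 113],
    }
)

# Copy each label down into the empty rows below it.
filled = raw.copy()
filled["Category"] = filled["Category"].ffill()

print(filled)
```

If the extractor returns empty strings instead of NaN, they would need to be converted to missing values first (for example with `replace("", pd.NA)`) before forward-filling.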

score 0 · Answer 2 · answered Aug 10 '20 at 16:11

I would look into using tabula templates, which you can generate dynamically based on word locations on the page. This gives tabula more guidance on which area to consider and should lead to more accurate extraction. See tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
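As a rough sketch of generating such a template dynamically: the helper below builds a single-selection template from a list of word bounding boxes and passes it to tabula.read_pdf_with_template. The helper names, the word positions and the margin are hypothetical; the word boxes would have to come from some other text-location step. The key layout (page, extraction_method, x1, y1, x2, y2, width, height) follows templates exported by the Tabula app as I understand them, so treat it as an assumption and compare it against a template exported from your own PDF.

```python
import json


def build_template(word_boxes, page=1, margin=5.0):
    """Build a one-selection tabula template enclosing all word boxes.

    word_boxes: list of (x0, top, x1, bottom) tuples in PDF points,
    measured from the top-left corner of the page (assumed convention).
    """
    x1 = min(b[0] for b in word_boxes) - margin
    y1 = min(b[1] for b in word_boxes) - margin
    x2 = max(b[2] for b in word_boxes) + margin
    y2 = max(b[3] for b in word_boxes) + margin
    return [{
        "page": page,
        "extraction_method": "lattice",
        "x1": x1, "y1": y1, "x2": x2, "y2": y2,
        "width": x2 - x1, "height": y2 - y1,
    }]


def extract_with_template(pdf_path, template, template_path="template.json"):
    # Imported here so the template helper works without tabula installed.
    import tabula

    # Write the template to disk, since tabula reads it from a file path.
    with open(template_path, "w") as fh:
        json.dump(template, fh)
    return tabula.read_pdf_with_template(pdf_path, template_path)


# Hypothetical word positions marking two opposite corners of the table
boxes = [(72.0, 140.0, 110.0, 152.0), (480.0, 610.0, 540.0, 622.0)]
template = build_template(boxes)
print(template)
```

Calling extract_with_template("./tmp/pdf/Food Calories List.pdf", template) would then restrict tabula to that area; it requires tabula-py and a Java runtime.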

score 0 · Answer 3 · answered Oct 12 '20 at 05:17

Camelot is another Python library worth trying. Its advanced settings suggest that it can handle merged cells, although this will likely require tuning options such as copy_text and shift_text.

Note: Camelot can only read text-based tables. If the table is inside an image, it won't be able to extract it.

If the above is not an issue, try the sample code below:

import camelot

# copy_text=['v'] copies the text of a vertically spanning (merged) cell into each cell it covers
tables = camelot.read_pdf('./tmp/pdf/Food Calories List.pdf', pages='1', copy_text=['v'])
print(tables[0].df)
