I am trying to extract a specific table from a pdf, the pdf looks like the image below
I tried with different libraries on python,
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df
With PyPDF2
pdf_file = open("./tmp/pdf/Food Calories List.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
data = page_content
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
aux = page_content
df = pd.DataFrame([x.split(';') for x in aux.split('\n')])
Even with textract and beautiful soup, the issue that I am facing is that the output format is a mess, Is there any way to extract this table with a better format?