I'm trying to read a PDF with tabula-py's read_pdf() method and pandas. It works fine, except for text fields that wrap onto multiple lines.
I'm expecting the following output after converting the DataFrame to a list:
['Gewürzmischung Zaatar', 'Kartoffeln (Drillinge)']
But I'm getting:
['Gewürzmischung', 'Kartoffeln (Drillinge)', 'Zaatar']
I would appreciate any tips/workarounds on how to solve this problem.
My code looks like this:
#!/usr/bin/env python3
import tabula as tb
import pandas as pd
import sys
file = sys.argv[1]
zutatenliste = tb.read_pdf(file, area=(178, 621, 600, 788), encoding='utf-8', pages='1')[0]
zutatenliste.fillna('', inplace=True)
zutatenliste2 = []
for i in zutatenliste.values.tolist():
    for j in i:
        if j == '':
            continue
        zutatenliste2.append(j)
print(zutatenliste2) # prints ['Gewürzmischung', 'Kartoffeln (Drillinge)', '"Za\'atar"']
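
One possible workaround, as a minimal sketch only: it assumes tabula puts the wrapped second line into its own row, with the other cells of that row left empty, so any row containing an empty cell can be treated as a continuation of the row above. The helper name `merge_wrapped_rows` and the `sample` frame are hypothetical stand-ins (the sample is made up to mirror the output above, not read from the PDF).

```python
import pandas as pd


def merge_wrapped_rows(df):
    """Hypothetical helper: fold continuation rows into the row above.

    Assumption: a row with at least one empty cell is the wrapped
    second line of the row directly above it.
    """
    merged = []  # list of rows, each row a list of cell strings
    for row in df.fillna('').astype(str).values.tolist():
        is_continuation = bool(merged) and any(cell == '' for cell in row)
        if is_continuation:
            # Append each non-empty fragment to the cell above it in the same column.
            merged[-1] = [
                f"{above} {cell}".strip() if cell else above
                for above, cell in zip(merged[-1], row)
            ]
        else:
            merged.append(row)
    # Flatten row by row, skipping cells that are still empty.
    return [cell for row in merged for cell in row if cell]


# Made-up stand-in for the frame returned by tb.read_pdf(...)[0].
sample = pd.DataFrame([
    ['Gewürzmischung', 'Kartoffeln (Drillinge)'],
    ['Zaatar', None],
])
print(merge_wrapped_rows(sample))
```

The obvious weakness of this heuristic is that a genuine last row with fewer entries than columns would also be merged into the row above, so it would need an extra check for tables where that can happen.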