I am trying to read a PDF with tabula-py's read_pdf() function and pandas. It works fine, except for multiline text fields like the one shown below:

[Image: multiline text field in the PDF]

I'm expecting the following output after converting the DataFrame to a list:

['Gewürzmischung Zaatar', 'Kartoffeln (Drillinge)']

But I'm getting:

['Gewürzmischung', 'Kartoffeln (Drillinge)', 'Zaatar']

I would appreciate any tips/workarounds on how to solve this problem.

My code looks like this:

#!/usr/bin/env python3

import sys

import pandas as pd
import tabula as tb

pdf_path = sys.argv[1]

# read_pdf() returns a list of DataFrames; take the first table found in the given area of page 1
zutatenliste = tb.read_pdf(pdf_path, area=(178, 621, 600, 788), encoding='utf-8', pages='1')[0]
zutatenliste = zutatenliste.fillna('')

# Flatten the table row by row, skipping empty cells
zutatenliste2 = []
for row in zutatenliste.values.tolist():
    for cell in row:
        if cell == '':
            continue
        zutatenliste2.append(cell)

print(zutatenliste2)  # prints ['Gewürzmischung', 'Kartoffeln (Drillinge)', '"Za\'atar"']
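
One possible explanation for the order, and a workaround sketch: if the second line of the text field ends up as its own table row, then flattening row by row puts it after everything in the first row. The snippet below is only a minimal sketch under that assumption. The `rows` layout is a guess at what `values.tolist()` might return (it was not taken from the actual PDF), and `flatten_column_major` is a hypothetical helper, not part of tabula-py. It also assumes that every column holds exactly one entry, so it would wrongly merge a column that contains several separate ingredients.

```python
def flatten_column_major(rows):
    """Join the non-empty cells of each column, top to bottom.

    Assumption: each column contains exactly one (possibly multiline) entry.
    """
    merged = []
    for column in zip(*rows):  # transpose: iterate over columns instead of rows
        parts = [str(cell).strip() for cell in column if str(cell).strip()]
        if parts:
            merged.append(" ".join(parts))
    return merged


# Hypothetical layout: the wrapped line "Zaatar" landed in a second row
rows = [
    ["Gewürzmischung", "Kartoffeln (Drillinge)"],
    ["Zaatar", ""],
]

print(flatten_column_major(rows))
```

With the real data, the same helper could be called on `zutatenliste.values.tolist()` after `fillna('')`, provided the DataFrame really has this shape.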
noxxer