First of all, I’m new to Python, so please bear with me. I have a PDF file with Spanish vocabulary on the left and the German translation on the right. Sometimes there are also a few example sentences to show how the word is used. Here’s what the PDF looks like:

Example of PDF

I want to write a Python script that takes all the vocabulary, the translations and the example sentences (plus their translations) and produces a CSV file with four columns. This is what the CSV file should look like:

Example of ideal CSV
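
To make the target format concrete, here is a minimal sketch of the writing step using the standard csv module. It assumes the entries have already been collected as tuples of two or four strings; the column order and the second entry ("el viaje" / "die Reise") are hypothetical and only there for illustration.

```python
import csv
import io

# Each entry is (Spanish, German) or
# (Spanish, German, Spanish example, German example).
entries = [
    (
        "la tercera edad",
        "die Senioren",
        "Hay descuentos en los viajes para la tercera edad.",
        "Für Senioren gibt es bei Reisen Ermässigung.",
    ),
    ("el viaje", "die Reise"),  # hypothetical entry without example sentences
]

# Write to an in-memory buffer here; a real script would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
for entry in entries:
    writer.writerow(entry)  # rows have two or four columns

csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, the buffer could be replaced with open('vocabulary.csv', 'w', newline='', encoding='utf-8'), where the file name is just a placeholder.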

I can read the text line by line, which works fine as long as there are no example sentences. If there’s an example sentence, however, the lines look something like this:

Für Senioren gibt es bei   Hay descuentos en los viajes
Reisen Ermässigung.   para la tercera edad.

The Spanish sentence should read: Hay descuentos en los viajes para la tercera edad. The German sentence should read: Für Senioren gibt es bei Reisen Ermässigung. Ideally, the two example sentences should be attached to their "base" word, i.e. to "la tercera edad" / "die Senioren" in the example above, so that "la tercera edad" gets a row with four columns. Entries without example sentences only need two columns.
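
The merging step described above could be sketched roughly as follows. This is not a solution for the whole PDF: it assumes the two columns are always separated by at least two spaces, as in the sample lines, and that the lines belonging to one example sentence are already known. The names split_columns and merge_wrapped are made up for this sketch.

```python
import re

# Assumption: a run of two or more whitespace characters marks the column gap.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one extracted line into its left and right column text."""
    parts = COLUMN_GAP.split(line.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], ""  # no gap found: treat the whole line as left column


def merge_wrapped(lines):
    """Join the left parts and the right parts of consecutive wrapped lines."""
    left_parts, right_parts = [], []
    for line in lines:
        left, right = split_columns(line)
        if left:
            left_parts.append(left)
        if right:
            right_parts.append(right)
    return " ".join(left_parts), " ".join(right_parts)


german, spanish = merge_wrapped([
    "Für Senioren gibt es bei   Hay descuentos en los viajes",
    "Reisen Ermässigung.   para la tercera edad.",
])
print(german)
print(spanish)
```

The open question is still how to tell which lines belong together and which "base" word they should be attached to.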

Here's what I've done:

import pdfplumber

# Open the PDF and extract the text of a single page (index 23)
with pdfplumber.open('spanish.pdf') as pdf:
    page = pdf.pages[23]
    text = page.extract_text()

# Print the page text line by line
for line in text.split('\n'):
    print(line)

Among other lines, this prints the following:

Für Senioren gibt es bei   Hay descuentos en los viajes
Reisen Ermässigung.   para la tercera edad.
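
Because both columns end up interleaved in each printed line, another idea is to work with word positions instead of plain text. The sketch below assumes word dictionaries with "text", "x0" and "top" keys, which is the shape pdfplumber's page.extract_words() returns, and a fixed x coordinate separating the columns. The coordinates, the boundary value and the helper fake_words are invented for illustration; on a real page the boundary might be something like page.width / 2, depending on the layout.

```python
def split_by_column(words, boundary):
    """Collect the text left and right of a vertical boundary, in reading order."""
    left, right = [], []
    # Sort top to bottom first, then left to right within each line.
    for word in sorted(words, key=lambda w: (round(w["top"]), w["x0"])):
        target = left if word["x0"] < boundary else right
        target.append(word["text"])
    return " ".join(left), " ".join(right)


def fake_words(text, x0, top):
    """Build made-up word dicts for testing, spaced 40 units apart."""
    return [
        {"text": token, "x0": x0 + 40 * i, "top": top}
        for i, token in enumerate(text.split())
    ]


# Hypothetical coordinates: left column starts at x=50, right column at x=300.
words = (
    fake_words("Für Senioren gibt es bei", 50, 100)
    + fake_words("Hay descuentos en los viajes", 300, 100)
    + fake_words("Reisen Ermässigung.", 50, 112)
    + fake_words("para la tercera edad.", 300, 112)
)

left_text, right_text = split_by_column(words, boundary=280)
print(left_text)
print(right_text)
```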

Maybe there's a way to do it with tabula-py? I'd appreciate any help.

Cheers.

orejoorejo