First of all, I’m new to Python, so please bear with me. I have a PDF file with Spanish vocabulary on the left and the German translation on the right. Sometimes there are also a few example sentences to show how the word is used. Here’s what the PDF looks like:
I want to write a Python script that takes all the vocabulary, the translations and the example sentences (plus their translations) and produces a CSV file with four columns. This is what the CSV file should look like:
I can read the text line by line, which works fine as long as there are no example sentences. If there’s an example sentence, however, the extracted lines look something like this:
Für Senioren gibt es bei Hay descuentos en los viajes
Reisen Ermässigung. para la tercera edad.
The Spanish sentence should look like this: Hay descuentos en los viajes para la tercera edad.
The German sentence should look like this: Für Senioren gibt es bei Reisen Ermässigung.
Ideally, the two example sentences should be attached to the "base" word, i.e. to "la tercera edad" / "die Senioren" in my example above. For "la tercera edad", there should be four columns. Sometimes there are no example sentences; in that case, I just need two columns.
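To make the target format concrete, here is a minimal sketch of the CSV step using only the standard `csv` module. The column order (Spanish word, German word, Spanish sentence, German sentence), the function name `entries_to_csv` and the second entry ("el viaje" / "die Reise") are assumptions made up for illustration; only the first entry comes from the example above.

```python
import csv
import io


def entries_to_csv(entries):
    """Serialize entries to CSV text; each entry has two or four fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for entry in entries:
        # Rows are written as given: four fields with example sentences,
        # two fields without them.
        writer.writerow(entry)
    return buffer.getvalue()


# Assumed column order: Spanish, German, Spanish example, German example.
entries = [
    (
        "la tercera edad",
        "die Senioren",
        "Hay descuentos en los viajes para la tercera edad.",
        "Für Senioren gibt es bei Reisen Ermässigung.",
    ),
    ("el viaje", "die Reise"),  # made-up entry without example sentences
]

csv_text = entries_to_csv(entries)
print(csv_text)
```

To write a real file instead, the same `csv.writer` can be given a file opened with `open("vocab.csv", "w", newline="", encoding="utf-8")` (the file name is a placeholder). If a spreadsheet tool complains about rows of different lengths, padding the short rows with two empty strings is a possible alternative.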
Here's what I've done:
import pdfplumber

# Open the PDF and extract the text of a single page (index 23)
with pdfplumber.open('spanish.pdf') as pdf:
    page = pdf.pages[23]
    text = page.extract_text()

# Read each line
for line in text.split('\n'):
    print(line)
Printing line outputs, among others, the following:
Für Senioren gibt es bei Hay descuentos en los viajes
Reisen Ermässigung. para la tercera edad.
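The output above suggests that the left and right columns are merged because the text is read row by row. One possible direction, sketched here under assumptions, is to work with word positions instead: pdfplumber's `page.extract_words()` returns one dict per word with keys such as `text`, `x0` and `top`, so words can be sorted into a left and a right column by their horizontal position. The function `split_columns`, the helper `fake_words`, the boundary value and all coordinates below are hypothetical; the real boundary would have to be read off the actual PDF.

```python
from collections import defaultdict


def split_columns(words, boundary_x, line_tolerance=3):
    """Return (left_lines, right_lines) for word dicts with 'text', 'x0', 'top'."""
    columns = {"left": defaultdict(list), "right": defaultdict(list)}
    for word in words:
        # Words whose 'top' values fall into the same bucket count as one visual line.
        line_key = round(word["top"] / line_tolerance)
        side = "left" if word["x0"] < boundary_x else "right"
        columns[side][line_key].append(word)

    def to_lines(column):
        # Order lines top to bottom and the words in each line left to right.
        return [
            " ".join(w["text"] for w in sorted(line_words, key=lambda w: w["x0"]))
            for _, line_words in sorted(column.items())
        ]

    return to_lines(columns["left"]), to_lines(columns["right"])


def fake_words(text, x_start, top):
    # Demo helper: mimics the shape of page.extract_words() with made-up coordinates.
    return [
        {"text": token, "x0": x_start + 40 * i, "top": top}
        for i, token in enumerate(text.split())
    ]


words = (
    fake_words("Für Senioren gibt es bei", 50, 100)
    + fake_words("Hay descuentos en los viajes", 300, 100)
    + fake_words("Reisen Ermässigung.", 50, 112)
    + fake_words("para la tercera edad.", 300, 112)
)

left_lines, right_lines = split_columns(words, boundary_x=280)
left_sentence = " ".join(left_lines)
right_sentence = " ".join(right_lines)
print(left_sentence)
print(right_sentence)
```

With a real page, `words` would come from `page.extract_words()` and `boundary_x` could be something like `page.width / 2` if the columns are split in the middle, which is an assumption about the layout. Deciding which lines belong to a vocabulary entry and which to an example sentence is not covered by this sketch.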
Maybe there's a way to do it with tabula-py? I'd appreciate any help.
Cheers.