Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app]. Tabula-py and its read_pdf function is a popular solution to extract the tables, and it did detect the columns ideally without any fine-tuning. But, it only detected the columns well and had difficulties with the multi-line rows, splitting each line into a different row.
E.g., in the PDF you will have columns 2 and 3. The table on Stackoverflow doesn't seem to allow multi-line content either, so I added row numbers. Just merge the row 1 in your head.
Row number | German | Latin |
---|---|---|
1 | First word | Translation for first word |
1 | with many lines of content | [phonetic vocabulary thingy] |
1 | and more lines | |
2 | Second word | Translation for second word |
Instead of fine-tuning the read_pdf parameters, are there ways around that?