I am trying to extract text from a PDF using the Python Tika library. The library picks up the text in the sequence I want, but it cannot handle vertically aligned text.
For example, the word "Values", printed vertically, is read as:
V
al
ue
s
There are many other instances where vertical text is not parsed correctly. I tried other libraries such as pypdf2, pdfminer3, and pdfplumber, but with most of them the extracted text comes out in the wrong order. Tika gave the best result.
Any ideas on how this can be fixed? I used the simplest possible code, as follows:
from tika import parser
file_data = parser.from_file("sample.pdf")
text = file_data['content']
print(text)
Are there any other optional parameters I should be aware of?
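For reference, the only workaround I can think of is post-processing the extracted string. The sketch below is not a Tika feature: `merge_vertical_fragments` is a hypothetical helper, and the thresholds `max_len=2` and `min_run=3` are assumptions based on the fragments shown above. It joins any run of at least three consecutive very short lines into one word.

```python
def merge_vertical_fragments(text, max_len=2, min_run=3):
    """Join runs of very short consecutive lines into single words.

    Heuristic only: max_len and min_run are assumed thresholds,
    not values taken from Tika.
    """
    merged, run = [], []

    def flush():
        # A long enough run of short lines is treated as one vertical word;
        # shorter runs are put back unchanged (apart from stripped whitespace).
        if len(run) >= min_run:
            merged.append("".join(run))
        else:
            merged.extend(run)
        run.clear()

    for line in text.splitlines():
        piece = line.strip()
        if piece and len(piece) <= max_len:
            run.append(piece)
        else:
            flush()
            merged.append(line)
    flush()
    return "\n".join(merged)


sample = "Header\nV\nal\nue\ns\nNext paragraph"
print(merge_vertical_fragments(sample))
```

This would also merge legitimately short lines (for example a column of single digits), and it does not cope with blank lines between fragments, so I would prefer a parser-level option if one exists.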