Tika with Python to parse PDF text is not able to handle vertical text. Any ideas?

Asked Aug 27 '20 at 09:51

Active Sep 06 '20 at 07:14

Viewed 603 times

I am trying to extract text from a PDF using the Python Tika library. The library picks up the text in the sequence I want. However, it cannot handle vertically aligned text.

For example, the word "Values", which is set vertically in the PDF, is read as:

V
al

ue
s

There are many other instances where vertical text is not parsed correctly. I tried other libraries such as pypdf2, pdfminer3 and pdfplumber, but most of them return the text in the wrong order. Tika gave the best result.

Any ideas about how this can be fixed? I used the simplest code, as follows:

from tika import parser
file_data = parser.from_file("sample.pdf")
text = file_data['content']
print(text)

Are there any other optional parameters I should be aware of?
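
One workaround I can think of that does not depend on any Tika option is to post-process the extracted string and rejoin runs of very short lines, since that is how the vertical text comes out above. The sketch below is only a heuristic under my own assumptions: the function name merge_short_fragments and the thresholds max_len and min_run are made up for illustration and are not part of Tika. It would also merge genuine runs of short lines, so the thresholds need tuning per document.

```python
def merge_short_fragments(text, max_len=2, min_run=3):
    """Rejoin runs of very short lines (hypothetical helper, not a Tika API).

    A run is a sequence of at least `min_run` non-blank lines, each at most
    `max_len` characters long, optionally separated by blank lines.
    """
    merged = []
    run = []  # buffered short fragments plus any blank lines between them

    def flush():
        # Blank lines at the end of the buffer belong after the run, not inside it.
        trailing = []
        while run and not run[-1]:
            trailing.append(run.pop())
        fragments = [piece for piece in run if piece]
        if len(fragments) >= min_run:
            # Long enough to look like broken vertical text: glue it together.
            merged.append("".join(fragments))
        else:
            # Too short to be sure: keep the buffered lines as they were.
            merged.extend(run)
        merged.extend(trailing)
        run.clear()

    for line in text.splitlines():
        piece = line.strip()
        if piece and len(piece) <= max_len:
            run.append(piece)
        elif not piece and run:
            run.append("")
        else:
            flush()
            merged.append(line)
    flush()
    return "\n".join(merged)


# The fragments from the example above, as they appear in the extracted text.
sample = "V\nal\n\nue\ns"
print(merge_short_fragments(sample))
```

With the code from the question, this would be applied as text = merge_short_fragments(file_data['content']) before printing.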

edited Sep 06 '20 at 07:14

marc_s

asked Aug 27 '20 at 09:51

user3865019
