
Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.

Any idea (not necessarily Tika-based) how to extract PDF text while converting ligatures to their separate characters?

File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);
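If the extractor emits the actual ligature codepoints (e.g. U+FB00 "ff") rather than U+FFFD replacement characters, they can be decomposed after extraction with Unicode compatibility normalization (NFKC) from the standard library. This is a minimal sketch of that post-processing step, independent of Tika; the sample string is invented for illustration:

```java
import java.text.Normalizer;

public class LigatureDemo {
    public static void main(String[] args) {
        // U+FB00 is the "ff" ligature, U+FB03 is the "ffi" ligature
        String withLigatures = "di\uFB00\u00E9rentes e\uFB03cace";

        // NFKC replaces compatibility characters such as ligatures
        // with their plain-letter equivalents
        String normalized = Normalizer.normalize(withLigatures, Normalizer.Form.NFKC);

        System.out.println(normalized); // différentes efficace
    }
}
```

Note that this cannot recover characters the parser has already replaced with U+FFFD; it only helps when the ligature glyphs survive extraction.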

Edit

My PDF file is UTF-8 encoded (that's what `InputStreamReader.getEncoding()` reports), and my platform encoding is also UTF-8. Even with `-Dfile.encoding=UTF8`, it doesn't work.

For instance, I expect to get "différentes implémentations", but what I actually get is "di��erentes impl�ementations".

Spadon_
  • Is there a chance that those characters are not in your working charset? – AlexR Mar 12 '14 at 10:37
  • It seems OK; however, a Tika changelog says: "Invalid characters are now replaced with the Unicode replacement character (U+FFFD)", i.e., the question marks I'm seeing. I tried the same operation with Snowtide's PDFTextStream and those ligatures are replaced with spaces instead. – Spadon_ Mar 12 '14 at 10:55
  • What are you doing with your `text` object after parsing it? If you output it anywhere, you need to ensure that that output is in the right encoding, and whatever you display it with supports those codepoints! – Gagravarr Mar 12 '14 at 19:23
  • I convert my String into a JSONObject (in order to use it as a POST request for Elasticsearch's indexing). FYI edit: the detected encoding is UTF-8; my platform encoding is UTF-8. – Spadon_ Mar 14 '14 at 13:26
  • And btw, I have the same issue with U+0065 & U+0301 combined char that gives "é". I don't know if it helps, but this PDF file was originally written in LaTeX and encoded with MiKTeX-xdvipdfmx (0.7.8) – Spadon_ Mar 14 '14 at 13:36
  • I decided to use node-tika npm package instead. It works. – Spadon_ Mar 27 '14 at 15:13
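The combining-character case mentioned in the comments (U+0065 "e" followed by U+0301, the combining acute accent) is a separate issue from ligatures: the text is valid Unicode, just in decomposed form. Canonical composition (NFC) folds such pairs into single precomposed characters. A minimal sketch, again assuming the codepoints survive extraction:

```java
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        // "e" + combining acute accent, as produced by some LaTeX-to-PDF toolchains
        String decomposed = "e\u0301";

        // NFC composes base letter + combining mark into the precomposed form U+00E9 ("é")
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(composed.equals("\u00E9")); // true
    }
}
```

Using NFKC instead of NFC handles both cases at once: it applies canonical composition and also decomposes compatibility characters such as ligatures.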
