How to decode special characters with Apache Tika

Asked Aug 27 '13 at 21:24

Active Sep 02 '13 at 22:03

Viewed 718 times

I'm using Apache Tika to convert some MS Word documents to HTML (as a String). The problem is that some documents contain special characters (e.g. Mathematical Operators) that are not decoded correctly. Is there any way to solve this? Thank you for your help.

Input: [image]

Output: [image]

Source Code

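// newInstance() returns a TransformerFactory, so cast to the SAX-capable subtype
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = null;

try {
  handler = factory.newTransformerHandler();
} catch (TransformerConfigurationException e) {
  // %s is needed so the exception actually appears in the message
  logger.warn(String.format("SAX processing is not available: %s", e));
  return;
}

// Serialize the SAX events as indented, UTF-8 encoded XML
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(output)); // StringWriter output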
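
One way to narrow the problem down is to check whether the serializer setup by itself preserves a mathematical operator. The sketch below uses only JDK classes (no Tika) and pushes a single paragraph containing U+2264 (less-than or equal to) through a handler configured as above. The class name SerializerCheck and the helper serialize are hypothetical names for illustration, not part of the question's code.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SerializerCheck {

    // Hypothetical helper: sends one <p> element with the given text through
    // a TransformerHandler and returns the serialized markup.
    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            StringWriter output = new StringWriter();
            handler.setResult(new StreamResult(output));

            // Emit the SAX events a parser would normally produce.
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return output.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("SAX processing failed", e);
        }
    }

    public static void main(String[] args) {
        String result = serialize("a \u2264 b");
        // True if the operator survived as a literal character.
        System.out.println(result.contains("\u2264"));
        System.out.println(result);
    }
}
```

If the operator survives here (literally or as a numeric character reference), the transformer configuration is probably not what corrupts the output, and the characters are more likely altered earlier, during extraction from the Word document. That is an inference from this sketch, not something the question confirms.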

Peter Jurkovic

Have you tried using the same font to display the HTML as Microsoft uses in Office when encoding the text with non-standard glyphs? – Gagravarr Aug 27 '13 at 23:09
Yes, the Office document uses a common font, Arial. – Peter Jurkovic Aug 28 '13 at 06:21
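
To follow up on the question about non-standard glyphs: one possible cause, which the thread does not confirm, is that the symbols are stored as Private Use Area code points (Word commonly maps Symbol-font characters into the U+F000 range), which only render correctly with the original font. A minimal sketch for checking an extracted string, using a hypothetical class PrivateUseScan and JDK classes only:

```java
import java.util.ArrayList;
import java.util.List;

class PrivateUseScan {

    // Hypothetical helper: lists every code point in the text that Unicode
    // classifies as private use, formatted as U+XXXX.
    static List<String> privateUseCodePoints(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // \uF0A3 stands in for a symbol-font character; \u2264 is the
        // standard Unicode "less-than or equal to" operator.
        System.out.println(privateUseCodePoints("x \uF0A3 y"));
        System.out.println(privateUseCodePoints("x \u2264 y"));
    }
}
```

If the scan reports any code points for the HTML string produced by Tika, the characters would need to be mapped to their standard Unicode equivalents or displayed with the original symbol font; if it reports none, this explanation does not apply.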

0 Answers