
I'm using Apache Tika to convert some MS Word documents to HTML (as a String). The problem is that some documents contain special characters (e.g. mathematical operators). Is there any way to handle them? Thank you for your help.

Input: (screenshot missing)

Output: (screenshot missing)

Source Code

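// newInstance() returns a TransformerFactory, so cast it to the SAX variant
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();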
TransformerHandler handler = null;

try {
  handler = factory.newTransformerHandler();
} catch (TransformerConfigurationException e) {
   // Pass the exception to the logger directly; the format string had no placeholder for it
   logger.warn("SAX processing is not available", e);
   return;
}

handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(output)); // StringWriter output
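
To narrow down where the characters get lost, one option is to feed the same kind of handler a few SAX events by hand, with no Tika and no Word document involved. The sketch below is only a diagnostic under assumptions: the class name `SerializerCheck`, the helper `serialize`, and the sample text are hypothetical and not part of my real code. If the mathematical operators come out intact here, the serializer setup above is probably not what is dropping them, and the Word parsing step is the place to look.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class SerializerCheck {

    // Hypothetical helper: serializes a single <p> element containing `text`
    // through the same TransformerHandler configuration used in the question.
    static String serialize(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        StringWriter output = new StringWriter();
        handler.setResult(new StreamResult(output));

        // Emit the SAX events a parser would normally produce.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();

        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample text with a few characters from the Mathematical Operators block.
        System.out.println(serialize("x \u2264 y, \u2211 \u221E"));
    }
}
```

Note that the printed result also depends on the console encoding, so comparing the returned String in a debugger or a test is more reliable than reading the terminal output.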
Peter Jurkovic
