I'm using Apache Tika to convert some MS Word documents to HTML (as a String). The problem is that some documents contain special characters (e.g. mathematical operators). Is there any way to solve this? Thank you for your help.
Input:
Output:
Source code:
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = null;
try {
handler = factory.newTransformerHandler();
} catch (TransformerConfigurationException e) {
logger.warn(String.format("SAX Processing is not available: %s", e));
return;
}
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(output)); // StringWriter output
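
To narrow down where the characters get lost, it may help to test the serializer on its own, without Tika. The following is only a minimal sketch using the JDK's built-in transformer: the class name `MathCharsDemo`, the method `serialize`, and the sample string are made up for illustration. It pushes a few mathematical operators through a handler configured like the one above and returns what ends up in the `StringWriter`.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class, not part of Tika or of the original code.
public class MathCharsDemo {

    // Serializes a single <p> element containing the given text,
    // using the same output properties as the handler in the question.
    static String serialize(String text)
            throws TransformerConfigurationException, SAXException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        StringWriter output = new StringWriter();
        handler.setResult(new StreamResult(output));

        // Emit the SAX events by hand instead of letting a parser do it.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+2264 (less-than or equal), U+2211 (summation), U+221E (infinity)
        System.out.println(serialize("x \u2264 y, \u2211 a, \u221E"));
    }
}
```

If the operators survive here, the handler setup is probably not the cause, and the place to look would be the parsing step or wherever the resulting String is later written out or displayed with a different encoding.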