I am trying to display .msg files (i.e. Outlook emails) in my web application using JSP. I am using the parser from http://auxilii.com/msgparser/, which extracts the body content of the email. The body is stored as RTF (sometimes or always, I haven't checked).
The parser itself comes with two converters from RTF to HTML: SimpleRTF2HTMLConverter (which doesn't work at all for me) and JEditorPaneRTF2HTMLConverter (which works, but doesn't convert the Hebrew text properly and just displays question marks).
Is there any way of tweaking the JEditorPaneRTF2HTMLConverter code (reproduced below) to handle Unicode in general (and Hebrew specifically)?
    public class JEditorPaneRTF2HTMLConverter implements RTF2HTMLConverter {
        public String rtf2html(String rtf) throws Exception {
            JEditorPane p = new JEditorPane();
            p.setContentType("text/rtf");
            EditorKit kitRtf = p.getEditorKitForContentType("text/rtf");
            try {
                StringReader rtfReader = new StringReader(rtf);
                kitRtf.read(rtfReader, p.getDocument(), 0);
                kitRtf = null;
                EditorKit kitHtml = p.getEditorKitForContentType("text/html");
                Writer writer = new StringWriter();
                kitHtml.write(writer, p.getDocument(), 0, p.getDocument().getLength());
                return writer.toString();
            } catch (Exception e) {
                throw new Exception("Could not convert RTF to HTML.", e);
            }
        }
    }
As an example, the original email contains a telephone number. Note the two Hebrew letters, which are an abbreviation for טלפון (telephone):
טל: 02-9999999
In the RTF that is input to this function, it looks like this:
\pard\qr\plain{\f3\rtlch\lang13\cf2\fs20 \'E8\'EC': 02-9999999}\par
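For what it's worth, the two `\'E8\'EC` escapes are raw byte values, and they only become Hebrew once a code page is applied. The snippet below is a minimal sketch of that check. It assumes the code page is Windows-1255 (the RTF header is not shown here, so this is a guess), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class HexEscapeDecodeDemo {
    // Interpret raw bytes taken from RTF \'hh escapes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The two bytes from \'E8\'EC in the RTF line above.
        byte[] raw = { (byte) 0xE8, (byte) 0xEC };

        // Assumption: the message was written with the Hebrew Windows code page.
        String text = decode(raw, "windows-1255");

        // Print code points rather than glyphs, so console encoding cannot hide the result.
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Under that assumption the bytes map to U+05D8 and U+05DC, i.e. the letters ט and ל from the original email.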
In the HTML that is output from this function, it looks like this:
<p class=default>
<span style="color: #808080; font-size: 10pt; font-family: Arial">
鬧: 02-9999999
</span>
<span style="color: #000000; font-size: 12pt; font-family: Times New Roman">
</span>
</p>
The character that appears as 鬧 here on Stack Overflow appears in Notepad++ as xE8xEC (in inverted characters), whereas in my web application it is rendered as ??. (Note: Hebrew is displayed correctly in my application if I just take the body of the email without the formatting.)
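One direction I can think of is to pre-process the RTF string before handing it to `rtf2html`, replacing each run of `\'hh` escapes with RTF `\uN?` Unicode escapes. The following is only a sketch under several assumptions: the code page is Windows-1255, the Swing RTF reader honors `\uN` with the default one-character fallback (I have not verified this on every JDK), and `RtfHexEscapeRewriter` is a hypothetical helper name, not part of msgparser.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RtfHexEscapeRewriter {
    // One or more consecutive \'hh escapes, kept together so that
    // multi-byte code pages can be decoded as a single byte sequence.
    private static final Pattern HEX_RUN =
            Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    // Limitation: a literal backslash (\\) directly followed by 'hh text
    // would be misread as an escape; this sketch does not handle that case.
    static String rewrite(String rtf, Charset codePage) {
        Matcher m = HEX_RUN.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            String run = m.group();
            // Each escape is exactly four characters: \ ' h h
            byte[] bytes = new byte[run.length() / 4];
            for (int i = 0; i < bytes.length; i++) {
                String hex = run.substring(i * 4 + 2, i * 4 + 4);
                bytes[i] = (byte) Integer.parseInt(hex, 16);
            }
            for (char c : new String(bytes, codePage).toCharArray()) {
                // \uN takes a signed 16-bit decimal value; the trailing '?'
                // is the fallback character skipped by Unicode-aware readers.
                out.append("\\u").append((int) (short) c).append('?');
            }
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String rtf = "{\\f3\\rtlch \\'E8\\'EC: 02-9999999}";
        // Assumption: Hebrew Windows code page.
        System.out.println(rewrite(rtf, Charset.forName("windows-1255")));
    }
}
```

The call would then become something like `converter.rtf2html(RtfHexEscapeRewriter.rewrite(rtf, Charset.forName("windows-1255")))`. Ideally the code page would be read from the `\ansicpg` value in the RTF header rather than hard-coded.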