I am trying to display .msg files (i.e. Outlook emails) in my web application using JSP. I am using the parser at http://auxilii.com/msgparser/, which extracts the body of the email. The body is stored as RTF (I haven't checked whether that is always the case or only sometimes).

The parser itself comes with two RTF-to-HTML converters: SimpleRTF2HTMLConverter (which doesn't work at all for me) and JEditorPaneRTF2HTMLConverter (which works, but doesn't convert Hebrew text properly and just displays question marks).

Is there any way of tweaking the JEditorPaneRTF2HTMLConverter code (reproduced below) so that it handles Unicode in general (and Hebrew specifically)?

    import java.io.StringReader;
    import java.io.StringWriter;
    import java.io.Writer;
    import javax.swing.JEditorPane;
    import javax.swing.text.EditorKit;

    // RTF2HTMLConverter is the interface supplied by the msgparser library.
    public class JEditorPaneRTF2HTMLConverter implements RTF2HTMLConverter {

        public String rtf2html(String rtf) throws Exception {
            JEditorPane p = new JEditorPane();
            p.setContentType("text/rtf");
            EditorKit kitRtf = p.getEditorKitForContentType("text/rtf");
            try {
                // Parse the RTF into the pane's document.
                StringReader rtfReader = new StringReader(rtf);
                kitRtf.read(rtfReader, p.getDocument(), 0);
                kitRtf = null;
                // Serialize the same document back out as HTML.
                EditorKit kitHtml = p.getEditorKitForContentType("text/html");
                Writer writer = new StringWriter();
                kitHtml.write(writer, p.getDocument(), 0, p.getDocument().getLength());
                return writer.toString();
            } catch (Exception e) {
                throw new Exception("Could not convert RTF to HTML.", e);
            }
        }

    }

As an example, the original email contains a telephone number. Note the two Hebrew letters, which are an abbreviation for טלפון (telephone):

טל: 02-9999999

In the RTF that is input to this function, it looks like this:

\pard\qr\plain{\f3\rtlch\lang13\cf2\fs20 \'E8\'EC': 02-9999999}\par

In the HTML that is output from this function, it looks like this:

<p class=default>
      <span style="color: #808080; font-size: 10pt; font-family: Arial">
        鬧: 02-9999999
      </span>
      <span style="color: #000000; font-size: 12pt; font-family: Times New Roman">

      </span>
    </p>

The character that appears as 鬧 here on Stack Overflow appears in Notepad++ as xE8xEC (in inverted characters), whereas in my web application it is rendered as ??. [Note: Hebrew is displayed correctly in my application if I take just the body of the email without the formatting.]
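To make the expected result concrete, here is a minimal sketch of what I would expect the two `\'xx` escapes to decode to. It assumes the RTF's code page is windows-1255 (Hebrew); the fragment above does not show an `\ansicpg` declaration, so that is a guess on my part. The class and method names are made up for illustration and are not part of msgparser or Swing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class RtfHexEscapeDemo {

    // Decodes runs of RTF \'xx hex escapes as bytes in the given code page.
    // Everything else is copied through unchanged. This is a simplification:
    // it does not treat an escaped backslash (\\) specially and it ignores
    // per-font charsets.
    public static String decodeHexEscapes(String rtf, Charset codePage) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtf.length()) {
            if (isHexEscape(rtf, i)) {
                // Collect consecutive escaped bytes so they are decoded together.
                run.write(Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(run, out, codePage);
                out.append(rtf.charAt(i));
                i++;
            }
        }
        flush(run, out, codePage);
        return out.toString();
    }

    // True if the text at position i has the form \'hh (two hex digits).
    private static boolean isHexEscape(String s, int i) {
        return i + 3 < s.length()
                && s.charAt(i) == '\\'
                && s.charAt(i + 1) == '\''
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    // Decodes any pending bytes with the code page and appends the result.
    private static void flush(ByteArrayOutputStream run, StringBuilder out, Charset codePage) {
        if (run.size() > 0) {
            out.append(new String(run.toByteArray(), codePage));
            run.reset();
        }
    }

    public static void main(String[] args) {
        // Assumption: the message uses the Hebrew Windows code page.
        Charset hebrew = Charset.forName("windows-1255");
        String decoded = decodeHexEscapes("\\'E8\\'EC: 02-9999999", hebrew);
        // Print the code points of the first two characters.
        for (char c : decoded.substring(0, 2).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Under that assumption, the two bytes E8 and EC come out as U+05D8 and U+05DC, i.e. טל, which is what the original email shows. So the bytes in the RTF look right, and the loss seems to happen when they are interpreted in some other character set during conversion.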

gordon613
  • Have you checked that `String rtf` does not already contain question marks for the Hebrew characters? – vanOekel Jan 14 '14 at 14:44
  • Thank you vanOekel for your comment. It does not, and I have updated my question accordingly. – gordon613 Jan 14 '14 at 15:10
  • The ?? is a symptom of bad character set conversion, i.e. the bytes read by your web-application do not match with a character in the character set used by your web-application. Check that the character set in the HTML (e.g. in header ``) matches with how you write (e.g. which character set you use) the String returned by `rtf2html`. – vanOekel Jan 14 '14 at 15:32
  • It seems to be OK. And anyway it is only the Hebrew which is returned from this function which is displayed incorrectly on the web page. The Hebrew which is taken as plain text (i.e. without the formatting) is displayed correctly on the web page. – gordon613 Jan 14 '14 at 17:34
  • I wonder whether the issue is changing the character encoding of `JEditorPane` (see http://docs.oracle.com/javase/7/docs/api/javax/swing/JEditorPane.html#setContentType(java.lang.String) ) but I wasn't successful – gordon613 Jan 14 '14 at 17:35
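
To illustrate the `??` symptom described in the comments: a small sketch showing that Hebrew characters turn into question marks when a Java `String` is encoded with a character set that cannot represent them. The class and method names are hypothetical, and the choice of ISO-8859-1 as the mismatched charset is only an example, not a claim about what the web application actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String hebrew = "\u05D8\u05DC"; // the two letters from the example

        // ISO-8859-1 has no Hebrew letters, so each one becomes '?'.
        System.out.println(roundTrip(hebrew, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println(roundTrip(hebrew, StandardCharsets.UTF_8).equals(hebrew));
    }
}
```

If the HTML returned by `rtf2html` already contains the wrong characters, this encoding step is not the cause, but it does explain why the page shows exactly two question marks for two letters.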
