The source code of the MailContentHandler has this:

try {
    BodyContentHandler bch = new BodyContentHandler(handler);
    parser.parse(is, new EmbeddedContentHandler(bch), submd, context);

I would like to read the body content as a string at this point and add some metadata if anything is found, matched, or generated. However, I don't seem to be able to get the text by calling toString on the BodyContentHandler object.

If anyone is familiar with Tika and with creating or altering the existing parsers, please point me in the right direction.

Chris
  • What's the `handler` object you're passing in? And can't you get the body out from that? – Gagravarr Apr 11 '13 at 15:09
  • Here is the source code I have been modifying: http://svn.apache.org/repos/asf/tika/branches/1.2/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java I have established that it has to do with the BodyContentHandler constructor taking a handler. I just don't know how to get the body text, which I require – Chris Apr 11 '13 at 15:13
  • And to answer your question, the handler object is an XHTMLContentHandler – Chris Apr 11 '13 at 15:17

1 Answer

BodyContentHandler is a decorating ContentHandler, as detailed in the javadocs. All it does is filter the SAX events, so that the downstream handler only receives the body contents. However, if you create it without any arguments, it'll internally create a WriteOutContentHandler for you with a limit of 100,000 characters.
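Since the question is about looking at the body text while the existing handler still receives it, here is a minimal sketch of the same decorating idea using only JDK SAX classes. `CapturingHandler` is a hypothetical name, not a Tika class: it forwards every event to a downstream handler and keeps its own copy of the character data. The anonymous downstream handler merely stands in for whatever handler you already have.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class CapturingHandlerDemo {

    /** Hypothetical decorator: forwards all SAX events and copies the text as it passes. */
    static class CapturingHandler extends XMLFilterImpl {
        private final StringBuilder text = new StringBuilder();

        CapturingHandler(ContentHandler downstream) {
            // XMLFilterImpl forwards each event to this handler.
            setContentHandler(downstream);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            text.append(ch, start, length);       // keep a copy for later inspection
            super.characters(ch, start, length);  // still pass the event downstream
        }

        String capturedText() {
            return text.toString();
        }
    }

    /** Feeds a tiny event stream through the decorator; returns {captured, seen downstream}. */
    static String[] demo() throws SAXException {
        final StringBuilder seenDownstream = new StringBuilder();
        ContentHandler downstream = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seenDownstream.append(ch, start, length);
            }
        };

        CapturingHandler capture = new CapturingHandler(downstream);
        char[] body = "Hello from the body".toCharArray();
        capture.startDocument();
        capture.startElement("", "p", "p", new AttributesImpl());
        capture.characters(body, 0, body.length);
        capture.endElement("", "p", "p");
        capture.endDocument();

        return new String[] { capture.capturedText(), seenDownstream.toString() };
    }

    public static void main(String[] args) throws SAXException {
        String[] result = demo();
        System.out.println("captured:   " + result[0]);
        System.out.println("downstream: " + result[1]);
    }
}
```

In a modified MailContentHandler, one could presumably wrap the existing handler along the lines of `new BodyContentHandler(new CapturingHandler(handler))` and read `capturedText()` after the parse call returns, but that wiring is an assumption and untested here. Tika also ships a TeeContentHandler in org.apache.tika.sax that may cover the same need; check the javadocs for your version.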

To get the body itself, you'll need to get it from whatever handler you passed to BodyContentHandler. If you just want the plain text, and won't hit the default character limit, go for something like:

BodyContentHandler bch = new BodyContentHandler();
parser.parse(is, bch, metadata, new ParseContext());
String plainText = bch.toString();

If you want to get the XHTML of the body, you'll want something more like:

StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
             SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));

BodyContentHandler bch = new BodyContentHandler(handler);

parser.parse(is, bch, metadata, new ParseContext());

String xhtml = sw.toString();
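
To try the serialising half of this without Tika on the classpath, the following sketch drives the same TransformerHandler setup with a few hand-written SAX events. The class name `XhtmlSerializerDemo` and the `<p>` element are made up for illustration, and the extra `OMIT_XML_DECLARATION` setting is my own addition to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class XhtmlSerializerDemo {

    /** Serialises a single <p> element containing the given text via a TransformerHandler. */
    static String serialize(String text) throws Exception {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        // Assumption for this demo only: drop the <?xml ...?> header.
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(sw));

        // Hand-written events stand in for what a parser would emit.
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("Hello"));
    }
}
```

The StringWriter is only complete once `endDocument()` has been called, which is why `sw.toString()` is read after the parse in the snippet above.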
Gagravarr
  • So, I don't want to change the existing MailContentHandler functionality; I just want to look at the content during the body callback and add extra data to the metadata object if necessary. – Chris Apr 11 '13 at 15:30
  • Why not pass in your own Parser class, and have that wrap the real parser and monkey with the metadata there? – Gagravarr Apr 11 '13 at 15:32
  • Could you point me to an example of what you're suggesting? I'm not sure how to do that, or how to get the body/contents of the XHTMLContentHandler – Chris Apr 11 '13 at 15:43
  • Parser wrappers would warrant a new question, as it's a very different thing to playing with the content handlers – Gagravarr Apr 11 '13 at 15:55