BodyContentHandler is a decorating ContentHandler, as detailed in the javadocs. All it does is filter the SAX events, so that the downstream handler only receives the body contents. However, if you create it without any arguments, it'll internally create a WriteOutContentHandler for you with a 100,000 character write limit.
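To make the "decorating handler that filters SAX events" idea concrete, here is a minimal sketch that uses only the JDK's SAX classes. It is not Tika's implementation: the class name BodyOnlyFilterSketch and the helper extractBodyText are made up for illustration, and it assumes a non-namespace-aware parser that reports the element as a plain "body" qName.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the idea behind BodyContentHandler (not Tika code):
// a decorator that forwards only the events found inside <body>.
class BodyOnlyFilterSketch extends DefaultHandler {
    private final DefaultHandler downstream; // receives the filtered events
    private int bodyDepth = 0;               // > 0 while inside <body>

    BodyOnlyFilterSketch(DefaultHandler downstream) {
        this.downstream = downstream;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (bodyDepth > 0) {
            // Already inside <body>: pass the element on and track nesting.
            downstream.startElement(uri, localName, qName, atts);
            bodyDepth++;
        } else if ("body".equals(qName)) {
            // Entering <body>: the <body> element itself is not forwarded.
            bodyDepth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (bodyDepth > 1) {
            downstream.endElement(uri, localName, qName);
        }
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            downstream.characters(ch, start, length);
        }
    }

    // Parses an XHTML string and returns only the text found inside <body>.
    static String extractBodyText(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new BodyOnlyFilterSketch(collector));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractBodyText(
                "<html><head><title>Ignored</title></head><body><p>Hello</p></body></html>"));
    }
}
```

The filter itself never stores anything. Whatever you want back has to come from the downstream handler, which is the same division of labour as in Tika.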
To get the body contents back out, you'll need to ask whichever handler you passed to BodyContentHandler. If you just want the plain text, and won't hit the default character limit (passing a write limit of -1 to the constructor disables it), go for something like:
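// No-arg constructor: the text is collected internally, up to the default limit
BodyContentHandler bch = new BodyContentHandler();
parser.parse(is, bch, metadata, new ParseContext());
// toString() returns the plain text gathered by the internal handler
String plainText = bch.toString();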
If you want to get the XHTML of the body, you'll want something more like:
// Serialise the SAX events as XML into a StringWriter
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
// Wrap the serialising handler so that it only sees the body events
BodyContentHandler bch = new BodyContentHandler(handler);
parser.parse(is, bch, metadata, new ParseContext());
// The markup ends up in the writer, not in the BodyContentHandler
String xhtml = sw.toString();
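If it helps to see what the TransformerHandler does on its own, the sketch below feeds it a few SAX events by hand, using only JDK classes and no Tika. The class and method names are made up for illustration, and OMIT_XML_DECLARATION is an extra setting added here just to keep the output short; in the snippet above, BodyContentHandler would be the one supplying the events.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class (not part of Tika): shows a TransformerHandler
// turning SAX events into markup held in a StringWriter.
class TransformerHandlerSketch {

    // Emits a single <p> element containing the given text and returns the markup.
    static String serializeParagraph(String text) throws Exception {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        // Assumption for this demo only: drop the <?xml ...?> header.
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(sw));

        // Hand-written SAX events standing in for what a parser would produce.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeParagraph("Hello"));
    }
}
```

The result is read from the StringWriter because the TransformerHandler, like BodyContentHandler, only passes events along and writes them to whatever Result it was given.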