Retain boilerplate using boilerpipe

Question

I am using the boilerpipe library to analyze news articles. These articles contain a lot of boilerplate, such as copyright information, a side pane of related articles, etc. Boilerpipe removes all of that. Is it possible to get the boilerplate back instead? I need to analyze the copyright statement and similar sections and extract some information from them.

Also, does it provide some sort of confidence score for each text block indicating whether it is boilerplate or not?

Thanks.

score 1 · Answer 1 · answered Oct 21 '13 at 08:57

You can get the entire text, or traverse the individual text blocks, by using the document classes that boilerpipe provides:

final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
// doc.getText(true, true) returns all the text, content and non-content alike
// doc.getTextBlocks() returns the text blocks so you can traverse the document
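
To keep only the boilerplate, one option is to run an extractor over the document and then collect the blocks it did not mark as content. The following is a minimal sketch, assuming boilerpipe 1.2.x (the de.l3s.boilerpipe packages) is on the classpath; the class name BoilerplateOnly and the helper boilerplateBlocks are hypothetical names chosen for illustration, and ArticleExtractor is just one of the available extractors.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerplateOnly {

    // Hypothetical helper: returns the text of every block that is
    // NOT flagged as content, i.e. what the extractor treats as boilerplate.
    static List<String> boilerplateBlocks(TextDocument doc) {
        List<String> result = new ArrayList<>();
        for (TextBlock block : doc.getTextBlocks()) {
            if (!block.isContent()) {
                result.add(block.getText());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The URL is taken from the command line (an assumption of this sketch).
        HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(args[0]));
        TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();

        // Classify the blocks; this sets the content flag on each TextBlock.
        ArticleExtractor.INSTANCE.process(doc);

        for (String text : boilerplateBlocks(doc)) {
            System.out.println(text);
        }
    }
}
```

After the extractor has run, doc.getText(false, true) should give the same non-content text as a single string. As far as I know there is no per-block confidence value, but each TextBlock exposes features such as getTextDensity() and getLinkDensity() that you could inspect yourself.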
