NBoilerPipe is a Mono port of the BoilerPipe Java library. I've managed to get it working in .NET 4 without too much trouble (a few library references needed fixing). However, searching through the code, I cannot find any 'hooks' for HTML output. For example, the GetText() method takes a single parameter, the input, and I cannot see any other relevant methods. How can I get HTML output from NBoilerPipe?

Here is the sample NBoilerPipe code that is working in .NET 4:

        // Download the page as a UTF-8 string.
        String url = "http:// <etc> ";
        String page = String.Empty;
        WebRequest request = WebRequest.Create (url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse ())
        using (Stream stream = response.GetResponseStream ())
        using (StreamReader streamReader = new StreamReader (stream, Encoding.UTF8)) {
            page = streamReader.ReadToEnd ();
        }
        // Extract the main article content as plain text.
        String text = ArticleExtractor.INSTANCE.GetText (page);
        Console.WriteLine ("Text: \n" + text);
winwaed
  • Isn't the purpose of NBoilerPipe to extract the text from html? I'm not sure I understand what you are trying to do. – happy coder Jan 20 '13 at 07:32
  • Boilerpipe extracts the content from the page, filtering out the 'boilerplate': things like headers, footers, menus, advertising, etc. The original BoilerPipe can return the content as HTML fragments, or filter it further to give plain text. The HTML fragments are useful because they include things like p tags. – winwaed Jan 21 '13 at 12:55

2 Answers

I had the same issue. I managed to solve it by using the following.

http://boilerpipe-web.appspot.com/

Mifla
  • Note that [link-only answers](http://meta.stackoverflow.com/tags/link-only-answers/info) are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference. – kleopatra Oct 08 '13 at 11:28
  • Thanks for the reply. The above link is a free request-limited web service to the Java library. Only suitable for home experimentation imho. – winwaed Oct 08 '13 at 13:34

I know this is an old question, I'm not familiar with .NET (though it looks like Java to me), and I'm not an expert programmer by any means, but I think this may help others with a similar question.

The GetText method you're calling on INSTANCE returns only plain text. To get HTML in the Java library, you need a BoilerpipeExtractor and an HTMLHighlighter; the highlighter's process method then returns what you're looking for.

final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();

The .newExtractingInstance() is the one that gives you just the relevant HTML. The other option is .newHighlightingInstance(), which highlights the main text but keeps the whole HTML document intact.
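
To make the difference between the two modes concrete, here is a toy model in plain Java. It is only a sketch of the idea, not how boilerpipe is implemented: the `HighlightModes` and `Block` names, the `extract` and `highlight` methods, and the `class="content"` wrapper are all made up for illustration, and the page is assumed to be already split into blocks labelled as content or boilerplate.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT boilerpipe's implementation; every name here is hypothetical.
final class HighlightModes {

    // One block of a page, already classified as main content or boilerplate.
    static final class Block {
        final String html;
        final boolean isContent;

        Block(String html, boolean isContent) {
            this.html = html;
            this.isContent = isContent;
        }
    }

    // "Extracting" mode: keep only the content blocks and drop everything else.
    static String extract(List<Block> blocks) {
        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                out.append(b.html);
            }
        }
        return out.toString();
    }

    // "Highlighting" mode: keep every block, but wrap content blocks in a marker element.
    static String highlight(List<Block> blocks) {
        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                out.append("<div class=\"content\">").append(b.html).append("</div>");
            } else {
                out.append(b.html);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
            new Block("<div>Menu</div>", false),
            new Block("<p>Article text.</p>", true),
            new Block("<div>Footer</div>", false));

        System.out.println(extract(page));   // only the article paragraph
        System.out.println(highlight(page)); // whole page, article paragraph wrapped
    }
}
```

In this model the extracting output contains only the article paragraph, while the highlighting output still contains the menu and footer around the marked paragraph.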

All you need to do after that is to call the HTMLHighlighter's process method:

// url is a java.net.URL; process fetches the page and returns the resulting HTML as a String
System.out.println(hh.process(url, extractor));

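process is overloaded: besides the (URL, BoilerpipeExtractor) form, there are process(TextDocument doc, InputSource is) and process(TextDocument doc, String origHTML). The last one should be useful if you have already downloaded the page into a string.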

Look through the source code in the GitHub repo. There are notes on what everything does. I looked for the Javadocs, but I can't find them anymore.

The HTMLHighlightDemo class in the same repo demonstrates pretty much exactly this.

kGdmioT