How to get the main content of an article from HTML using boilerpipe?

Question

I am trying to extract the main content of an article from an HTML page using boilerpipe.

I downloaded the latest jars from here.

I am trying to use the following code:

String article = "";
try {
    article = ArticleExtractor.INSTANCE.getText(url);
    System.out.println("Article ++++ >>" + article);
} catch (BoilerpipeProcessingException e) {
    // Extraction failed; print the stack trace for debugging
    e.printStackTrace();
}

But this returns an empty string for every URL. Can anyone help me with this?

In order to [ask a good question](http://stackoverflow.com/help/how-to-ask) you should include that information, and the `url` you're querying, in the description of your problem to create [a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). — Markus Mitterauer, Oct 10 '16 at 07:00
Have you tried to pass the HTML itself instead of the url? Or maybe there is a problem with the way your url strings are formatted. Can you show us some examples of url strings you tried? — Luca Angioloni, Oct 10 '16 at 07:08
Nobody can answer this question because we don't know the input. And even then all we could do is debug it ourselves. If you have access to boilerpipe's sources, that's what you should do. — f1sh, Oct 10 '16 at 07:13
@LucaAngioloni Yes, you are right. Now it works. Can you post that as an answer. I would accept it. — Pritam Banerjee, Oct 10 '16 at 07:14

score 2 · Accepted Answer · answered Oct 10 '16 at 07:18

Have you tried to pass the HTML itself instead of the url? Or maybe there is a problem with the way your url strings are formatted.
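
One possible explanation, assuming `url` in the question was a `String` holding the page address: boilerpipe's `getText` has overloads for both a `String` of HTML and a `java.net.URL`, so a `String` address would be parsed as if it were markup, which leaves nothing to extract. A minimal sketch of downloading the markup first and passing that instead is shown below. The class `HtmlLoader` and its methods are hypothetical helpers written for illustration, UTF-8 is assumed as the page encoding, and the boilerpipe call is left as a comment so the sketch compiles without the jars.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of boilerpipe.
class HtmlLoader {

    // Reads everything the reader provides into a single String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Downloads the page at the given address and returns its markup.
    // Assumption: the page is UTF-8 encoded; real pages may declare another charset.
    static String fetchHtml(String address) throws IOException {
        URL url = URI.create(address).toURL();
        try (Reader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = fetchHtml(args[0]);
        // With the boilerpipe jars on the classpath, pass the markup, not the address:
        // String article = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(html.length() + " characters of HTML downloaded");
    }
}
```

Alternatively, converting the address to a `java.net.URL` and passing that object to `getText` lets boilerpipe fetch the page itself.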

Luca Angioloni
