Boilerpipe lets me extract just the article text from a web page, cleaning up all the HTML mess. But how can I extract the article's headline? One option is to use the page's title, but it is sometimes incorrect and often contains unneeded words (e.g. "title - sitename").
Another idea is to take the text between <h1> and </h1>, but I thought I would ask whether there are better solutions.
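
For reference, here is a rough sketch of the two heuristics I have in mind, in plain Java with regular expressions rather than a real HTML parser. The class name HeadlineGuess and its methods are made up for illustration and are not part of Boilerpipe. The separator list and the "longest segment is the headline" rule are assumptions that will not hold for every site, and HTML entities are not decoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Boilerpipe: guesses a headline from raw HTML.
final class HeadlineGuess {
    // First <h1>...</h1> or <title>...</title>, case-insensitive, may span lines.
    private static final Pattern H1 = Pattern.compile(
            "<h1\\b[^>]*>(.*?)</h1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Assumed separators between headline and site name: " - ", " | ", " :: ".
    private static final Pattern SEPARATOR = Pattern.compile("\\s+(?:-|\\||::)\\s+");

    // Returns the text of the first match with nested tags removed, or null.
    private static String firstMatch(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (!m.find()) {
            return null;
        }
        String text = m.group(1)
                .replaceAll("<[^>]+>", " ")  // drop nested tags such as <em>
                .replaceAll("\\s+", " ")     // collapse whitespace
                .trim();
        return text.isEmpty() ? null : text;
    }

    // Heuristic 1: text of the first <h1> element.
    static String fromH1(String html) {
        return firstMatch(H1, html);
    }

    // Heuristic 2: page title, keeping only its longest separator-delimited part
    // (assumes the headline is longer than the site name).
    static String fromTitle(String html) {
        String title = firstMatch(TITLE, html);
        if (title == null) {
            return null;
        }
        String best = "";
        for (String part : SEPARATOR.split(title)) {
            if (part.length() > best.length()) {
                best = part;
            }
        }
        return best.trim();
    }

    // Prefer the <h1> text, fall back to the cleaned-up title.
    static String guess(String html) {
        String h1 = fromH1(html);
        return h1 != null ? h1 : fromTitle(html);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Cats learn to code - Example News</title></head>"
                + "<body><h1 class=\"hl\">Cats <em>learn</em> to code</h1><p>...</p></body></html>";
        System.out.println(guess(html));
        System.out.println(fromTitle(html));
    }
}
```

Both heuristics break easily: some pages use <h1> for the site logo, and some titles put the site name first or use a separator I have not listed, which is why I am looking for something more robust.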