Extract an article's headline from HTML (using Boilerpipe)

Question

Boilerpipe lets you extract just the article's text from a webpage, cleaning up all the HTML mess. However, how could I extract the article's headline? One way is to just use the page's title, but it is sometimes incorrect and contains unneeded words (e.g. "title - sitename").

Another idea is to take the text between <h1> and </h1>, but I thought I would still ask for other solutions.
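
For reference, the two ideas above (prefer the first <h1>, otherwise fall back to a cleaned-up page title) could be sketched roughly like this. This is only an illustration using Python's standard-library html.parser, not Boilerpipe's API; the names HeadlineParser and extract_headline are made up for the example, and keeping the part before the first " - " or " | " is an assumption based on the "title - sitename" pattern, which will not hold for every site.

```python
from html.parser import HTMLParser
import re

# Assumed separators between headline and site name, e.g. "title - sitename".
SEPARATOR = re.compile(r"\s+[-|]\s+")


class HeadlineParser(HTMLParser):
    """Collects the text of the first non-empty <h1> and <title>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.h1 = None
        self.title = None
        self._current = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only if we are not already inside a captured tag.
        if self._current is None and tag in ("h1", "title"):
            if getattr(self, tag) is None:
                self._current = tag
                self._buf = []

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from nested markup.
            text = " ".join("".join(self._buf).split())
            if text:
                setattr(self, tag, text)
            self._current = None


def extract_headline(html):
    """Return the first <h1> text, else a trimmed <title>, else None."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    if parser.h1:
        return parser.h1
    if parser.title:
        # Heuristic (assumption): the headline comes before the site name.
        return SEPARATOR.split(parser.title)[0].strip()
    return None
```

Something like extract_headline(page_source) would then be called on the raw HTML, separately from the Boilerpipe text extraction. The heuristic still breaks on sites that put the site name first or use several <h1> elements.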

score 0 · Answer 1 · answered Oct 21 '16 at 09:33

Are you writing a web crawler? I think the difficulty is that you need to know where the title sits within the whole HTML document. Most websites follow their own consistent pattern for writing HTML, and that pattern should be known before the crawler is written.

Daisy Li

Yeah, kinda, only the headline extraction part is needed – Gintas_ Oct 21 '16 at 09:40
So the structure of the HTML is very important, and websites have different structures. It's certainly time-consuming work... – Daisy Li Oct 21 '16 at 09:43
