Exclude menu from content extraction with Tika

Question

I generate HTML documents that contain a menu and a content part. I then want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the content extraction and index only the content.

<div class="menu">my menu goes here</div>
<div class="content">my content goes here</div>

What is the simplest way to achieve this with Apache Tika?

score 3 · Answer 1 · answered Jan 16 '14 at 08:35

As a more general solution (not just for your specific menu), I would advise looking at boilerpipe, which removes uninteresting parts of pages (menus, navigation, etc.).

I know it can be integrated with Solr/Tika; have a look, and you can probably integrate it into your scenario.

score 1 · Answer 2 · edited May 23 '17 at 12:09


Have a look at this post, which describes how to handle DIVs during the HTML parse by specifying whether they are safe to parse; elements that are not safe are ignored. For your problem, you could add logic to the overridden methods so that only DIV elements with the attribute value "menu" are ignored (i.e. tell the Tika parser that this DIV is unsafe to parse).


answered Jan 16 '14 at 10:36

user2683129


score 0 · Answer 3 · answered Jan 15 '14 at 12:40


You can parse the HTML with a parser into an XHTML DOM object and remove the div tag containing the attribute class="menu".
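A minimal sketch of this DOM approach, using only the JDK's built-in XML APIs rather than any particular HTML parser. It assumes the input is already well-formed XHTML without a namespace declaration or DOCTYPE; real-world HTML would first need a tolerant parser to produce such a document. The class name MenuRemover and the method contentWithoutMenu are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MenuRemover {
    // Matches any div whose class attribute contains the token "menu".
    private static final String MENU_DIVS =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' menu ')]";

    /** Returns the text of the document with all menu divs removed. */
    static String contentWithoutMenu(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            NodeList menus = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(MENU_DIVS, doc, XPathConstants.NODESET);

            // Detach every matching div from its parent.
            for (int i = menus.getLength() - 1; i >= 0; i--) {
                Node menu = menus.item(i);
                menu.getParentNode().removeChild(menu);
            }
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input as XHTML", e);
        }
    }
}
```

Called on the two div elements from the question wrapped in html and body tags, contentWithoutMenu returns "my content goes here", which could then be handed to the indexer.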

answered Jan 15 '14 at 12:40

fatih


Can you give me a little more information? Do you mean a Tika parser or some other DOM parser? I thought that Tika works with SAX parsers. – bertolami Jan 15 '14 at 12:52
Yes, just use new HtmlParser().parse(..) of Apache Tika with a SAX handler. – fatih Jan 15 '14 at 13:04
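A minimal sketch of such a SAX handler, written against the JDK's org.xml.sax classes only. It collects character data and skips everything inside a div whose class contains "menu". The class name MenuSkippingHandler and the helper extract are hypothetical; extract feeds well-formed XHTML through the JDK's own SAX parser purely for demonstration. With Tika, the handler would instead be passed as the ContentHandler argument of HtmlParser.parse(..). That assumes the div elements and their class attributes actually reach the handler, which may depend on the HtmlMapper in use.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text, ignoring everything inside <div class="menu">. */
class MenuSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside an excluded div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside the menu
        } else if (isMenuDiv(localName, qName, atts)) {
            skipDepth = 1; // entering the menu div
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--; // leaving the menu div or one of its descendants
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isMenuDiv(String localName, String qName, Attributes atts) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String cls = atts.getValue("class");
        return "div".equalsIgnoreCase(name)
            && cls != null
            && Arrays.asList(cls.trim().split("\\s+")).contains("menu");
    }

    String getText() {
        return text.toString().trim();
    }

    /** Demonstration helper: parses well-formed XHTML with the JDK SAX parser. */
    static String extract(String xhtml) {
        try {
            MenuSkippingHandler handler = new MenuSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

The depth counter is what keeps nested markup inside the menu (lists, links) excluded as well: text is collected again only once the matching closing div has been seen.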
