I generate HTML documents that contain a menu and a content part, and I want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the extraction so that only the content is indexed.

<div class="menu">my menu goes here</div>
<div class="content">my content goes here</div>

What is the simplest way to achieve this with Apache Tika?

Trinadh Gupta
bertolami

3 Answers

As a more general solution (not just for your specific menu), I would advise looking at boilerpipe, which removes uninteresting parts of pages (menus, navigation, etc.).

I know it can be integrated with Solr/Tika; have a look, since you can probably integrate it into your scenario as well.

Persimmonium

Have a look at this post, which describes how to handle DIVs during the HTML parse by specifying whether they are safe to parse; elements marked as unsafe are ignored. For your problem, the overridden methods could contain logic that ignores only DIV elements whose class attribute has the value "menu" (i.e. tells the Tika parser that this DIV is unsafe to parse).

Community

You can parse the HTML into an XHTML DOM object and remove the div element that has the attribute class="menu".
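
A minimal sketch of this DOM approach, using only JDK classes. It assumes the document is already available as a well-formed XHTML string (for example, Tika's SAX output serialised with something like ToXMLContentHandler; that step is an assumption and is not shown). The names MenuStripper, contentWithoutMenu and hasClass are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes <div class="menu"> subtrees from an XHTML
// string and returns the remaining body text.
final class MenuStripper {

    static String contentWithoutMenu(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // getElementsByTagName returns a live list, so collect the
            // matches first and remove them in a second pass.
            NodeList divs = doc.getElementsByTagName("div");
            List<Element> menus = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if (hasClass(div, "menu")) {
                    menus.add(div);
                }
            }
            for (Element menu : menus) {
                Node parent = menu.getParentNode();
                if (parent != null) {
                    parent.removeChild(menu);
                }
            }

            // Prefer the <body> text so that <head> content (title, meta)
            // is not indexed; fall back to the root element otherwise.
            Node body = doc.getElementsByTagName("body").item(0);
            Node root = (body != null) ? body : doc.getDocumentElement();
            return root.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    // True if the element's class attribute contains the given token.
    private static boolean hasClass(Element element, String wanted) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (wanted.equals(token)) {
                return true;
            }
        }
        return false;
    }
}
```

For the two divs from the question wrapped in html and body elements, contentWithoutMenu returns only the text of the content div. Building a DOM keeps the whole document in memory, which is usually fine for single pages.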

fatih
  • Can you give me a little bit more information? Do you mean a Tika parser or any other DOM parser? I thought that Tika works with SAX parsers. – bertolami Jan 15 '14 at 12:52
  • Yes, just use new HtmlParser().parse(..) of Apache Tika with a SAX handler. – fatih Jan 15 '14 at 13:04
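
The SAX handler mentioned in the comments could look roughly like the sketch below. It is a plain org.xml.sax handler with no Tika dependency, so the class name MenuSkippingHandler and the extract helper are illustrative only. The demo method drives it with the JDK's SAX parser on well-formed XHTML; with Tika, the same handler would be passed to HtmlParser.parse(stream, handler, metadata, context). Depending on the Tika version, the default HtmlMapper may drop div elements or their class attribute, and registering IdentityHtmlMapper in the ParseContext is one way to keep them (an assumption, not verified here).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data and skips everything
// inside <div class="menu">, including nested elements.
final class MenuSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Nesting depth inside a skipped subtree; 0 means "outside any menu div".
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++;            // element nested inside the menu
        } else if (isMenuDiv(localName, qName, atts)) {
            skipDepth = 1;          // entering the menu div itself
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;            // leaving a nested element or the menu div
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    private static boolean isMenuDiv(String localName, String qName, Attributes atts) {
        // Non-namespace-aware parsers report an empty local name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (!"div".equalsIgnoreCase(name)) {
            return false;
        }
        String cls = atts.getValue("class");
        if (cls == null) {
            cls = atts.getValue("", "class");
        }
        if (cls == null) {
            return false;
        }
        for (String token : cls.trim().split("\\s+")) {
            if ("menu".equals(token)) {
                return true;
            }
        }
        return false;
    }

    // Demo driver using the JDK SAX parser on well-formed XHTML.
    static String extract(String xhtml) {
        try {
            MenuSkippingHandler handler = new MenuSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

The depth counter is what makes nested markup inside the menu safe to skip: text is only collected again once the closing tag of the menu div itself has been seen. The resulting string can then be added to the Lucene document in place of the full page text.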