I am trying to build a custom XPath ContentHandler for Tika that recognizes complex XPath expressions, reusing code from org/apache/tika/sax/BodyContentHandler.java (because I already use Tika for other things).

This xpath works

/xhtml:html/xhtml:body/descendant:node()

but this does not

//xhtml:div[@id='someid']/descendant:node()

I want to combine Tika's ContentHandler (because it fixes unbalanced tags and invalid characters in HTML content) with the XPath evaluator from javax.xml.xpath. What is the proper way to do that? Is there a way to get an InputSource once Tika has parsed and fixed the HTML content?

surajz

1 Answer

The XPath feature included in Tika only supports a subset of XPath features (see XPathParser for details). For more complex XPath queries I recommend using something like javax.xml.xpath.
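As a rough illustration of the javax.xml.xpath route, the sketch below collects SAX events into a DOM tree with a JDK `TransformerHandler` and then evaluates the failing expression against that tree. It is only a sketch under assumptions: the class name `XhtmlXPathSketch` and the helpers `toDom` and `select` are made up for this example, and a plain JDK SAX parser stands in for Tika as the source of events. Since `TransformerHandler` is an ordinary SAX `ContentHandler`, it should be possible to hand it to Tika's `Parser.parse(stream, handler, metadata, context)` instead, but that part is not shown or tested here. Note that javax.xml.xpath expects the standard axis syntax with a double colon, `descendant::node()`.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class for illustration; not part of Tika.
class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Collects SAX events into a DOM tree via an identity TransformerHandler.
    // Assumption: a JDK SAX parser produces the events here; with Tika, the
    // same handler would be passed to the parser's parse(...) call instead.
    static Node toDom(String xhtml) throws Exception {
        SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = stf.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return result.getNode();
    }

    // Evaluates an XPath expression with the "xhtml" prefix bound to XHTML_NS.
    static NodeList select(Node root, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "xhtml" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singleton(prefix).iterator();
            }
        });
        return (NodeList) xpath.evaluate(expression, root, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\"><body>"
                + "<div id=\"someid\"><p>Hello</p></div>"
                + "<div id=\"other\"><p>Skip</p></div>"
                + "</body></html>";
        NodeList nodes = select(toDom(xhtml),
                "//xhtml:div[@id='someid']/descendant::node()");
        // Print each matched node's name and text content.
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            System.out.println(node.getNodeName() + ": " + node.getTextContent());
        }
    }
}
```

Building a DOM keeps the whole document in memory, which is the price of full XPath support; Tika's built-in matcher avoids that by working directly on the SAX stream.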

Jukka Zitting