I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now, I can extract the entire content of the XML, but the tags are lost:
// Set up the content handler and metadata
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try
{
inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
}
catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
ParseContext pcontext = new ParseContext();
//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
This shows me the entire text content of the XML.
Now I want to extract only certain parts of the XML and, since Tika allows XPath queries, I tried this:
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
ContentHandler xhandler = new MatchingContentHandler(
new ToXMLContentHandler(), divContentMatcher);
AutoDetectParser parser = new AutoDetectParser();
Metadata xmetadata = new Metadata();
try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
parser.parse(stream, xhandler, xmetadata);
System.out.println(xhandler.toString());
} catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
But it does not show any output! I was hoping it would give me only the nodes specified in the XPath query.
Any idea what's going on?
By the way, here is the corresponding XML:
<Product productID="xvc22" shortProductID="x" language="en">
<ProductStatus statusType="Published" />
<Source>
<Publisher sequence="1" primaryIndicator="Yes">
<PublisherID idType="Shortname">jjkjkj</PublisherID>
<PublisherID idType="BM">6666</PublisherID>
<PublisherName nameType="Legal">ABT</PublisherName>
<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>
</Publisher>
</Source>
</Product>
Also, when I test the query on
http://www.freeformatter.com/xpath-tester.html
I see the correct result, i.e.
Element='<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>'
Is this some syntax issue with Java or Tika?
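For comparison, here is a minimal sketch that reproduces the kind of result the online tester shows (the matched element with its tags) using only the JDK's DOM, XPath and Transformer APIs, without Tika. The class name `XPathNodeMarkup` and the method `firstMatchAsXml` are hypothetical names made up for this illustration, and the XML is passed in as a string rather than read from `xml/a.xml`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Tika or of the code above.
class XPathNodeMarkup {

    // Returns the first node matched by the XPath expression, serialized
    // back to XML markup, or null if nothing matches.
    public static String firstMatchAsXml(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Node match = (Node) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
            if (match == null) {
                return null;
            }

            // An identity transform writes the DOM node out as markup.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(match), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Product><Source><Publisher>"
                + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
                + "<PublisherName nameType=\"Person\">"
                + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
                + "</PublisherName>"
                + "</Publisher></Source></Product>";
        System.out.println(firstMatchAsXml(xml,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

With standard XPath 1.0 tooling the expression selects the expected element, so the expression itself looks valid; what I can't tell is whether Tika's XPathParser accepts the same syntax.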
EDIT
Note that if I parse without Tika, it works:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");
System.out.println(expr.evaluate(doc, XPathConstants.STRING));
This prints out
pppp
lkkk
which is perfect. So why can't Tika parse the XPath query?
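To make the end goal concrete, this is roughly what I would like to end up with: a minimal sketch, again using only the JDK XPath API rather than Tika, that puts the child elements of the matched node into a map. The class `XPathToMap` and the method `extract` are hypothetical names for this illustration, and keying the map by tag name is just an assumption about the layout I want.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Tika.
class XPathToMap {

    // Maps the tag name of each child element of the matched nodes to its
    // trimmed text content. Later duplicates overwrite earlier ones.
    public static Map<String, String> extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList matches = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);

            // LinkedHashMap keeps the elements in document order.
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < matches.getLength(); i++) {
                NodeList children = matches.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace-only text nodes between elements.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        result.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Product><Source><Publisher>"
                + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
                + "<PublisherName nameType=\"Person\">"
                + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
                + "</PublisherName>"
                + "</Publisher></Source></Product>";
        // prints {LastName=pppp, FirstName=lkkk}
        System.out.println(extract(xml,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

This works for the sample file, but it loads the whole document into a DOM, which is what I was hoping to avoid by going through Tika's content handlers.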