I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now, I can extract the entire content of the XML, but the tags are lost:

    // Collects the document's text content (tags are not preserved)
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    FileInputStream inputstream = null;

    try {
        inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
    } catch (URISyntaxException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    ParseContext pcontext = new ParseContext();

    // XML parser
    XMLParser xmlparser = new XMLParser();
    xmlparser.parse(inputstream, handler, metadata, pcontext);
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    String[] metadataNames = metadata.names();

    for (String name : metadataNames) {
        System.out.println(name + ": " + metadata.get(name));
    }

This shows me the entire text content of the XML.

Now I want to extract only certain parts of the XML, and since Tika allows XPath queries, I tried this:

    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
    ContentHandler xhandler = new MatchingContentHandler(new ToXMLContentHandler(), divContentMatcher);

    AutoDetectParser parser = new AutoDetectParser();
    Metadata xmetadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
        parser.parse(stream, xhandler, xmetadata);
        System.out.println(xhandler.toString());
    } catch (URISyntaxException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

But it does not show any output! I was hoping it would give me only the nodes specified in the XPath query.

Any idea what's going on?

By the way, here is the corresponding XML:

    <Product productID="xvc22" shortProductID="x" language="en">
      <ProductStatus statusType="Published" />
      <Source>
        <Publisher sequence="1" primaryIndicator="Yes">
          <PublisherID idType="Shortname">jjkjkj</PublisherID>
          <PublisherID idType="BM">6666</PublisherID>
          <PublisherName nameType="Legal">ABT</PublisherName>
          <PublisherName nameType="Person">
            <LastName>pppp</LastName>
            <FirstName>lkkk</FirstName>
          </PublisherName>
        </Publisher>
      </Source>
    </Product>

Also, when I test the query on

http://www.freeformatter.com/xpath-tester.html

I see the correct result, i.e.

Element='<PublisherName nameType="Person">
  <LastName>pppp</LastName>
  <FirstName>lkkk</FirstName>
</PublisherName>'

Is this some syntax issue with Java or Tika?

EDIT

Note that if I parse without Tika, it works:

      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
      XPathFactory xPathfactory = XPathFactory.newInstance();
      XPath xpath = xPathfactory.newXPath();
      XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");

      System.out.println(expr.evaluate(doc, XPathConstants.STRING));

This prints out

pppp
lkkk

which is perfect. So why can't Tika parse the XPath query?
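For reference, the original goal (tags with their content in a HashMap) can be sketched with the same JDK-only approach, leaving Tika aside. This is a minimal sketch, not a fix for the Tika behaviour: the class name `PublisherNameExtractor` and the helper `extractChildren` are hypothetical names made up for illustration, and the XML is inlined as a string instead of being loaded from `xml/a.xml`. It evaluates the XPath as a NODESET rather than a STRING, so the element names are still available:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class PublisherNameExtractor {

    // Hypothetical helper: selects the child elements of the node(s) matched
    // by xpathExpr and maps each child's tag name to its trimmed text content.
    static Map<String, String> extractChildren(String xml, String xpathExpr) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // "/*" appended to the expression selects the element children of the match.
        NodeList children = (NodeList) xpath.evaluate(xpathExpr + "/*", doc, XPathConstants.NODESET);

        // LinkedHashMap keeps the insertion order of the child elements.
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            result.put(child.getNodeName(), child.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down copy of the XML from the question, inlined for the sketch.
        String xml =
            "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
            + "<Source><Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
            + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
            + "<PublisherName nameType=\"Person\">"
            + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
            + "</PublisherName>"
            + "</Publisher></Source></Product>";

        Map<String, String> person = extractChildren(
            xml, "/Product/Source/Publisher/PublisherName[@nameType='Person']");
        System.out.println(person); // prints {LastName=pppp, FirstName=lkkk}
    }
}
```

Note that a map keyed by tag name keeps only the last value when sibling elements repeat (as the `PublisherID` elements do in the full document), so a `Map<String, List<String>>` may be a better fit in that case.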

AbtPst
  • You appear to be asking Tika for the plain-text version of your document, which is unsurprisingly why the tags are removed. What happens if you ask Tika for the XHTML version of your document instead? – Gagravarr Nov 09 '15 at 16:20
  • Thanks, please see the edit. Is that what you were talking about? – AbtPst Nov 09 '15 at 16:24
  • Please see the edit. I have made a few changes. – AbtPst Nov 09 '15 at 22:35

0 Answers