I am trying to learn how to use XPath expressions in Java. I am using JTidy to convert an HTML page to XHTML so that I can query it with XPath expressions. I have the following code:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = ConvertXHTML("https://twitter.com/?lang=fr");

    // Create the XPath evaluator
    XPathFactory xpathfactory = XPathFactory.newInstance();
    XPath Inst = xpathfactory.newXPath();
    NodeList nodes = (NodeList) Inst.evaluate("//p/@align", doc, XPathConstants.NODESET);
    for (int i = 0; i < nodes.getLength(); ++i) {
        // "//p/@align" selects attribute nodes, so each item is an Attr;
        // casting it to Element would throw a ClassCastException
        Node n = nodes.item(i);
        System.out.println(n.getNodeValue());
    }

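    public Document ConvertXHTML(String link) {
        // try-with-resources closes both streams even if parsing fails
        try (BufferedInputStream instream = new BufferedInputStream(new URL(link).openStream());
             FileOutputStream outstream = new FileOutputStream("out.xhtml")) {

            Tidy c = new Tidy();
            c.setShowWarnings(false);
            c.setInputEncoding("UTF-8");
            c.setOutputEncoding("UTF-8");
            c.setXHTML(true);

            // Writes the tidied XHTML to out.xhtml and returns JTidy's own DOM
            return c.parseDOM(instream, outstream);
        } catch (IOException e) {
            // Assumption: the original catch block was not shown, so rethrow unchecked here
            throw new UncheckedIOException(e);
        }
    }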

It works fine for most URLs, but not for this one:

https://twitter.com/?lang=fr

With that URL I get this exception:

javax.xml.transform.TransformerException: Index -1 out of bounds.....

Here is part of the stack trace:

javax.xml.transform.TransformerException: Index -1 out of bounds for length 128
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:366)
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:303)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImplUtil.eval(XPathImplUtil.java:101)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.eval(XPathExpressionImpl.java:80)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:89)
at files.ExampleCode.GetThoselinks(ExampleCode.java:50)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
at files.ExampleCode.GetThoselinks(ExampleCode.java:81)
at files.ExampleCode.DoSomething(ExampleCode.java:113)

I am not sure whether the problem lies in the converted XHTML of the website or somewhere else. Can anyone tell me what is wrong with the code? Any suggestions would be helpful.

A Beginner

2 Answers

I would normally say that an index-out-of-bounds exception coming from deep within the XPath engine is a bug in the XPath engine. The only caveat is if there is something structurally wrong with the DOM that the XPath engine is searching: an XPath processor is entitled to assume that the DOM is valid, and to crash if it isn't. In that case it would be a bug in Tidy, which created the DOM.

Michael Kay
  • Let's say the problem is in Tidy and it didn't give me proper XHTML. Is there any check I can do in XPath so that it skips evaluating empty nodes? – A Beginner Nov 05 '18 at 10:22
  • I think my strategy at this stage would be either (a) to build a project containing the source code of the Apache XPath and Tidy projects, try to reproduce the crash, and debug it, or (b) to switch to alternative libraries, e.g. validator.nu instead of Tidy, and Saxon or Jaxen in place of Apache XPath. You could also try (c) getting help from those who support the libraries, but in the case of the XPath libraries I wouldn't hold my breath. – Michael Kay Nov 05 '18 at 15:24

I had a similar problem using XPath evaluation on a document produced by JTidy. I got around it by having JTidy serialize the DOM it produced to a file, and then parsing that XML file with javax.xml.parsers.DocumentBuilder to get a second version of the DOM. Bizarre as it seems, using the second DOM avoided the out-of-bounds exception and worked. Use code like the following:

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        // If you don't do the following, it will take a full minute to parse the XML document (presumably the time-out
        // period for trying to load the DTD). See https://stackoverflow.com/questions/6204827/xml-parsing-too-slow.
        documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
        // tidy is a configured org.w3c.tidy.Tidy and input is the HTML InputStream (both defined elsewhere)
        Document doc = tidy.parseDOM(input, null);
        // Serialize JTidy's DOM to a file; try-with-resources closes the stream even on failure
        try (FileOutputStream fos = new FileOutputStream("temp.xml")) {
            tidy.pprint(doc, fos);
        }
        // Re-parse the file with the JDK parser to get the second DOM
        doc = documentBuilder.parse("temp.xml");
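
The re-parse step can also be done in memory instead of through temp.xml. The following is only a sketch under stated assumptions: it uses JDK classes alone, the class name ReparseSketch and the SAMPLE string are hypothetical stand-ins for what JTidy would serialize, and it has not been tried against the Twitter page. It also queries with local-name(), because a namespace-aware parser puts XHTML elements in the XHTML namespace whenever the serialized document declares it, and a plain "//p/@align" would then select nothing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ReparseSketch {

    // Hypothetical stand-in for serialized JTidy output; not real output from any site.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head><body>"
        + "<p align=\"center\">one</p><p>two</p><p align=\"left\">three</p>"
        + "</body></html>";

    // Build a fresh DOM from serialized XHTML bytes using the JDK parser, with no temp file.
    static Document reparse(byte[] xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Same feature as in the answer above: do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xhtml));
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-parse the serialized XHTML", e);
        }
    }

    // Collect the value of every align attribute found on a p element, in document order.
    static List<String> alignValues(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name() matches p elements regardless of their namespace.
            NodeList attrs = (NodeList) xpath.evaluate(
                "//*[local-name()='p']/@align", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                // Each item is an attribute node, so read its value rather than casting to Element.
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = reparse(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println(alignValues(doc));
    }
}
```

With JTidy, the bytes passed to reparse would presumably come from calling tidy.pprint(doc, out) on a java.io.ByteArrayOutputStream and then out.toByteArray(), in place of the FileOutputStream used above.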

user3969107