
I want to extract the values of single-quoted HTML attributes using XPath. I used JTidy to clean the HTML document, and my code looks like this:

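// Imports the snippet relies on (JTidy plus the JDK's DOM and XPath APIs)
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

try {
    // 'string' holds the raw HTML of the page
    String data = string.toString();
    InputStream input = new ByteArrayInputStream(data.getBytes());
    // Let JTidy clean the markup and build a DOM; null means no cleaned output is written
    Document document = new Tidy().parseDOM(input, null);

    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    // Select the attribute on every matching <a> element
    XPathExpression expr = xPath.compile("//a[@class='swatch-2011-link']/@color");

    Object evaluate = expr.evaluate(document, XPathConstants.NODESET);
    NodeList list = (NodeList) evaluate;
    // Print the number of matches, then each attribute value
    System.out.println(list.getLength());
    for (int i = 0; i < list.getLength(); i++) {
        String name = list.item(i).getNodeValue();
        System.out.println(name);
    }
}
catch (XPathExpressionException e) {
    e.printStackTrace();
}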

This is the element I am trying to read the attributes from:

<a class="swatch-2011-link" 
   style='background:url(somelink); background-size:26px 26px; filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://media.plussizetech.com/womanwithin/zs/0037_19561_zs_2835.jpg, sizingMethod=scale)' mainimageUrl='http://media.plussizetech.com/womanwithin/mc/0037_19561_mc_2835.jpg?wid=271&amp;hei=388&amp;qlt=95&amp;op_sharpen=1' colorName='WILD LIME WHITE'/>
  • And does it work? If not, how does it fail? – Michael Kay Jul 24 '13 at 09:09
  • In your example, there is no attribute `color` (which you are selecting in your XPath), only `colorName`. – tfoo Jul 24 '13 at 09:13
  • It fails; I get an empty node list. The `color` in the XPath expression is a typo, and it fails even with `colorName`. – user2613481 Jul 25 '13 at 06:10
  • I am not able to extract any attribute whose value is in single quotes. In this case, those are `mainimageUrl` and `colorName`. – user2613481 Jul 25 '13 at 06:13
  • I tried cleaning the document with both JTidy and HtmlCleaner. It fails in both cases. – user2613481 Jul 25 '13 at 06:13
  • It's only a guess, but I think it might have something to do with namespaces. If JTidy creates a document in the XHTML namespace, your XPath must also use this namespace. To check, replace the second parameter of the `parseDOM` call (currently `null`) with `System.out` and see whether the printed XML has a namespace declaration. – obecker Jul 25 '13 at 19:35
  • I am able to extract the values of other double quoted attributes from the same HTML page. So that isn't the problem I think. – user2613481 Jul 27 '13 at 13:55
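
One way to narrow this down, in the spirit of the comments above, is to check what the parsed DOM actually contains rather than what the source markup says. The sketch below is a minimal, hedged illustration and not a diagnosis: it assumes the JDK's built-in XML parser instead of JTidy (so it says nothing about what JTidy or HtmlCleaner do with these attributes), and the class name `AttributeProbe`, its helper methods, and the cut-down `SAMPLE` markup are hypothetical. It shows two things that hold for XPath over a standard DOM: the quote style of an attribute value makes no difference once the markup is parsed, and attribute names are matched case-sensitively.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

class AttributeProbe {

    // Cut-down version of the element from the question; values stay single-quoted.
    static final String SAMPLE =
        "<div><a class=\"swatch-2011-link\""
        + " mainimageUrl='http://media.plussizetech.com/womanwithin/mc/0037_19561_mc_2835.jpg'"
        + " colorName='WILD LIME WHITE'/></div>";

    // Parses well-formed markup with the JDK's XML parser (assumption: NOT JTidy).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Evaluates an XPath expression and returns the node value of every match.
    static List<String> values(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    // Lists the attribute names the parser actually produced on the matched elements.
    static List<String> attributeNames(Document doc, String elementExpression) {
        try {
            NodeList elements = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(elementExpression, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                NamedNodeMap attrs = elements.item(i).getAttributes();
                for (int j = 0; j < attrs.getLength(); j++) {
                    names.add(attrs.item(j).getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + elementExpression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // Single-quoted value, exact attribute name: prints [WILD LIME WHITE]
        System.out.println(values(doc, "//a[@class='swatch-2011-link']/@colorName"));
        // Same attribute, different case: prints []
        System.out.println(values(doc, "//a[@class='swatch-2011-link']/@colorname"));
        // Attribute names as stored in the DOM (order is parser-dependent)
        System.out.println(attributeNames(doc, "//a[@class='swatch-2011-link']"));
    }
}
```

Running the same `attributeNames` check against the `Document` returned by `parseDOM` would show whether the cleaned DOM still contains `colorName` and `mainimageUrl` under exactly those names, or whether the cleaner has renamed or dropped them. If the names differ from the source markup, the XPath expression has to use the names as they appear in the DOM.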

0 Answers