XPath: How to retrieve the value of a table cell from an HTML document

Question

I have an HTML document, and somewhere inside it is the table shown below. I can get the table rows as Java DOM objects. What is not clear to me is how to extract the value of a table cell, both when the value is a string and when it is a binary resource.

I am using code like:

    XPath xpath;
    XPathExpression expr;
    NodeList nodes = null;
    // Use XPath to obtain whatever you want from the (X)HTML
    try {

        xpath = XPathFactory.newInstance().newXPath();
        // <table class="data">

        NodeList list = doc.getElementsByTagName("table");
        // Node node = list.item(0);
        // System.out.println(node.getTextContent());
        // String textContent = node.getTextContent();

        expr = xpath.compile("//table/tr/td");
        nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

and looping like:

     for (int i = 0; i < nodes.getLength(); i++) {

       Node ln = list.item(i);
       String lnText=ln.toString();
       NodeList rowElements=ln.getChildNodes();
       Node one=rowElements.item(0);

       String oneText=one.toString();
       String nodeName=one.getNodeName();
       String valOne = one.getNodeValue();

But I am not seeing the values in the table.

 <table class="data">
 <tr><td>ImageName1</td><td width="50"></td><td><img src="/images/036000291452" alt="036000291452" /></td></tr>
 <tr><td>ImageName2</td><td width="50"></td><td><img src="/images/36000291452" alt="36000291452" /></td></tr>
 <tr><td>Description</td><td></td><td>Time Magazine</td></tr>
 <tr><td>Size/Weight</td><td></td><td>14 Issues</td></tr>
 <tr><td>Issuing Country</td><td></td><td>United States</td></tr>
  </table>

Good question, +1. See my answer for a complete short and easy one-liner XPath solution to this problem. — Dimitre Novatchev, May 09 '11 at 03:09

score 1 · Accepted Answer · answered May 09 '11 at 03:06

This XPath expression:

/*/tr[1]/td[1]

selects the td element (in no namespace) that is the first child of the first tr child of the top element (table) of the provided XML document.

The XPath expression:

/*/tr[1]/td[2]

selects the td element (in no namespace) that is the second child of the first tr child of the top element (table) of the provided XML document.

In general:

/*/tr[$m]/td[$n]

selects the td element (in no namespace) that is the $n-th child of the $m-th tr child of the top element (table) of the provided XML document. Just replace $m and $n with the desired integer values.
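As a rough sketch of how `$m` and `$n` could be supplied from Java rather than substituted by hand, the JAXP `XPathVariableResolver` hook can bind them at evaluation time. The class name `CellByVariables`, the `parse` helper, and the two-row sample are assumptions made for illustration; the sample places `table` as the top element, as the answer assumes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CellByVariables {
    // Hypothetical sample: two rows of the question's table, with <table> as the top element.
    static final String TABLE =
        "<table class=\"data\">"
        + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
        + "<tr><td>Size/Weight</td><td></td><td>14 Issues</td></tr>"
        + "</table>";

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Selects the n-th td child of the m-th tr child of the top element.
    static Node cell(Document doc, int m, int n) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $m and $n as XPath numbers; the resolver is set before the expression is evaluated.
        xpath.setXPathVariableResolver(name -> {
            switch (name.getLocalPart()) {
                case "m": return Double.valueOf(m);
                case "n": return Double.valueOf(n);
                default:  return null; // unknown variable
            }
        });
        try {
            return (Node) xpath.evaluate("/*/tr[$m]/td[$n]", doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(TABLE);
        System.out.println(cell(doc, 1, 3).getTextContent());
        System.out.println(cell(doc, 2, 1).getTextContent());
    }
}
```

Binding the indexes as variables keeps the expression text constant. Plain string concatenation of the integers into the expression would work as well for trusted values.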

You can use the standard XPath function string() to obtain their string value:

string(/*/tr[$m]/td[$n])

evaluates to the string value of the td element (in no namespace) that is the $n-th child of the $m-th tr child of the top element (table) of the provided XML document.
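For the Java side, a minimal sketch of evaluating the `string(...)` form might look like the following. It assumes the table has been parsed as well-formed XML with `table` as the top element and no default namespace; the class name `CellText` and its helpers are hypothetical. The key point is to request `XPathConstants.STRING` as the return type, since the expression yields a string and not a node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CellText {
    // Three rows taken from the question's table.
    static final String TABLE =
        "<table class=\"data\">"
        + "<tr><td>ImageName1</td><td width=\"50\"></td>"
        + "<td><img src=\"/images/036000291452\" alt=\"036000291452\" /></td></tr>"
        + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
        + "<tr><td>Size/Weight</td><td></td><td>14 Issues</td></tr>"
        + "</table>";

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // String value of the n-th td of the m-th tr; empty string if no such cell exists.
    static String cellText(Document doc, int m, int n) {
        String expr = "string(/*/tr[" + m + "]/td[" + n + "])";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (String) xpath.evaluate(expr, doc, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(TABLE);
        XPath xpath = XPathFactory.newInstance().newXPath();
        // count() yields an XPath number, which JAXP returns as a Double.
        int rows = ((Double) xpath.evaluate("count(/*/tr)", doc, XPathConstants.NUMBER)).intValue();
        for (int m = 1; m <= rows; m++) {
            System.out.println(cellText(doc, m, 1) + " -> " + cellText(doc, m, 3));
        }
    }
}
```

A cell that contains only an `img` element has an empty string value, so the image rows need a separate attribute lookup such as `td/img/@src`.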

answered May 09 '11 at 03:06

Dimitre Novatchev

Now, in terms of executing this expression, I am using the Java API. So I am executing as follows: – Androider May 09 '11 at 05:15
XPathExpression exp = xpath.compile("string(/*/tr[3]/td[1])"); String val =(String) exp.evaluate(doc, XPathConstants.STRING); – Androider May 09 '11 at 05:16
But I am not getting a string value back. Could you comment on the execution of this expression? – Androider May 09 '11 at 05:17
XPathExpression exp = xpath.compile("string(//*/tr[3]/td[1])"); Node val =(Node) exp.evaluate(doc, XPathConstants.NODE); returns a value but note the extra / and it is not a string value. – Androider May 09 '11 at 05:18
@Androider: I am not a Java programmer; you need to read the documentation, work through examples, and understand these APIs. Also, the (lack of) results you get is perfectly explainable if the document is in a default namespace. You never showed a complete XML document. I anticipated that the document could be in a default namespace, which is why I always say "selects the `td` element (in no namespace)": if there is a default namespace, none of these expressions selects anything. Please edit your question to present the complete (as small as possible) XML document. – Dimitre Novatchev May 09 '11 at 12:45
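The default-namespace effect described in the comment above can be reproduced with a small sketch. It assumes an XHTML-style document parsed with a namespace-aware `DocumentBuilderFactory`; the class name `NamespacedCells`, the prefix `h`, and the one-row sample are illustrative choices, not something the thread specifies. Two common workarounds are shown: binding a prefix through a `NamespaceContext`, and matching on `local-name()`.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespacedCells {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: the same kind of table, but in the XHTML default namespace.
    static final String TABLE =
        "<table xmlns=\"" + XHTML_NS + "\" class=\"data\">"
        + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
        + "</table>";

    // Parses with namespace awareness switched on, so elements carry their namespace URI.
    static Document parseNamespaceAware(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Evaluates an expression to a string, with the prefix "h" bound to the XHTML namespace.
    static String eval(String expr, Document doc) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        try {
            return xpath.evaluate(expr, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseNamespaceAware(TABLE);
        // Unprefixed names mean "no namespace", so this matches nothing here.
        System.out.println("[" + eval("string(/*/tr[1]/td[1])", doc) + "]");
        // Prefixed names resolved through the NamespaceContext.
        System.out.println(eval("string(/h:table/h:tr[1]/h:td[3])", doc));
        // Namespace-agnostic alternative.
        System.out.println(eval("string(/*/*[local-name()='tr'][1]/*[local-name()='td'][1])", doc));
    }
}
```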

score -1 · Answer 2 · answered May 09 '11 at 02:13

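Use a path like "string(//td)" to get the string contents of a cell; note that in XPath 1.0, string() applied to a node-set returns the string value of only the first node in document order, so each cell has to be selected individually. For linked resources, you will need to use something like "//td/img/@src" to get the URLs, then resolve them against the source URL, and fetch the resulting URL from the network.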
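A possible sketch of the image part of this suggestion: collect the `src` attributes with `//td/img/@src` and resolve each one against the page URL using `java.net.URI`. The base URL `http://example.com/products/page.html` is a made-up placeholder, and the class name `ImageLinks` is hypothetical. Actually downloading the bytes is left out.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImageLinks {
    // Three rows taken from the question's table.
    static final String TABLE =
        "<table class=\"data\">"
        + "<tr><td>ImageName1</td><td width=\"50\"></td>"
        + "<td><img src=\"/images/036000291452\" alt=\"036000291452\" /></td></tr>"
        + "<tr><td>ImageName2</td><td width=\"50\"></td>"
        + "<td><img src=\"/images/36000291452\" alt=\"36000291452\" /></td></tr>"
        + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
        + "</table>";

    // Hypothetical URL of the page the table came from.
    static final URI BASE = URI.create("http://example.com/products/page.html");

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Returns every img src found inside a td, resolved against the base URL, in document order.
    static List<String> imageUrls(Document doc, URI base) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> urls = new ArrayList<>();
        try {
            NodeList srcs = (NodeList) xpath.evaluate("//td/img/@src", doc, XPathConstants.NODESET);
            for (int i = 0; i < srcs.getLength(); i++) {
                urls.add(base.resolve(srcs.item(i).getNodeValue()).toString());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return urls;
    }

    // string(//td) converts the node-set to the string value of its first node only.
    static String firstCellText(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate("string(//td)", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(TABLE);
        System.out.println(firstCellText(doc));
        imageUrls(doc, BASE).forEach(System.out::println);
    }
}
```

Each resolved URL could then be fetched with any HTTP client to obtain the binary resource.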

answered May 09 '11 at 02:13

Tassos Bassoukos

OK. How exactly would one apply this XPath to the table? Let's say I want Description: Time Magazine. – Androider May 09 '11 at 02:51
I mean my path is giving me a row of td's. But when I retrieve the tds, the value of the td is not a text value that can be printed. getTextValue and getContentValue do not return the values. How do you index the cells using string("//td")? Thanks. – Androider May 09 '11 at 02:53
string("//td") does not really help to get these out by index. I provided the exact table. I need to see an indexable way to do this. – Androider May 09 '11 at 02:59
XPath How to retrieve the value of a table cell from html document – Androider May 09 '11 at 03:00
