0

I'm using the COBRA HTMLParser but haven't had luck parsing one particular tag. Here's the source:

<li id="eta" class="hentry">
  <span class="body">
    <span class="actions">
    </span>
    <span class="content">
    </span>
    <span class="meta entry">Content here
    </span>
    <span class="meta entry stub">Content here
    <span class="shared-content">
      Information by
      <a class="title" data="associate" href="/associate">Associate</a>
    </span>
    </span>
  </span>
</li>

I am able to use the following XPaths to get the proper information:

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodeList = (NodeList) xpath.evaluate("//span[contains(@class, 'body')]", document, XPathConstants.NODESET);
            int length = nodeList.getLength();
            System.out.println(nodeList.getLength());
            for(int i = 0; i < length; i++) {
                Element element = (Element) nodeList.item(i);
                NodeList n = null;
                try {
                    n = (NodeList) xpath.evaluate("span[contains(@class, 'content')]", element, XPathConstants.NODESET);
                    String body = n.item(0).getTextContent();
                    System.out.println("Content: " + body);
                } catch (Exception e) {};

                try {

                    String date = (String) xpath.evaluate("span[contains(@class, 'meta entry')]/a/span/@data", element, XPathConstants.STRING);
                    System.out.println("DATA: " + date);

                    String source = (String) xpath.evaluate("//span[contains(@class, 'meta entry')]/span", element, XPathConstants.STRING);
                    System.out.println("DATA: " + source);

                } catch (Exception e) {};

                //This does not work at all! I've tried every combination and still can't get it to run
                try {
                    String info = (String) xpath.evaluate("//span[@class='shared-content']/a/@data", element, XPathConstants.STRING);
                    System.out.println("INFO: " + info);
                } catch (Exception e) {};

            }

The last expression does not work, whatever combination I try. I've also tried the following, but it doesn't help:

        String info = (String) xpath.evaluate("//span[contains(@class, 'shared-content')]/a/@data", element, XPathConstants.STRING);
        String info = (String) xpath.evaluate("//span[contains(@class, 'meta entry info')]/span/a/@data", element, XPathConstants.STRING);

Any suggestions?

EDIT: A couple of people have suggested that the XML is illegal. Honestly, I'm not sure why it would be, because I've seen this pattern almost everywhere. In any case, I don't have control over the XML (at least not until Monday, when my other pals get back). I am trying to assess the feasibility of writing a mashup that includes this information. Is there some way to disable the checking?

Here's the XML that was parsed:

       <?xml version="1.0" encoding="UTF-8"?>
          <span class="body">
            <span class="content">TextContent</span>
            <span class="meta entry">TextContent</span>

          </span>

I guess the document is not getting parsed correctly.

Legend
  • What exactly do you mean by "does not work"? Do you get a wrong result - if so, what is it? Or do you get an exception - and if so, what is it? – Pavel Minaev Nov 26 '09 at 22:36
  • The XML is perfectly fine. If it was wrong, your XML parser would throw an exception anyway, and you wouldn't get any other of your XPath calls to work. – Pavel Minaev Nov 26 '09 at 22:37
  • It just gives me a blank string. I mean no data. At least it doesn't return a null or throw an exception. – Legend Nov 26 '09 at 22:38
  • Your first try looks valid. Maybe bug in cobra parser? Sorry, no exact answer as I never used the cobra parser. – BalusC Nov 26 '09 at 22:40
  • Ah... I was hoping that I was wrong so that I can get away with a simple fix :) The XPath expressions work perfectly fine when I test it with the XPath expression Checker in Firefox. – Legend Nov 26 '09 at 22:41
  • I wonder if the parser doesn't produce the correct XML Infoset for this. Try dumping the parsed nodes in `document` as XML, and post the output. – Pavel Minaev Nov 26 '09 at 22:44
  • Let me check the documentation. I don't know how to do that with Cobra yet. You mean the structure of element in my code right? – Legend Nov 26 '09 at 22:47
  • I mean the structure of HTML loaded into memory. Judging by your code, you have it as an object of type `org.w3c.dom.Document`. What I suggest is that you write some code that iterates recursively over all child and attribute nodes in it, and dumps the resulting tree somewhere, so that you can look at it and check that all node relationships are as you expect them to be in the input HTML. I suspect the parser mishandles them somewhere. – Pavel Minaev Nov 26 '09 at 23:00
  • Just updated my post. I guess you were right. It wasn't getting parsed correctly... – Legend Nov 26 '09 at 23:13
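
The recursive dump suggested in the comments above could look roughly like the sketch below. It is not code from the thread: the class name `DomDump` is made up, and the sample is parsed with the JDK's own XML parser as a stand-in, since the Cobra setup is not shown in the question. The `dump` method itself only relies on `org.w3c.dom` interfaces, so it should accept a `Document` from any parser that produces one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Cobra or the JDK.
public class DomDump {

    /** Appends an indented outline of the node and all its descendants to out. */
    static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append("#text \"").append(node.getNodeValue().trim()).append("\"\n");
        } else {
            out.append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes(); // null unless the node is an element
            if (attrs != null) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node attr = attrs.item(i);
                    out.append(" @").append(attr.getNodeName())
                       .append("=\"").append(attr.getNodeValue()).append('"');
                }
            }
            out.append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            dump(children.item(i), depth + 1, out);
        }
    }

    /** Convenience overload that returns the outline as a string. */
    static String dump(Node node) {
        StringBuilder out = new StringBuilder();
        dump(node, 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Fragment of the markup from the question, parsed here as XML (an assumption).
        String xml = "<span class=\"meta entry stub\">Content here"
                   + "<span class=\"shared-content\">Information by "
                   + "<a class=\"title\" data=\"associate\" href=\"/associate\">Associate</a>"
                   + "</span></span>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.print(dump(doc));
    }
}
```

If the `shared-content` span or the `a` element is missing from the outline of the parsed page, the XPath expression has nothing to match and the problem lies in parsing rather than in the query.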

4 Answers

2

XPathVisualizer is a nice, free XPath visualizer tool for Windows that lets you see the results of your XPath queries. It is a single EXE file, so installing it is a simple xcopy.

I took it and ran your query in it, got this result:

[Screenshot of the XPathVisualizer result; image not preserved]

Cheeso
1

@Jherico, @Andrew Keith: I don't know the COBRA HTMLParser, but mixing #PCDATA with child elements (mixed content) is legal XML.
It could be declared like this in a DTD:

<!ELEMENT text_node     (#PCDATA|i|b|u)*>

This is how well-formed HTML can still be legal XML.
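
To check this claim, the declaration can be put in an internal DTD subset and run through the JDK's validating parser. This is only a sketch under a few assumptions: the class name `MixedContentDtd` is made up, and the `i`, `b` and `u` elements are declared as plain `#PCDATA`, which the declaration above does not specify.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the text_node declaration comes from the answer.
public class MixedContentDtd {

    static final String DTD =
          "<!DOCTYPE text_node [\n"
        + "<!ELEMENT text_node (#PCDATA|i|b|u)*>\n"
        + "<!ELEMENT i (#PCDATA)>\n"   // assumed content model
        + "<!ELEMENT b (#PCDATA)>\n"   // assumed content model
        + "<!ELEMENT u (#PCDATA)>\n"   // assumed content model
        + "]>\n";

    /** Returns true if the given root element is valid against the DTD above. */
    static boolean isValid(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        final boolean[] valid = { true };
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                valid[0] = false; // validity errors are reported here, not thrown
            }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return valid[0];
    }

    public static void main(String[] args) throws Exception {
        // Text mixed with declared child elements.
        System.out.println(isValid("<text_node>plain <b>bold</b> and <i>italic</i> text</text_node>"));
        // An undeclared child element.
        System.out.println(isValid("<text_node>plain <span>nested</span></text_node>"));
    }
}
```

The first call should report the document as valid, and the second as invalid because `span` is not declared in the DTD.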

jutky
0

I ran the following code

public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, XPathExpressionException {
    Document doc = XmlUtil.parseXmlResource("/temp.xml");
    for (Node n : XPathUtil.getNodes(doc, "//span[contains(@class, 'body')]")) {
        System.out.println(XPathUtil.getStringValue(doc, "//span[@class='shared-content']/a/@data"));
    }
}

And it output 'associate'. I think your XPath is fine. What is happening instead? And can you remove the empty catch blocks so we can see if you're actually getting exceptions?

Note, XmlUtil and XPathUtil are my own personal convenience functions to eliminate most of the XPath and XML boilerplate code.
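
For readers without those helpers, roughly the same check can be written against the JDK's `javax.xml` APIs alone. This is a sketch rather than the code that was actually run: the class name `SharedContentXPath` is made up, and the markup from the question is inlined as a string instead of being loaded from `/temp.xml`. Exceptions are left to propagate, so a parsing or XPath failure would be visible.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the XmlUtil/XPathUtil helpers.
public class SharedContentXPath {

    // Markup from the question, with whitespace between tags removed.
    static final String MARKUP =
          "<li id=\"eta\" class=\"hentry\">"
        + "<span class=\"body\">"
        + "<span class=\"actions\"></span>"
        + "<span class=\"content\"></span>"
        + "<span class=\"meta entry\">Content here</span>"
        + "<span class=\"meta entry stub\">Content here"
        + "<span class=\"shared-content\">Information by "
        + "<a class=\"title\" data=\"associate\" href=\"/associate\">Associate</a>"
        + "</span></span></span></li>";

    /** Parses the markup as XML and evaluates the expression from the question. */
    static String findData(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList bodies = (NodeList) xpath.evaluate(
                "//span[contains(@class, 'body')]", doc, XPathConstants.NODESET);
        Element body = (Element) bodies.item(0);
        // Same expression and same kind of context node as in the question.
        return (String) xpath.evaluate(
                "//span[@class='shared-content']/a/@data", body, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("INFO: " + findData(MARKUP)); // prints "INFO: associate"
    }
}
```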

Jherico
  • Thanks. I wonder why it's not working here though. There are no exceptions being thrown at all, which makes me wonder where it is going wrong. All it gives me is a blank string. Which library are you using, by the way? – Legend Nov 26 '09 at 23:00
  • The built in Java 5 XML and XPath libraries. – Jherico Nov 26 '09 at 23:06
  • So I'll try to dump Cobra and use the built-in ones... Do you know of any better libraries? – Legend Nov 26 '09 at 23:14
  • Built-in parsers will parse XML, not HTML (they will parse XHTML, since that is an XML dialect, but not any random HTML). – Pavel Minaev Nov 26 '09 at 23:56
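
The limitation mentioned in the last comment is easy to see with a small experiment. The sketch below is an illustration only (the class name `XmlVsHtml` and the sample strings are made up): the JDK's XML parser accepts a self-closed `<br/>` but rejects the unclosed `<br>` that ordinary HTML allows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
public class XmlVsHtml {

    /** Returns true if the JDK's XML parser accepts the markup as well-formed XML. */
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and stays quiet otherwise.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsesAsXml("<p>line one<br/>line two</p>")); // XHTML style: true
        System.out.println(parsesAsXml("<p>line one<br>line two</p>"));  // HTML style: false
    }
}
```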
0

I just ran your code sample as is (copy and paste) and got this output, so everything seems fine. Which Cobra version are you using? I'm on 0.98.4.

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate

Reproducible test(?)

  • Using javac/java version 1.6.0_16 (HotSpot Client: build 14.2-b01, mixed mode, sharing)
  • I downloaded 0.98.4 (cobra-0.98.4.zip) from the Cobra HTML Toolkit download page on SourceForge
  • Extracted js.jar and cobra.jar from cobra-0.98.4.zip:\lib to a directory XXX
  • Wrote XMLTest.java and HTMLTest.java in the same directory (the filenames are links to the source)
  • Ran this to compile (windows): javac -cp .;cobra.jar;js.jar *.java
  • Then executed like this (output included)

XMLTest

java -cp .;cobra.jar;js.jar XMLTest 1

XMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate 

HTMLTest

java -cp .;cobra.jar;js.jar HTMLTest 1

HTMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate
jitter
  • I am using the latest one off the official page which is 0.98.4. That is so strange. I just updated my post saying that the parser was not parsing the entire DOM. Are you using the same HTML parser provided by Cobra? I mean how did you construct the DOM? – Legend Nov 27 '09 at 01:58
  • Check expanded answer. Provided source too (tested with HTML and XML Parsing) – jitter Nov 27 '09 at 17:56