
I am trying to parse some HTML using NekoHTML.

The snippet below works fine on the Sun JDK 1.5.0_01 (Eclipse with the Sun JRE). When the same code runs on the IBM J9 VM (build 2.3, J2RE 1.5.0, with IBM RAD as the development environment; full version details are below), it does not work.

NodeList tags = doc.getElementsByTagName("td");

for (int i = 0; i < tags.getLength(); i++) {
    Element elem = (Element) tags.item(i);
    // do something with elem
}

By "works fine" I mean that I get a list of "td" elements that I can process further. On J9 the for loop is never entered.

I am using the latest version of NekoHTML (along with the bundled Xerces jars). The doc in the above code is of type org.w3c.dom.Document (the runtime class used is org.apache.html.dom.HTMLDocumentImpl).

The IBM J9 details are as follows:

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pwi32devifx-20070323 (ifix 117674: SR4 + 116644 + 114941 + 116110 + 114881))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Windows XP x86-32 j9vmwi3223ifx-20070323 (JIT enabled)
J9VM - 20070322_12058_lHdSMR
JIT  - 20070109_1805ifx3_r8
GC   - WASIFIX_2007)
JCL  - 20070131

Any ideas, suggestions or workarounds are appreciated. Thanks.

Favonius
  • *not entering the for loop* - does that mean, `tags` is an empty NodeList or do you get an exception? – Andreas Dolk Dec 21 '10 at 09:30
  • @Andreas: Yes `tags` is an empty NodeList. In case of an exception either it would have caught in the `try-catch` block (not posted as part of snippet) or shown on the console. – Favonius Dec 21 '10 at 09:52

1 Answer

I have two ideas.

  1. I have just verified that Xerces is part of the JRE installation, so I believe it reaches your application's classpath from there. Sun and IBM probably ship different versions of Xerces. As a first step, check this, and try replacing the version you have under IBM with Sun's. If that helps, you have two options: keep running IBM Java with Sun's Xerces, or keep investigating what is wrong with IBM's Xerces.
  2. Are there other differences between your development and production environments? Are they the same operating system? Could it be that you are using, for example, Windows for development and Unix for production, while your XML was written on Windows with \r\n line endings? Further, if your XML contains Unicode characters and was written on Windows, it may start with a special invisible prefix (a byte order mark) that indicates the file is Unicode. This prefix may cause the parser to fail.
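Both ideas can be checked with a few lines of plain JDK code. The following is a minimal sketch, not part of the original answer: the class name `ParserDiagnostics` and its two helpers are hypothetical names chosen for illustration. `codeSourceOf` reports the jar or directory a class was loaded from (classes loaded by the JRE's bootstrap loader usually have no code source), and `hasUtf8Bom` tests for the three-byte UTF-8 byte order mark.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, for illustration only.
class ParserDiagnostics {

    // Returns the location (jar or directory) a class was loaded from.
    // Classes from the JRE's bootstrap loader typically have no code source.
    static String codeSourceOf(Class<?> type) {
        ProtectionDomain domain = type.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "bootstrap/JRE (no code source)";
        }
        return source.getLocation().toString();
    }

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data != null
                && data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws Exception {
        // Which JAXP factory implementation did this JVM pick, and from where?
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println(factory.getClass().getName()
                + " <- " + codeSourceOf(factory.getClass()));

        // Which Document implementation does that factory produce?
        Document doc = factory.newDocumentBuilder().newDocument();
        System.out.println(doc.getClass().getName()
                + " <- " + codeSourceOf(doc.getClass()));
    }
}
```

Running it under each JVM and comparing the output shows whether the parser classes come from the jars you added or from the JRE itself. The same `codeSourceOf` call can be applied to `doc.getClass()` for the `doc` in the question.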
AlexR
  • +1 Thanks for the answer. **For your first point**, I am adding the Xerces jar as an external jar in my application, so from your answer I am not sure whether it is picking up the default JRE version or the jar I have added. **For the second point**, the OS is the same in both cases, so no problem on that front. – Favonius Dec 21 '10 at 13:30
  • Yup. It was a classpath issue. In my app I have heavily modified NekoHTML for performance (mostly the `AbstractDomParser` class was changed). On IBM J9 it was picking up the default implementation in jre/lib/xml.jar. Setting the property `fConfiguration.setProperty(DOCUMENT_CLASS_NAME,"org.apache.html.dom.HTMLDocumentImpl");` solved the problem. Thanks. – Favonius Dec 24 '10 at 05:15