Groovy - NekoHTML Sax parser

Question

I am having a hard time with my NekoHTML parser. It works fine on URLs, but when I test it on a simple XML string, it does not read it properly.

Here is how I declare it:

def createAndSetParser() { 
  SAXParser parser = new SAXParser()  //Default Sax NekoHTML parser 
  def charset = "Windows-1252"  // The encoding of the page 
  def tagFormat = "upper"    // Ensures all the tags are consistently written, by putting all of them in upper-case. We can choose "lower", "upper" or "match" 
  def attrFormat = "lower"  // Same thing for attributes. We can choose "upper", "lower" or "match" 

  Purifier purifier = new Purifier()     //Creating a purifier, in order to clean the incoming HTML 
  XMLDocumentFilter[] filter = [purifier] //Creating a filter, and adding the purifier to this filter. (NekoHTML feature) 

  parser.setProperty("http://cyberneko.org/html/properties/filters", filter) 
  parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset) 
  parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat) 
  parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat) 
  parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)    // Forces the parser to use the charset we provided. 
  parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)    // Leaves the doctype as it is. 
  parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)     // To make sure no namespace is added or overridden. 
  parser.setFeature("http://cyberneko.org/html/features/balance-tags", true) 

  return new XmlSlurper(parser)   // A Groovy parser that evaluates GPath queries lazily rather than building a full DOM tree. 
}

Again, it works fine when I use it on websites. Any idea why it does not work on simple XML text samples?

Any help greatly appreciated :)

What does it do when it fails? Is there a stack trace? What is an example of an XML document that it fails on? More information, please. — Nathan Hughes, Aug 16 '11 at 16:16
Sorry for answering so late, and thank you for your answer. The parsing does not crash, and there is no stack trace. It is just not parsing properly. For example, if I give the following sample: <house><room>bedroom</room><room>kitchen</room></house> the document path (the path equivalent of the document node) is actually the text "bedroom". So my problem is that the parsing is not initialized properly, which prevents me from doing what I want. If you have any idea of what might be wrong, I am listening :) — Alexandre Bourlier, Aug 25 '11 at 13:37

stefanglase · Answer 1 · 2011-12-20T20:09:00.477

To try it out easily, I made your script executable in the Groovy Console, using Grape to fetch the required NekoHTML library from the Maven Central repository.

@Grapes(
  @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
)

import groovy.xml.StreamingMarkupBuilder
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier

def createAndSetParser() { 
  SAXParser parser = new SAXParser()
  parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
  parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
  parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
  parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
  parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
  parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
  parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
  parser.setFeature("http://cyberneko.org/html/features/balance-tags", true) 
  return new XmlSlurper(parser)
} 

def printResult(def gPathResult) {
  println new StreamingMarkupBuilder().bind { out << gPathResult }
} 

def parser = createAndSetParser()

printResult parser.parseText('<html><body>Hello World</body></html>')
printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')

Executed this way, the two printResult statements produce the output shown below, which explains your problem parsing the XML string: it is wrapped in <html><body>...</body></html> tags and loses its root tag, <house>:

<HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
<HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>

All this is caused by the http://cyberneko.org/html/features/balance-tags feature, which you enabled in your script. If I disable this feature (it must be explicitly set to false because it defaults to true), the result looks like this:

<HTML><BODY>Hello World</BODY></HTML>
<HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>
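
For reference, here is a minimal sketch of the script with that single change applied. It keeps the configuration from above and only flips balance-tags to false. The helper name createXmlFriendlyParser is made up for illustration, and the explicit XmlSlurper import assumes Groovy 3 or later (on Groovy 2.x, groovy.util.XmlSlurper is imported by default, so that import can be dropped).

```groovy
// Sketch only: same settings as the script above, except balance-tags.
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; not needed on Groovy 2.x

// Hypothetical helper name, not part of the original script.
def createXmlFriendlyParser() {
  SAXParser parser = new SAXParser()
  parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
  parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
  parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
  parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
  parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
  parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
  parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
  // The one change: do not let NekoHTML wrap the input in HTML/BODY elements.
  parser.setFeature("http://cyberneko.org/html/features/balance-tags", false)
  return new XmlSlurper(parser)
}

def house = createXmlFriendlyParser().parseText('<house><room>bedroom</room><room>kitchen</room></house>')

// Element names are upper-cased by the names/elems setting, so query ROOM, not room.
println house.name()
println house.ROOM.collect { it.text() }
```

Note that with balance-tags disabled NekoHTML no longer repairs unbalanced markup, so the same parser may behave differently on real web pages. If the input is known to be well-formed XML, a plain new XmlSlurper() without NekoHTML is probably the simpler choice, and it keeps the element names in their original case.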
