3

I'm trying to parse a non-well-formed HTML page (the Eclipse download site) with XmlSlurper. The W3C validator shows several errors in the page.

I tried the fault-tolerant parser from this post:

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

// Getting the xhtml page thanks to Neko SAX parser 
def mirrors = new XmlSlurper(new SAXParser()).parse("http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz")    

mirrors.'**' // GPath shortcut: all nodes, depth-first

Unfortunately, it looks like not all content is parsed into the XML object. The faulty subtrees are simply ignored.

E.g. mirrors.depthFirst().find { it.text() == 'North America' } returns null instead of the H4 element in the page.
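One way to narrow this down (a diagnostic sketch, not part of the original question) is to check whether the text is present in the raw response at all, before it ever reaches the parser — that distinguishes a parsing problem from a download problem:

```groovy
// Diagnostic sketch: fetch the raw page text and check for the missing
// string before parsing. If it is absent here, the parser is not to blame.
def url = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def raw = url.toURL().text
println raw.contains('North America')
```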

Is there some robust way to parse any HTML content in groovy?

allprog

2 Answers

8

With the following piece of code, the page is parsed without errors:

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

However, I don't know exactly which elements you'd like to find.

Here, the All mirrors link is found:

page.depthFirst().find { 
    it.text() == 'All mirrors'
}.@href

EDIT

Both of the following outputs are null:

println page.depthFirst().find { it.text() == 'North America'}

println page.depthFirst().find { it.text().contains('North America')}

EDIT 2

Below you can find a working example that downloads the file and parses it correctly. I used wget to download the file (there's something wrong with downloading it from Groovy directly; I don't know what).

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()

def cmd = ['wget', host, '-O', temp.absolutePath].execute()
cmd.waitFor()
assert cmd.exitValue() == 0 // fail fast if the download did not succeed

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}

EDIT 3

And finally the problem is solved. Using Groovy's url.toURL().text causes problems when no User-Agent header is specified. Now it works correctly and the elements are found, with no external tools used.

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))

assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}
Opal
  • Can you please try this for me: mirrors.depthFirst().find { it.text() == "North America"} or this mirrors.depthFirst().find { it.text().contains("North America")} – allprog Jan 23 '15 at 15:01
  • No idea why You can't try it yourself. I did it, both outputs are null. It seems that You had problems with parsing - it's now parsed well. If so, please accept and upvote. – Opal Jan 24 '15 at 09:51
  • I'm sorry if you feel offended. I tried to be as polite as possible. I tried those myself and came to the same conclusion. But "North America" is present in the original HTML code inside a H3 element. So this is an example for your comment "I don't know which elements exactly You'd like to find". Yes, the parsing probably fails, so unfortunately, this is not the real answer yet. I need something more robust. – allprog Jan 24 '15 at 22:03
  • Ok, that's not the problem. The problem is that you didn't clarify the question well. I thought it was a parser problem, not that it's impossible to find an element. And this isn't a problem with the parser itself. The downloaded page simply doesn't contain such an element - it seems it's not downloaded fully. – Opal Jan 25 '15 at 19:02
  • That was weird, great finding. – allprog Jan 25 '15 at 19:40
  • For those who use gradle/maven/whatever: http://mvnrepository.com/artifact/net.sourceforge.nekohtml/nekohtml – slim Dec 09 '15 at 18:53
4

I am fond of the tagsoup SAX parser, which says it's designed to parse "poor, nasty and brutish" HTML.

It can be used in conjunction with XmlSlurper quite easily:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())

def page = parser.parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}    

This results in non-null output.
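If the printlns still show null in your environment, the cause may be the download rather than TagSoup itself — the other answer found that this server misbehaves when no User-Agent header is sent. A variant that fetches the text with an explicit User-Agent before handing it to TagSoup might look like this (a sketch combining both findings, not from the original answer):

```groovy
// Sketch: fetch the page with an explicit User-Agent header, then parse
// the text with TagSoup. The header works around the server returning a
// truncated page to clients without one.
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def url = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def html = url.toURL().getText(requestProperties: ['User-Agent': 'Non empty'])
def page = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println page.depthFirst().find { it.text().contains('North America') }
```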

aldrin
bdkosher
  • Hmm, I tried both on linux and MacOS with brand new install of groovy 2.4. The `println`s showed `null` for both lines. Do you have any guess how can this happen? I simply copy-pasted to groovyConsole and tried the command line groovy too. The result is always `null` – allprog Jan 25 '15 at 14:11
  • I was using Groovy Console 2.3.7. Maybe it's an internal proxy server coming back with an authentication challenge, perhaps. Does `new URL('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz').text` return the text you expect? – bdkosher Jan 25 '15 at 21:26
  • No, it doesn't. It was a problem with missing `User-agent` header. – Opal Jan 27 '15 at 12:58