With the following piece of code it's getting parsed well (without errors):
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')
However I don't know which elements exactly You'd like to find.
Here All mirrors
are found:
page.depthFirst().find {
it.text() == 'All mirrors'
}.@href
EDIT
Both outputs are null
.
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
EDIT 2
Below You can find a working example that downloads the file and parses it correctly. I used wget
to download the file (there's something wrong with downloading it with groovy - don't know what)
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()
def cmd = ['wget', host, '-O', temp.absolutePath].execute()
cmd.waitFor()
cmd.exitValue()
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
EDIT 3
And finally problem solved. Using groovy's url.toURL().text
causes problems when no User-Agent
header is specified. Now it works correctly and elements are found - no external tools used.
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))
assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}