XmlSlurper never finds node

Question

I'm trying to scrape a page with some DOM that looks like this:

<span>text</span>

and sometimes looks like this:

<span><p>text</p></span>

However, I just can't figure out how to get the text in the second scenario. I've tried several methods, and here's what I thought should work:

def html = slurper.parse(reader)
Collection<NodeChild> nodes = html.'**'.findAll { it.name() == 'span' && it.@class == 'style2' }
...
def descriptionNode = html.'**'.find { it.name() == 'span' && it.@class == 'style20' }
def innerNode = descriptionNode.'**'.find { it.name() == 'p' }
def description
if (innerNode?.size() > 0)
{
    description = innerNode.text()
}
else
{
    description = descriptionNode.text()
}

Any idea how I should use XmlSlurper to get the behavior I need?

ataylor · Answer 1 · 2011-01-24T08:08:51.723

It sounds like you want to check if a given span contains a nested p. You can iterate over the span node's children to check for that case. Example:

def xml = """
<test>
  <span>test1</span>
  <span><p>test2</p></span>
  <other><span>test3</span></other>
  <other><span><p>test4</p></span></other>
</test>
"""

def doc = new XmlSlurper().parseText(xml)
def descriptions = []
doc.'**'.findAll { it.name() == 'span' }.each { node ->
    if (node.children().find { it.name() == 'p' }) {
        descriptions << node.p.text()
    } else {
        descriptions << node.text()
    }
}
assert descriptions == ['test1', 'test2', 'test3', 'test4']

score 0 · Answer 2 · edited Nov 18 '12 at 05:10

Have you tried the XPath //span/text()? You might need to query twice to account for the text nested in the p tag.
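XmlSlurper navigates with GPath rather than XPath, so the sketch below assumes the markup is parsed into a W3C DOM and queried through the JDK's javax.xml.xpath package instead. The sample XML and variable names are made up for illustration, and real-world HTML would first have to be turned into well-formed XML. A union expression covers both shapes in one query:

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical sample covering both shapes from the question.
def xml = '''<test>
  <span>test1</span>
  <span><p>test2</p></span>
</test>'''

// Parse into a W3C DOM document (not an XmlSlurper GPathResult).
def doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
def xpath = XPathFactory.newInstance().newXPath()

// The union selects text directly inside a span and text inside a nested p.
def nodes = xpath.evaluate('//span/text() | //span/p/text()', doc, XPathConstants.NODESET)
def texts = (0..<nodes.length).collect { nodes.item(it).nodeValue }

assert texts.sort() == ['test1', 'test2']
```

The result is sorted before comparing so the check does not rely on the order in which the XPath implementation returns the node set.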

answered Jan 24 '11 at 06:08 by Steven · edited Nov 18 '12 at 05:10 by krock

score 0 · Accepted Answer · answered Jan 25 '11 at 02:25

As it turns out, the HTML must have been invalid. TagSoup created

<div>
<span>
</span>
<p></p>
</div>

but Firebug displayed

<div>
<span>
<p></p>
</span>
</div>

What a terrible bug.
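The difference between the two trees explains why the lookup in the question never finds the p node. The sketch below uses plain XmlSlurper on two hand-written, well-formed snippets that mirror the trees above (the class attribute, the text content, and the helper name findNestedP are assumptions added for illustration). It does not run TagSoup itself.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is a default import)

// Hypothetical markup mirroring the two trees described above.
def asParsed    = '<div><span class="style20"></span><p>text</p></div>'
def asDisplayed = '<div><span class="style20"><p>text</p></span></div>'

// Same lookup as in the question: find the span, then search below it for a p.
def findNestedP = { String markup ->
    def root = new XmlSlurper().parseText(markup)
    def span = root.'**'.find { it.name() == 'span' && it.@class == 'style20' }
    span.'**'.find { it.name() == 'p' }
}

def inParsed    = findNestedP(asParsed)
def inDisplayed = findNestedP(asDisplayed)

// In the parsed tree the p is a sibling of the span, so the search comes back empty.
assert inParsed == null
// In the tree the browser displayed, the p is nested and the search succeeds.
assert inDisplayed.text() == 'text'
```

When a lookup like this unexpectedly returns nothing, serializing the parsed tree (for example with groovy.xml.XmlUtil.serialize) and comparing it with what the browser shows is one way to spot this kind of mismatch.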

answered Jan 25 '11 at 02:25 by Stefan Kendall
