I'm parsing HTML and trying to get the value of a parent node itself, without the values of its child nodes.

HTML example:

<html>
    <body>
        <div>
             <a href="http://intro.com">extra stuff</a>
             Text I would like to get.
             <a href="http://example.com">link to example</a>
        </div>
    </body>
</html>

Code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0]

However, the above code returns:

extra stuff Text I would like to get. link to example

How can I get only the parent node's own value, without its children? Expected output:

Text I would like to get.

P.S. I tried stripping out the extra elements with substring, but that proved unreliable.

MeIr

2 Answers

If you switch to using XmlParser instead of XmlSlurper, you can do:

println htmlParsed.body.div[0].localText()[0]

This assumes you are on Groovy 2.3 or later, where localText() is available.
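For a self-contained check of the XmlParser approach, here is a minimal sketch. It assumes Groovy 2.3 to 3.x, where XmlParser resolves without an import (on Groovy 4, add `import groovy.xml.XmlParser`), and it parses a well-formed copy of the sample markup directly instead of going through TagSoup. The variable names `xml`, `root` and `text` are illustrative only.

```groovy
// Well-formed copy of the sample markup from the question.
def xml = '''<html>
    <body>
        <div>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get.
            <a href="http://example.com">link to example</a>
        </div>
    </body>
</html>'''

// XmlParser builds groovy.util.Node objects; the root node is <html>.
def root = new XmlParser().parseText(xml)

// localText() returns only the text held directly by the div,
// skipping the text inside the child <a> elements.
// trim() guards against surrounding whitespace being kept.
def text = root.body.div[0].localText()*.trim().join(' ')

println text
```

Joining the list rather than taking element `[0]` also covers the case where the div's own text is split into several pieces by its child elements.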

tim_yates
  • Tried and here is exception: No signature of method: groovy.util.slurpersupport.NodeChildren.localText() – MeIr Apr 14 '15 at 14:05
  • Fair enough...thought it was Parser only :-) – tim_yates Apr 14 '15 at 14:11
  • I wish I could switch to Parser but everything is written using Slurper. – MeIr Apr 14 '15 at 14:14
  • @MeIr that looks like a Groovy version issue, not a Parser/Slurper issue... What version of Groovy are you on? – tim_yates Apr 14 '15 at 14:32
  • @MeIr I *think* you can use this kind of workaround: `htmlParsed.body.div[0].nodeIterator().collect().find().@children.findAll { it instanceof String }.join()` if you can't upgrade version... – tim_yates Apr 14 '15 at 14:38

No need to switch to XmlParser, just cast the first div to NodeChild:

// Groovy 2.x package; in Groovy 3+ these classes live in groovy.xml.slurpersupport
import groovy.util.slurpersupport.GPathResult
import groovy.util.slurpersupport.NodeChild

def html = new XmlSlurper().parseText(xml)
def text = (html.body.div.first() as NodeChild).localText().first()

// Using @CompileStatic:
GPathResult html = new XmlSlurper().parseText(xml)
GPathResult div = html["body"]["div"]
String text = (div.first() as NodeChild).localText().first()

lepe