I'm parsing HTML and trying to get the full, unparsed content (markup included) of one particular node.

HTML example:

<html>
    <body>
        <div>Hello <br> World <br> !</div>
        <div><object width="420" height="315"></object></div>
    </body>
</html>

Code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0]

However, this returns only the text for the first node and an empty string for the second. Question: how can I retrieve the value of the first node so that I get:

Hello <br> World <br> !
MeIr

1 Answer

This is what I used to get the content from the first div tag (omitting xml declaration and namespaces).

Groovy

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.*

def html = """<html>
    <body>
        <div>Hello <br> World <br> !</div>
        <div><object width="420" height="315"></object></div>
    </body>
</html>"""

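def parser = new Parser()
// Turn off namespace processing so the output has no namespace prefixes or declarations
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def root = new XmlSlurper(parser).parseText(html)
// bindNode serializes the selected node, including its child elements, back to markup
println new StreamingMarkupBuilder().bindNode(root.body.div[0]).toString()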

Gives

<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>

N.B. Unless I'm mistaken, TagSoup is adding the closing tags (and the clear='none' attribute), since it normalizes the HTML into well-formed XML. If you literally want Hello <br> World <br> !, you might have to use a different library (maybe regex?).

I know it's including the div element in the output... is this a problem?
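If the wrapping div is a problem, one option is to post-process the serialized string. The following is only a rough sketch, not a feature of StreamingMarkupBuilder: `stripOuterTag` is a hypothetical helper name, the input string is simply copied from the output shown above (so the snippet runs without TagSoup), and the regexes assume the fragment starts and ends with the element being stripped.

```groovy
import java.util.regex.Pattern

// Hypothetical helper: removes the outermost opening and closing tag
// from an already-serialized fragment. Assumes the string begins with
// <tag ...> and ends with </tag>.
String stripOuterTag(String xml, String tag) {
    def q = Pattern.quote(tag)
    def withoutOpen = xml.replaceFirst('^<' + q + '(\\s[^>]*)?>', '')
    return withoutOpen.replaceFirst('</' + q + '>$', '')
}

// Serialized output copied from the answer above
def serialized = "<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>"

// Drop the surrounding div
def inner = stripOuterTag(serialized, 'div')
println inner

// Optionally collapse the normalized <br ...></br> pairs back to a bare <br>
def literal = inner.replaceAll(/<br[^>]*><\/br>/, '<br>')
println literal
```

Regex-based cleanup like this is fragile for arbitrary HTML (nested or differently formatted tags would need more care), so treat it as a workaround for simple fragments only.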

Nick Grealy
  • Yeah, I would like not to include the 'div'. If you can find a way, that would be great!!! – MeIr Apr 08 '15 at 12:16
  • All I've got at the moment is ...`.toString().replaceAll(/^<div>|<\/div>$/, "")`. Not sure if there's a way using the StreamingMarkupBuilder. – Nick Grealy Apr 08 '15 at 23:29