This is a follow-up to the question "Groovy XmlSlurper get value of the node without children".

The answer there explains that, to get the text directly inside an (HTML) node without recursively collecting the text of nested child nodes, one has to use #localText() instead of #text().

For instance, a slightly enhanced example from the original question:

<html>
    <body>
        <div>
            Text I would like to get1.
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>

with the solution applied:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0].localText()

would return:

[Text I would like to get1., Text I would like to get2., Text I would like to get3.]

However, when parsing the <span> part in this example

println htmlParsed.body.span[0].localText()

the output is

[Text I would like to get2., Text I would like to get3.]

The problem I am facing now is that it's apparently not possible to pinpoint the location ("between which child nodes") of the texts. I would have expected the second invocation to yield

[, Text I would like to get2., Text I would like to get3.]

This would have made it clear: Position 0 (before child 0) is empty, position 1 (between child 0 and 1) is "Text I would like to get2.", and position 2 (between child 1 and 2) is "Text I would like to get3." But given the API works as it does, there is apparently no way to determine whether the text returned at index 0 is actually positioned at index 0 or at any other index, and the same is true for all the other indices.

I have tried it with both XmlSlurper and XmlParser, yielding the same results.

If I'm not mistaken, this also means that the original HTML document cannot be fully recreated from the parser's output, because this "text index" information is lost.

My question is: Is there any way to find out those text positions? An answer requiring me to change the parser would also be acceptable.


UPDATE / SOLUTION:

For further reference, here's Will P's answer, applied to the original code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def parser = new XmlParser(tagsoupParser)
def htmlParsed = parser.parseText(stringToParse)

println htmlParsed.body.div[0].children().collect {it in String ? it : null}

This yields:

[Text I would like to get1., null, Text I would like to get2., null, Text I would like to get3.]

One has to use XmlParser instead of XmlSlurper with node.children().
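
As a minimal sketch of how this answers the original <span> case, the snippet below uses a plain XmlParser without TagSoup on well-formed markup. The variable name slots is only illustrative, and the import assumes Groovy 3 or later; on Groovy 2.x the class is groovy.util.XmlParser and needs no import. Whitespace-only text between tags is skipped by XmlParser's default settings, so the leading element shows up at index 0:

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line (groovy.util.XmlParser is auto-imported)

def html = '''<html>
    <body>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>'''

def root = new XmlParser().parseText(html)

// children() mixes element Nodes and text Strings in document order.
// Keep the trimmed text for Strings and use null as a placeholder for elements.
def slots = root.body.span[0].children().collect { it instanceof String ? it.trim() : null }

// The first slot is an element, so the first text clearly sits after child 0.
assert slots == [null, 'Text I would like to get2.', null, 'Text I would like to get3.']
```

The null at index 0 carries the position information that localText() drops.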

SputNick

1 Answer

I don't know TagSoup, and I hope it doesn't interfere with the solution, but with a plain XmlParser you can call children() to get a list that contains the raw strings alongside the element nodes:

html = '''<html>
    <body>
        <div>
            Text I would like to get1.
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>'''

def root = new XmlParser().parseText html

root.body.div[0].children().with {
    assert get(0).trim() == 'Text I would like to get1.'
    assert get(0).getClass() == String

    assert get(1).name() == 'a'
    assert get(1).getClass() == Node

    assert get(2) == '''
            Text I would like to get2.
            '''
}
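
Since children() keeps text and elements in document order, it should also be possible to rebuild the markup from the parsed tree. The following is only a rough sketch: render is a hypothetical helper, the input is a shortened example, special characters are not escaped, and empty elements are written as an open/close pair. The import again assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line

def xml = '<div>one<a href="x">link</a>two<b/>three</div>'
def div = new XmlParser().parseText(xml)

// Hypothetical helper: serialize a node by visiting children() in order.
// Strings are emitted as-is; element Nodes are wrapped in their tag.
String render(node) {
    if (node instanceof String) return node
    def attrs = node.attributes().collect { k, v -> " ${k}=\"${v}\"" }.join('')
    def inner = node.children().collect { render(it) }.join('')
    return "<${node.name()}${attrs}>${inner}</${node.name()}>"
}

assert render(div) == '<div>one<a href="x">link</a>two<b></b>three</div>'
```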
Will
  • That's it! Apparently, it works with XmlParser only, not with XmlSlurper. Thank you. I'll update my question with the solution. I only wish Groovy would document the differences between those two classes more clearly... – SputNick Sep 14 '15 at 17:08