I have used XPath extensively in the past. Currently I am facing a problem that I am unable to solve.

Constraints

  • pure XPath 1.0
  • no aux-functions (e.g. no "concat()")

HTML-Markup

<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>

Challenge

I want to extract the three coherent strings:

  • Peter: Lorem Impsum
  • Paul Smith: Foo Bar BAZ
  • Mary: One Two Three

XPath

The following XPath queries are the best I've come up with after HOURS of research:

XPath-query 1

//span[contains(@class, "container")]

=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three

XPath-query 2

//span[contains(@class, "container")]//text()

=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three

Problem

Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.

Is it possible to integrate some "artificial separators" between the text-nodes?

Mads Hansen
NetWurst
  • What is the issue with the second XPath? Is it being flattened into a single string? You might want to post the relevant PHP used to execute the XPath and get the results. The issue might be in what you are doing with the results. – Mads Hansen Sep 02 '18 at 21:22
  • https://secure.php.net/manual/en/simplexmlelement.xpath.php `$result = $xml->xpath('//span[contains(@class, "container")]//text()'); while(list( , $node) = each($result)) { echo 'text() ',$node,"\n"; }` – Mads Hansen Sep 02 '18 at 21:24
  • If it's pure XPath 1.0 then there is a concat() function. If there is no concat() function then it is not pure XPath 1.0. Which is it? – Michael Kay Sep 03 '18 at 07:34
  • Thanks Mads and Michael for your answers! – NetWurst Sep 03 '18 at 14:00
  • @MadsHansen: you are totally right: I have been using the "wrong glue" between my text-nodes! ;-) – NetWurst Sep 03 '18 at 14:01
  • @Michael Kay: since concat-calls do not work (result string is always empty) in my system, I would not call it pure XPath 1.0 anymore... ;-) – NetWurst Sep 03 '18 at 14:02
  • @NetWurst if I called concat() and got an unexpected result, I would assume I had done something wrong. – Michael Kay Sep 03 '18 at 15:23

1 Answer

You're expecting too much from XPath 1.0. XPath 1.0, itself, can help you here to select

  1. a string, or
  2. a set of text nodes

Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).

To understand the limits you're running into, note that your first XPath,

//span[contains(@class, "container")]

selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:

Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three

But be clear: Your XPath is selecting a nodeset of span elements, not strings here.

Your second XPath,

//span[contains(@class, "container")]//text()

selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
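The difference between "one string value" and "a set of text nodes" can be made concrete outside of PHP as well. The following is only a minimal sketch in Python, under some loud assumptions: the standard-library ElementTree module supports just a subset of XPath (no contains(), no text()), so an exact attribute match stands in for the predicate and the text nodes are read from .text/.tail rather than selected by XPath. The wrapping body element and all variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <body> element so that
# the span is a descendant of the parsed root.
MARKUP = """<body><span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span></body>"""

root = ET.fromstring(MARKUP)

# Assumption: an exact class match replaces contains(@class, "container"),
# because ElementTree's XPath subset has no contains().
span = root.find(".//span[@class='container']")

# Rough analogue of query 1: the string value of the span element, i.e. all
# descendant text concatenated into ONE string.
string_value = "".join(span.itertext())

# Rough analogue of query 2: the individual text nodes. ElementTree keeps
# them as span.text plus the .tail of each child element.
text_nodes = [span.text] + [child.tail for child in span]

# The remaining work happens outside XPath: trim each node and skip any
# that are missing or whitespace-only.
chunks = [t.strip() for t in text_nodes if t and t.strip()]

print(repr(string_value))
for chunk in chunks:
    print(chunk)
```

The point of the sketch is that the three chunks stay separate only as long as the host code treats the result as a list of nodes; once they are concatenated, the boundaries are gone.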

If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,

//span[contains(@class, "container")]/text()/string()

or you could join them,

string-join(//span[contains(@class, "container")]/text(), "|")

and directly get

Peter: Lorem Impsum
|
Paul Smith: Foo Bar BAZ
|
Mary: One Two Three

or

string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")

to get

Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
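With XPath 1.0 only, the same result has to be produced in the host language after the text nodes have been selected. As a rough sketch (in Python rather than PHP, and with the raw text-node values hard-coded as an assumption instead of coming from a real query), the last expression could be imitated like this. The helper name normalize_space is hypothetical and only mimics what XPath's normalize-space() does.

```python
import re

# Assumed raw string values of the three text nodes, whitespace included, as
# an XPath 1.0 query such as //span[contains(@class, "container")]/text()
# would hand them to the host language.
raw_text_nodes = [
    "\n    Peter: Lorem Impsum\n    ",
    "\n    Paul Smith: Foo Bar BAZ\n    ",
    "\n    Mary: One Two Three\n",
]


def normalize_space(value: str) -> str:
    """Mimic XPath normalize-space(): collapse runs of space, tab, CR and LF
    into a single space, then strip the leading and trailing space."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


# Host-language stand-in for string-join(.../normalize-space(), "|").
joined = "|".join(normalize_space(value) for value in raw_text_nodes)

print(joined)
# Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
```

Any separator works in place of "|", as long as it cannot occur inside the text chunks themselves.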
kjhughes
  • Thank you, kjhughes, for your comprehensive response! The problem was: in my PHP-code I was "glueing" the text-nodes directly together, using a space " ". Now I have solved the problem by using " | " as a glue in my PHP-script, for this query/situation only. – NetWurst Sep 03 '18 at 14:00