4

I'm trying to come up with a clean Xpath 1.0 expression in Excel's FILTERXML() function to return nodes with the following requirement:

  • The node must have a sibling (following) that starts with the exact same three characters of itself.

The point is to find out if there is some similarity to a certain degree in the data. Imagine the following sample data:

<t>
    <s>ABCDEF</s>
    <s>GHIJKL</s>
    <s>MNOPQR</s>
    <s>GHISTU</s>
    <s>ABVWXY</s>
</t>

From here, I'd like to return GHIJKL since it's first 3 characters, 'GHI', are found at the start of the second-to-last node.

I've been trying to piece together functions like starts-with(), substring() and count(), yet not been able to get it right. My (obviously wrong) attempt:

//s[count(following::*[starts-with(., substring(<placeholder>, 1,3))]>0]

I'm unsure if it's possible at all and as to what to write instead of the placeholder or how to rework the query to tell the expression I'd like to take the three leftmost characters of each node and test if there are any duplicates in the following ones.

JvdV
  • 70,606
  • 8
  • 39
  • 70

2 Answers2

3

Would following expression work?

//s[substring(., 1, 3) = following::*/substring(., 1, 3)]
Alexandra Dudkina
  • 4,302
  • 3
  • 15
  • 27
  • 1
    Yes. This would work. But unfortunately this requires XPath-2.0 functionality that the author cannot use. – zx485 Jan 25 '22 at 21:51
  • 1
    @zx485 is right here. It looks promising but won't budge in xpath 1.0. Thanks for the suggestion though. – JvdV Jan 26 '22 at 08:07
  • @JvdV Posted an alternative XPath solution as workaround (requiring only a small content substitution). - Alexandra's post is fine; *regret btw that there is no XPath 2.0 functionality in `FilterXML` and MSXML2 yet*. – T.M. Feb 13 '22 at 20:44
1

Requirement (clean Xpath 1.0 expression): The node must have a sibling (following) that starts with the exact same three characters of itself.

As far as I understand your post, you want to get only node contents actually having followers that meet this requirement.

Sort of workaround might be to include abbreviation attributes a to each s node and to use the following XPath expression:

    "//*[substring(.,1,3)=following-sibling::*/@a]"

To make this reproducible, I expanded the original xml content and added a numeric suffix to identify results easily (of course this would need a prior substitution I left aside to focus upon the main issue and to keep it short & simple).

Sub GetNodesHavingSimilarFollowers()
'a) Define wellformed xml content
    Dim content As String
    content = _
    "<t>" & _
        "<s a='ABC'>ABCDEF1</s>" & _
        "<s a='GHI'>GHIJKL2</s>" & _
        "<s a='MNO'>MNONot3</s>" & _
        "<s a='GHI'>GHINot4</s>" & _
        "<s a='ABV'>ABVNot5</s>" & _
        "<s a='ABC'>ABCPQR6</s>" & _
        "<s a='ABC'>ABCSTU7</s>" & _
        "<s a='ABC'>ABCNot8</s>" & _
    "</t>"
'~~~~~~~~~~~~~~~~~~~~~~~~~~
'b) Define XPath expression
'~~~~~~~~~~~~~~~~~~~~~~~~~~
    Dim XPth As String
    XPth = "//*[substring(.,1,3)=following-sibling::*/@a]"

'c) Execute FilterXML
    Dim x: x = Application.FilterXML(content, XPth)
    Select Case VarType(x)
        Case vbError
            Debug.Print "Error: ", x
        Case vbString
            Debug.Print x, "1 element only."
        Case Else
            Debug.Print Join(Application.Transpose(x), "|"), UBound(x) & " elements found."
End Select

End Sub

Example output in VB Editor's immediate window

As nodes 3 and 5 are solitaires and nodes 4 and 8 (though showing the same abbreviation as other previous nodes) have no direct follower, there are only four elements left:

ABCDEF1|GHIJKL2|ABCPQR6|ABCSTU7           4 elements found.
T.M.
  • 9,436
  • 3
  • 33
  • 57
  • 1
    Thanks @T.M. Very clever. It's a strategy I'll keep in mind, however I wanted to avoid any rework at 1st. I'm very curious if we can do this with just xpath steering clear from adjusting the XML. As time goes by I doubt that will happen though. – JvdV Feb 14 '22 at 08:10
  • I'm not too confident that FilterXML supports all possible variants of these simultanous substring comparisons between nodes and node siblings in a reliable way. For instance XPath testings like `"//*[substring(.,1,3)=substring(following::*,1,3)]"` delivered output, but there seem to be internal restraints for a satisfying & logically correct output. @JvdV – T.M. Feb 14 '22 at 20:34
  • This would come even closer to your needs: Under the condition that the xml node values are sorted (more precisely: sorted based at least on the first 3 characters), you could execute the following **XPath expression:** `"//*[substring(.,1,3)=substring(following-sibling::*,1,3)]"` which corresponds to comparing each node with its immediate neighbour (the same as using `"//*[substring(.,1,3)=substring(following-sibling::*[1],1,3)]"`) - @JvdV ` – T.M. Feb 15 '22 at 19:08