1

I'm trying to get a distinct list of node names from an XML file. I've been successful with a recursive CTE similar to https://stackoverflow.com/a/2274091/1735928, but only with my files below about 1m characters. Above that, the query never seems to return. Some of my files are in the neighborhood of 100m characters.

I've since switched to trying PowerShell. For this samle XML:

<?xml version="1.0" encoding="UTF_8"?>
<root>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>1111111111</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>0</descendant2>
  </childA>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>2</descendant2>
  </childA>
  <childB>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant3>0</descendant3>
  </childB>
  <childC>
    <descendant4>0</descendant4>
  </childC>
  <childC>
    <descendant4>6</descendant4>
  </childC>
</root>

I've gotten as far as:

$xml.childnodes[1].childnodes | select -uniq | foreach { $xml.childnodes[1].($_.name).childnodes.name | select -uniq }

which gives me:

descendant1
descendant2
descendant1
descendant3
descendant4

but that doesn't include the further descendants. Ultimately, I'm trying to get a table back to SQL that looks like this:

root | childA | descendant1
root | childA | descendant1_1
root | childA | descendant1_1_1
root | childA | descendant2
root | childB | descendant1
root | childB | descendant1_1
root | childB | descendant1_1_1
root | childB | descendant3
root | childC | descendant4
SQLFox
  • 290
  • 3
  • 8

2 Answers2

1

Below is a naive solution, assuming you know the depth of your xml. But probably you can do entire thing with xqury and thus do it on SQL side

[xml]$x = "your xml here"

# ------ LEVEL 2 children

$L2 = $x | Select-Xml "//root/*/*"

foreach($n in $L2) { 
 $L1 = $n.node.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}
}

# ------ LEVEL 3 children 


$L3 = $x | Select-Xml "//root/*/*/*"

foreach($n in $L3) { 
 $L1 = $n.node.ParentNode.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}
}

# ------ LEVEL 4 children 


$L4 = $x | Select-Xml "//root/*/*/*/*"

foreach($n in $L4) { 
 $L1 = $n.node.ParentNode.ParentNode.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}

}

Also adding sql xquery version. It still requires knowledge of the structure and only do one level at the time, but it doesnt have cross apply/joins so probably will work better on huge files

select
 T.c.query('local-name(.)') as self
 ,T.c.query('local-name(..)') as parent
 ,T.c.query('local-name(../..)') as Gparent
  ,T.c.query('local-name(../../..)') as GGparent
from @x.nodes('/root/*/*/*/*') T(c)
Mike Twc
  • 2,230
  • 2
  • 14
  • 19
1

If we are talking about hard-coded solutions to this problem, then here is my solution using SQL Server.

DECLARE @x XML = '
<root>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>1111111111</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>0</descendant2>
  </childA>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>2</descendant2>
  </childA>
  <childB>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant3>0</descendant3>
  </childB>
  <childC>
    <descendant4>0</descendant4>
  </childC>
  <childC>
    <descendant4>6</descendant4>
  </childC>
</root>
';

SELECT
  x.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') root,
  L2.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L2,
  L3.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L3,
  L4.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L4,
  L5.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L5
FROM @x.nodes('/*') x(n)
OUTER APPLY x.n.nodes('*') L2(n)
OUTER APPLY L2.n.nodes('*') L3(n)
OUTER APPLY L3.n.nodes('*') L4(n)
OUTER APPLY L4.n.nodes('*') L5(n);

Output

+------+--------+-------------+---------------+-----------------+
| root |   L2   |     L3      |      L4       |       L5        |
+------+--------+-------------+---------------+-----------------+
| root | childA | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childA | descendant2 |               |                 |
| root | childA | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childA | descendant2 |               |                 |
| root | childB | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childB | descendant3 |               |                 |
| root | childC | descendant4 |               |                 |
| root | childC | descendant4 |               |                 |
+------+--------+-------------+---------------+-----------------+
Andrei Odegov
  • 2,925
  • 2
  • 15
  • 21
  • Did need to add a `CROSS APPLY` to unpivot the result, but this got what I needed. I'm still not sure why this approach is much faster than the CTE I started with, but it's a good trade for the dynamic recursion in this application. – SQLFox Jan 09 '19 at 22:04