How to shred an relate XML generically defined (without distinct element names)

Question

A third party will be delivering XML that I need to shred into SQL Server relational format. The XML is unusual in a few ways

In an apparent attempt to be more flexible, the XML provides a section with "columnNames". The actual data is then provided later without distinct element names, and presumably I need to map the column names based on order.
The "row" section(s) that provide the actual data have no header attribute or other way of associating a set of fields in the row.

Here is a much simplified version of the data:

<ReportData>
    <ColumnName>SOLD_DATE</ColumnName>
    <ColumnName>STORE_NUMBER</ColumnName>
    <ColumnName>PHONE_NUMBER</ColumnName>
    <ColumnName>FAX_NUMBER</ColumnName>
    <Row>
        <Col>03/31/2016</Col>
        <Col>1234</Col>
        <Col>(425) 673-7065</Col>
        <Col>(425) 278-4974</Col>
    </Row>
    <Row>
        <Col>05/05/2016</Col>
        <Col>3456</Col>
        <Col>(425) 555-7065</Col>
        <Col>(425) 444-4974</Col>
    </Row>
</ReportData>

How can I turn that into a table:

Sold_Date   Store_Number    Phone_Number    Fax_Number
3/31/2016   1234            (425) 673-7065  (425) 278-4974
5/5/2016    3456            (425) 555-7065  (425) 444-4974

Thanks for any ideas. I originally considered using OpenXML with an edge table to take advantage of parent/sibling node values, but seems there should be a better way.

Shnugo · Accepted Answer · 2016-07-15T08:36:04.790

If you need your output like this, you need to name your columns dynamically. This is not possible - unless you use dynamic SQL. Try this:

DECLARE @xml XML=
'<ReportData>
    <ColumnName>SOLD_DATE</ColumnName>
    <ColumnName>STORE_NUMBER</ColumnName>
    <ColumnName>PHONE_NUMBER</ColumnName>
    <ColumnName>FAX_NUMBER</ColumnName>
    <Row>
        <Col>03/31/2016</Col>
        <Col>1234</Col>
        <Col>(425) 673-7065</Col>
        <Col>(425) 278-4974</Col>
    </Row>
    <Row>
        <Col>05/05/2016</Col>
        <Col>3456</Col>
        <Col>(425) 555-7065</Col>
        <Col>(425) 444-4974</Col>
    </Row>
</ReportData>';

The only thing you can rely on, is the sort-order within the XML

WITH ColumnNames AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS ColNr
          ,C.value('.','nvarchar(max)') AS Caption
    FROM @xml.nodes('/ReportData/ColumnName') AS A(C)
)
,RowLines AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RowNr
          ,R.query('.') AS RowXML
    FROM @xml.nodes('/ReportData/Row') AS A(R)
)
,RowValues AS
(
    SELECT RowNr
          ,ROW_NUMBER() OVER(PARTITION BY RowNr ORDER BY (SELECT NULL)) AS ValNr
          ,C.value('.','nvarchar(max)') AS ColVal
    FROM RowLines
    CROSS APPLY RowXML.nodes('Row/Col') AS A(C)
)
SELECT * 
INTO #tmpResult
FROM RowValues
INNER JOIN ColumnNames ON ColNr=ValNr
ORDER BY RowNr,ValNr;

The table #tmpResult has now this content:

+-------+-------+----------------+-------+--------------+
| RowNr | ValNr | ColVal         | ColNr | Caption      |
+-------+-------+----------------+-------+--------------+
| 1     | 1     | 03/31/2016     | 1     | SOLD_DATE    |
+-------+-------+----------------+-------+--------------+
| 1     | 2     | 1234           | 2     | STORE_NUMBER |
+-------+-------+----------------+-------+--------------+
| 1     | 3     | (425) 673-7065 | 3     | PHONE_NUMBER |
+-------+-------+----------------+-------+--------------+
| 1     | 4     | (425) 278-4974 | 4     | FAX_NUMBER   |
+-------+-------+----------------+-------+--------------+
| 2     | 1     | 05/05/2016     | 1     | SOLD_DATE    |
+-------+-------+----------------+-------+--------------+
| 2     | 2     | 3456           | 2     | STORE_NUMBER |
+-------+-------+----------------+-------+--------------+
| 2     | 3     | (425) 555-7065 | 3     | PHONE_NUMBER |
+-------+-------+----------------+-------+--------------+
| 2     | 4     | (425) 444-4974 | 4     | FAX_NUMBER   |
+-------+-------+----------------+-------+--------------+

Now we need a dynamically created PIVOT statement:

DECLARE @colNames NVARCHAR(MAX)=
(
    STUFF(
    (
        SELECT DISTINCT ',' + Caption + ''
        FROM #tmpResult
        FOR XML PATH('')
    ),1,1,''
    )
);

DECLARE @cmd NVARCHAR(MAX)=
    'SELECT p.*
    FROM
    (
        SELECT RowNr,ColVal,Caption FROM #tmpResult
    ) AS tbl
    PIVOT
    (
        MAX(ColVal) FOR Caption IN(' + @colNames + ')
    ) AS p;';

EXEC(@cmd);

And this is the result:

+-------+----------------+----------------+------------+--------------+
| RowNr | FAX_NUMBER     | PHONE_NUMBER   | SOLD_DATE  | STORE_NUMBER |
+-------+----------------+----------------+------------+--------------+
| 1     | (425) 278-4974 | (425) 673-7065 | 03/31/2016 | 1234         |
+-------+----------------+----------------+------------+--------------+
| 2     | (425) 444-4974 | (425) 555-7065 | 05/05/2016 | 3456         |
+-------+----------------+----------------+------------+--------------+

Clean up

DROP TABLE #tmpResult;

Wow - thank you all for the creative ideas. Shnugo's solution has an advantage of dynamically handling any number of columns. If the vendor adds new columns, the approach would include the new column in the result set. That solution is sort of what I was thinking about when considering the edge table. But this is better. The piece I was missing to be able to do that using the .nodes method was not realizing I could call row_number over() with (select null) to get a row order without actually sorting! That will be helpful in many regards. Thanks again. — RiceTX, Jul 15 '16 at 18:50

score 1 · Answer 2 · answered Jul 15 '16 at 02:11

Consider querying to the <Row> nodes and then selecting <Col> by node index:

IF OBJECT_ID('tempdb..#demo', 'U') IS NOT NULL drop table #demo;
CREATE TABLE #demo (data xml)
INSERT INTO #demo (data) 
VALUES('<ReportData>
            <ColumnName>SOLD_DATE</ColumnName>
            <ColumnName>STORE_NUMBER</ColumnName>
            <ColumnName>PHONE_NUMBER</ColumnName>
            <ColumnName>FAX_NUMBER</ColumnName>
            <Row>
                <Col>03/31/2016</Col>
                <Col>1234</Col>
                <Col>(425) 673-7065</Col>
                <Col>(425) 278-4974</Col>
            </Row>
            <Row>
                <Col>05/05/2016</Col>
                <Col>3456</Col>
                <Col>(425) 555-7065</Col>
                <Col>(425) 444-4974</Col>
            </Row>
       </ReportData>');

SELECT
    Sold_Date = item.value('(Col)[1]', 'varchar(50)'), 
    Store_Number = item.value('(Col)[2]', 'varchar(50)'), 
    Phone_Number = item.value('(Col)[3]', 'varchar(50)'),
    Fax_Number = item.value('(Col)[4]', 'varchar(50)') 
FROM #demo
CROSS APPLY
  data.nodes('/ReportData/Row') AS dt(item);

-- Sold_Date    Store_Number   Phone_Number     Fax_Number
-- 03/31/2016   1234           (425) 673-7065   (425) 278-4974
-- 05/05/2016   3456           (425) 555-7065   (425) 444-4974

The *Here is a much simplified version of the data:* makes me think, that this is not possible with fix positions ans captions... — Shnugo, Jul 15 '16 at 08:14
And, this is much simpler to follow, but with potential to need to modify if new columns added. — RiceTX, Jul 15 '16 at 19:38

score 0 · Answer 3 · answered Jul 15 '16 at 04:16

Probably there is no elegant solution. At least it is possible to convert original data to something better readable. Here is an example.

declare @x xml=
'<ReportData>
    <ColumnName>SOLD_DATE</ColumnName>
    <ColumnName>STORE_NUMBER</ColumnName>
    <ColumnName>PHONE_NUMBER</ColumnName>
    <ColumnName>FAX_NUMBER</ColumnName>
    <Row>
        <Col>03/31/2016</Col>
        <Col>1234</Col>
        <Col>(425) 673-7065</Col>
        <Col>(425) 278-4974</Col>
    </Row>
    <Row>
        <Col>05/05/2016</Col>
        <Col>3456</Col>
        <Col>(425) 555-7065</Col>
        <Col>(425) 444-4974</Col>
    </Row>
</ReportData>'
declare @data xml

;with nm as (--ColumnNames
select  t.v.value('ColumnName[1]','varchar(20)') c1,
        t.v.value('ColumnName[2]','varchar(20)') c2,
        t.v.value('ColumnName[3]','varchar(20)') c3,
        t.v.value('ColumnName[4]','varchar(20)') c4
from @x.nodes('ReportData') t(v)
)
,ro as (--rows
select t.v.value('Col[1]','varchar(20)') r1, 
       t.v.value('Col[2]','varchar(20)') r2,
       t.v.value('Col[3]','varchar(20)') r3,
       t.v.value('Col[4]','varchar(20)') r4
from @x.nodes('ReportData/Row') t(v)
)
select @data= (select cast('<row '+nm.c1+'="'+ro.r1+'" '
         +nm.c2+'="'+ro.r2+'" '
         +nm.c3+'="'+ro.r3+'" '
         +nm.c4+'="'+ro.r4+'" />' as xml) 
from ro
cross apply nm
for xml  path(''),root)

select @data d

Resulting XML looks like this.

<root>
  <row SOLD_DATE="03/31/2016" STORE_NUMBER="1234" PHONE_NUMBER="(425) 673-7065" FAX_NUMBER="(425) 278-4974" />
  <row SOLD_DATE="05/05/2016" STORE_NUMBER="3456" PHONE_NUMBER="(425) 555-7065" FAX_NUMBER="(425) 444-4974" />
</root>

I like the approach to create a *better XML* but this was easier with XSLT probably. And the OP's *Here is a much simplified version of the data* lets me think, that the original XML might have more columns, maybe not even the same number. And one hint: If the data might inlcude forbidden characters this would break. This can be solved. I posted a solution here: http://stackoverflow.com/a/38369555/5089204 — Shnugo, Jul 15 '16 at 08:05

How to shred an relate XML generically defined (without distinct element names)

3 Answers3