1

I am on SQL Server 2012/SQL Server 2016.

The Xquery function 'modify' does only change one row at once. So I loop through my XML-data, element-by-element. This is quite slow. When there is only one distinct value to change I would cast the XML and replace the value. But there are multiple distinct values in most of the cases (rows).

I hope the structure of my_xml becomes clear from the (stronly simplified) data: There are multiple labels within a group of element (a and b).

--My data

if object_id ('tempdb..#t1') is not null drop table #t1

create table #t1 (id int not null identity (1,1) unique clustered, my_xml xml)
insert #t1 (my_xml) values 
     ('<root><a label=''name A'' /> <b /> </root>'),        --A
     ('<root><a label=''name A'' /> <b label=''name B'' /> </root>'),   --AB
     ('<root><a label=''name A'' /> <b label=''name B'' /> <a label=''name A'' /> <b label=''name B'' /><a label=''name A'' /> <b label=''name B'' /></root>'), --ABABAB
     ('<root><a label=''name B'' /> <b label=''name A'' /> <a label=''name B'' /> <a label=''name B'' /> <a label=''name B'' /> </root>') --BABBB

select * from #t1

I would like to change the value of the label to the SHA-Hash of the original value.

I query some information about the structure of my_xml to minimize the rows to modify:

--My helper-table    

if object_id ('tempdb..#t2') is not null drop table #t2

select 
     id, 
     max(my_xml.value('count(/root/*[self::a or self::b]/@label)', 'int')) max_label,
     max(my_xml.value('count(distinct-values(/root/*[self::a or self::b]/@label))', 'int')) max_label_distinct
into #t2
from #t1
group by id

select * from #t2

Then I modify XML (take only those rows with more than one distinct label as only with them I need to loop through the XML)

 --My query:

 set nocount on
 declare 
     @label         char(64), 
     @id                int,
     @n_max_label   smallint

 declare test cursor for 

     select 
         id, 
         max_label 
     from #t2
     where max_label_distinct > 1

 open test
 fetch next from test into @id, @n_max_label
 while (@@FETCH_STATUS <> -1)
 begin

     while @n_max_label > 0 
     begin

         select @label = isnull(convert(char(64), HASHBYTES('SHA2_256', my_xml.value('((/root/*[self::a or self::b]/@label)[sql:variable("@n_max_label")])[1]', 'char(64)')), 1), 'x') 
         from #t1
         where id = @id

         update #t1
         set my_xml.modify('replace value of ((/root/*[self::a or self::b]/@label)[sql:variable("@n_max_label")])[1] with sql:variable("@label")')
         where id = @id

         set @n_max_label = @n_max_label - 1

     end

     fetch next from test into @id, @n_max_label
 end
 close test
 deallocate test

final result:

 select * from #t1

I would like to improve the solution by the factor of 10 to 100 (in terms of speed) as by now I would need about 2 month to calculate the result - what is to long to manipulate about 600 GB of data. (unloading and using other tools is no option)

Christian4145
  • 513
  • 1
  • 9
  • 31
  • You could implement the replace-all-attribute-values-with-SHA-hash as a CLR function, converting the XML to `NVARCHAR(MAX)` first and passing that. This sidesteps all XML processing (and cursors) in T-SQL, which I think is going to be necessary if you're aiming for that big of a speedup. The cursor can probably be avoided by exploding the XML with `CROSS APPLY` and then gluing it back together again (somehow making sure to keep the unchanged elements), but I don't fancy the mechanics necessary for that, and it'll still be slow. – Jeroen Mostert Mar 26 '18 at 12:09
  • @Jeroren Mostert: Thanks for the ideas! Unfortunately I can't use CLR for security reasons. I tried CROSS APPLY but did not suceed with XML.modify passing through the SQL:variable. I don't know if there is a way to do that. – Christian4145 Mar 26 '18 at 12:18
  • You reduced your example - that's great, With this it might be possible to shred and re-create the XML, but I'm afraid, this is to simple... Are there always two different lable names involved? Are the values `name A` and `name B` always the same (per label)? Might such a value occur within the XML on a different place too? I'm aiming to: Read all target label values and replace them with a hash value on string level (`REPLACE`)... – Shnugo Mar 26 '18 at 12:21
  • In this case I'd suggest to provide an original example of your XML. Just reduce it, take away multi nodes, where they do not add something to the structure. Whereever you have *many*, reduce it to *two of them*. – Shnugo Mar 26 '18 at 12:24
  • @Shnugo: I have about 200k different names (like name A and name B). The origional XML is between 10 and 100 kByte (so quite long). I have 500 Mio XMLs. So quite a number of variations. But: The value to be hashed is always inside an attribute 'label'. The places of the element where 'label' can occur are two pathes (like root/a and root/b). The original XML is much more complex (Namespace, structure...), so I don't think it will help a lot. – Christian4145 Mar 26 '18 at 13:03

1 Answers1

1

As you've found out .modify() will change one single value at a time. This is slow and needs procedural (CURSOR) logic.

You did not provide a closer look onto your original XML so this might point to the wrong direction, but you might try this:

Your new helper table

WITH cte AS
(
    select 
          id 
         ,A.AllLabels.value('local-name(..)','nvarchar(max)') AS ElementName
         ,A.AllLabels.value('local-name(.)','nvarchar(max)') AS LabelName
         ,A.AllLabels.value('.','nvarchar(max)') AS LabelValue
         ,HASHBYTES('SHA2_256', A.AllLabels.value('.','nvarchar(max)')) AS LabelHash
    from #t1
     --use "/root/" instead of "//" if always below <root>
    cross apply my_xml.nodes('//*[local-name()="a" or local-name()="b"]/@label') AS A(AllLabels)
)
SELECT id
      ,ElementName
      ,LabelName
      ,LabelValue
      ,LabelHash
      ,(SELECT LabelHash FOR XML PATH(''),TYPE).value('.','nvarchar(max)') AS HashBase64
      ,CONCAT(' ',LabelName,'="',LabelValue,'"' ) AS ReplaceThis
      ,CONCAT(' ',LabelName,'="',(SELECT LabelHash FOR XML PATH(''),TYPE).value('.','nvarchar(max)'),'"' ) AS WithThat
INTO #t2
FROM cte
GROUP BY id,ElementName,LabelName,LabelValue,LabelHash;

SELECT * FROM #t2;

The result

+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| id | ElementName | LabelName | LabelValue | LabelHash                                                          | HashBase64                                   | ReplaceThis    | WithThat                                             |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 1  | a           | label     | name A     | 0x6F5A56BD0B5E367787D4D3F798F5467C17FB798C0849ECF88373A7346EB405CB | b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs= | label="name A" | label="b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 2  | a           | label     | name A     | 0x6F5A56BD0B5E367787D4D3F798F5467C17FB798C0849ECF88373A7346EB405CB | b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs= | label="name A" | label="b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 3  | a           | label     | name A     | 0x6F5A56BD0B5E367787D4D3F798F5467C17FB798C0849ECF88373A7346EB405CB | b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs= | label="name A" | label="b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 4  | b           | label     | name A     | 0x6F5A56BD0B5E367787D4D3F798F5467C17FB798C0849ECF88373A7346EB405CB | b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs= | label="name A" | label="b1pWvQteNneH1NP3mPVGfBf7eYwISez4g3OnNG60Bcs=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 2  | b           | label     | name B     | 0xC8144356A508FA516CC263E0839B7D22811A6F8333851F5319E778D94FE5B0C3 | yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM= | label="name B" | label="yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 3  | b           | label     | name B     | 0xC8144356A508FA516CC263E0839B7D22811A6F8333851F5319E778D94FE5B0C3 | yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM= | label="name B" | label="yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+
| 4  | a           | label     | name B     | 0xC8144356A508FA516CC263E0839B7D22811A6F8333851F5319E778D94FE5B0C3 | yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM= | label="name B" | label="yBRDVqUI+lFswmPgg5t9IoEab4MzhR9TGed42U/lsMM=" |
+----+-------------+-----------+------------+--------------------------------------------------------------------+----------------------------------------------+----------------+------------------------------------------------------+

You can either use this to try a REPLACE on string base using the columns ReplaceThis and WithThat or you can use .modify within a CURSOR

Some things to keep in mind:

  • Within XML a binary value should be Base64 rather than a hex-string (normally)
  • You can do the multi replace in one single go using a quirky update UDF (like provided here, modify it to your needs)
  • If you do not have to look at the element containing your @label you can omit this column in your helper table to reduce the rows.
  • With thousands of data I'd suggest to use indexes on your helper table
  • You must be aware, that white-space can blow this up (label = "blah" is not the same as label="blah" on string level)
  • Same with single vs. double quotes
  • If you go with .modify() you can build a statement within your CURSOR dynamically and use EXEC(@YourStatement), but I doubt this will be fast...
Shnugo
  • 66,100
  • 9
  • 53
  • 114