1

I have a really large and not to pretty XML which I want to import in my sql-server database. The format of the XML is like I said really ugly:

<myxml xmlns="http://somenamespace.whatever.com/schemas/xmldata/1/" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance">
    <mydata>
        <item>
            <record>some</record>
            <record>123</record>
            <record xs:nil="true" />
            <record>random</record>
            <record>234</record>    
        </item>
        <item>
            <record>345</record>
            <record>in all</record>
            <record>these</record>
            <record>cells</record>
            <record>123asdf</record>            
        </item>
        <item>
            <record>how</record>
            <record>to</record>
            <record>import</record>
            <record>987654321</record>
            <record xs:nil="true" />
        </item>
    </mydata>
</myxml>

This is just a small sample. In fact there the XML is over a 100ML, it has over 200k on items and each item has 15 records but this sample will do just fine.

I know what each 'record' in the 'item' represents but for me its enough to import all record-values to column with varchar(100). Lets say this table:"

CREATE TABLE [dbo].[DataFromXml](
    [Column1] [varchar](100) NULL,
    [Column2] [varchar](100) NULL,
    [Column3] [varchar](100) NULL,
    [Column4] [varchar](100) NULL,
    [Column5] [varchar](100) NULL
) ON [PRIMARY]
GO

I'm able to get this done with this code:

CREATE TABLE XmlTable
(
    XMLData XML
)

INSERT INTO XmlTable(XMLData)
SELECT CONVERT(XML, BulkColumn) 
FROM OPENROWSET(BULK 'D:\myverylarge.xml', SINGLE_CLOB) AS x;

DECLARE @XML AS XML
SELECT @XML=XMLData FROM XmlTable

;WITH XMLNAMESPACES ('http://www.w3.org/2001/XMLSchema-instance' as xs, DEFAULT 'http://somenamespace.whatever.com/schemas/xmldata/1/')
INSERT INTO DataFromXml(Column1, Column2, Column3, Column4, Column5)
SELECT  ref.value('record[1][not(@xs:nil = "true")]' ,'varchar(100)') as Column1
        ,ref.value('record[2][not(@xs:nil = "true")]' ,'varchar(100)') as Column2
        ,ref.value('record[3][not(@xs:nil = "true")]' ,'varchar(100)') as Column3
        ,ref.value('record[4][not(@xs:nil = "true")]' ,'varchar(100)') as Column4
        ,ref.value('record[5][not(@xs:nil = "true")]' ,'varchar(100)') as Column5
        FROM @XML.nodes('/myxml/mydata/item') xmlData( ref )

This runs for a minute or 2 which might not be so bad. I dont have a good reference. My feeling is that this could be a lot faster since getting the XML (over 100MB) into the database using the OPENROWSET just takes seconds.

Can I optimize the insert and if so, how would I do that?

Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21
Stackberg
  • 300
  • 3
  • 12
  • 1
    "since getting the XML (over 100MB) into the database using the OPENROWSET just takes seconds." - sure? or" reading the file without any parsing" takes a few seconds. YOu do a REALLY large amount of processing there, which OPENROWSET does imho not do. Like not at all. – TomTom Jul 10 '20 at 14:34
  • I have never compared but you could also try OpenXml(). – Cetin Basoz Jul 10 '20 at 14:43

2 Answers2

2

Dealing with NULL values is something special in XML.

The definition of a NULL value in XML is not existing. So

<a>
    <b>hi</b>
    <c></c>
    <d/>
</a>
  • <a> is the root elelement.
  • <b> is an element with a text() node.
  • <c> is an empty element
  • <d> is a self-closing element
  • <e> is - uhm - not there...

Important hint: <c> and <d> are the same, absolutely no difference!

You can query for the element with

.value('(/a/b)[1]','nvarchar(100)')

And you can query for the the text() node specifically

.value('(/a/b/text())[1]','nvarchar(100)')

In this you find a possible answer (a bit hidden): You can do all your code without the NULL checking predicate, if you query for the text() node specifically.

Change this

ref.value('record[1][not(@xs:nil = "true")]' ,'varchar(100)')

to this

ref.value('(record[1]/text())[1]' ,'varchar(100)')

What might break this: If a <record>'s content may be an empty string you would get a NULL back and not ''. But it should be much faster... Hope, this is okay for you...

About performance: Read this answer. It covers your issue quite well. Especially the part where the time is consumed (follow the links in this answer).

Shnugo
  • 66,100
  • 9
  • 53
  • 114
2

Just to complement @Shnugo answer.

All credit goes to him.

This is your exact SQL statement. It should give you around 20% performance improvement. Please give it a shot.

;WITH XMLNAMESPACES ('http://www.w3.org/2001/XMLSchema-instance' as xs, DEFAULT 'http://somenamespace.whatever.com/schemas/xmldata/1/')
INSERT INTO DataFromXml(Column1, Column2, Column3, Column4, Column5)
SELECT  ref.value('(record[1]/text())[1]' ,'varchar(100)') as Column1
        ,ref.value('(record[2]/text())[1]' ,'varchar(100)') as Column2
        ,ref.value('(record[3]/text())[1]' ,'varchar(100)') as Column3
        ,ref.value('(record[4]/text())[1]' ,'varchar(100)') as Column4
        ,ref.value('(record[5]/text())[1]' ,'varchar(100)') as Column5
FROM @XML.nodes('/myxml/mydata/item') xmlData(ref);
Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21