
I posted this question:

INSERT Statement Expensive Queries on Activity Monitor

As you will see, the XML structure has different levels.

I have created different tables:

Organisation = organisation_id (PRIMARY_KEY)
Contacts = organisation_id (FOREIGN_KEY)
Roles = organisation_id (FOREIGN_KEY)
Rels = organisation_id (FOREIGN_KEY)
Succs = organisation_id (FOREIGN_KEY)
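
For reference, here is a minimal sketch of that schema (the non-key columns below are placeholders; the real tables have more fields):

-- Minimal sketch of the schema; non-key columns are placeholders
CREATE TABLE dbo.Organisation (
    organisation_id bigint IDENTITY(1,1) PRIMARY KEY,
    name            nvarchar(200) NOT NULL
);

CREATE TABLE dbo.Contacts (
    contact_id      bigint IDENTITY(1,1) PRIMARY KEY,
    organisation_id bigint NOT NULL
        REFERENCES dbo.Organisation (organisation_id),
    contact_type    nvarchar(50),
    contact_value   nvarchar(50)
);

-- Roles, Rels and Succs follow the same parent/child pattern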

What I want is to generate the organisation_id and do the inserts on each table in a cascading manner. At the moment the process takes almost 2 hours for 300k records. I have 3 approaches:

  1. Convert the XML to a list of objects and send them in batches of 1,000 as JSON text to a stored procedure that uses OPENJSON.

  2. Convert the XML to a list of objects, send them in batches of 1,000, save each batch as a JSON file that SQL Server can read, and pass the file path to a stored procedure which then opens the JSON file using OPENROWSET and OPENJSON.

  3. Send the path of the XML file to a stored procedure, then use OPENROWSET and OPENXML.

All approaches (1-3) insert the data into a flat temp table and then iterate over each row, calling a different INSERT stored procedure for each of the tables (a sketch of the shredding step is shown below). Approach #3 fails with errors on 300k records but works on 4 records.
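
For clarity, here is a rough sketch of what the OPENJSON shredding in approach #1 looks like (the procedure name and JSON property names below are illustrative, not my actual code):

-- Rough sketch of approach #1; procedure name and JSON shape are illustrative
CREATE PROCEDURE dbo.usp_ImportOrganisationBatch
    @json nvarchar(max)
AS
BEGIN
    SET NOCOUNT ON;

    -- shred the JSON batch into a flat temp table
    SELECT o.[name], c.contact_type, c.contact_value
    INTO #flat
    FROM OPENJSON(@json)
         WITH ([name]   nvarchar(200) '$.name',
               contacts nvarchar(max) '$.contacts' AS JSON) AS o
    CROSS APPLY OPENJSON(o.contacts)
         WITH (contact_type  nvarchar(50) '$.type',
               contact_value nvarchar(50) '$.value') AS c;

    -- ...then each row of #flat is fed to the per-table INSERT procedures
END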

The other question is: will it be much faster if I use a physical table rather than a temp table?

-------UPDATE------- As explained in the linked question, I was using a WHILE loop. Someone suggested in the comments doing a batch insert on each of the tables. The problem is that, for Contacts for example, I can only do this if I know the organisation_id:

    -- note: the ns1 prefix has to be declared first, e.g. via WITH XMLNAMESPACES('...' AS ns1)
    select
        organisation_id = IDENTITY( bigint ) -- IF I CAN GENERATE THE ACTUAL ORGANISATION ID
        ,name           = Col.value('.', 'nvarchar(20)')
        ,contact_type   = c.value('(./@type)[1]', 'nvarchar(50)')
        ,contact_value  = c.value('(./@value)[1]', 'nvarchar(50)')
    into
        #temporganisations
    from
        @xml.nodes('ns1:OrgRefData/Organisations/Organisation') as Orgs(Col)
    outer apply Orgs.Col.nodes('Contacts/Contact') as Cs(c)

Then, when I do the batch insert:

insert into contacts
    (
      organisation_id, type, value
    )
select
   torg.organisation_id -- if this is the actual id then perfect
   ,torg.contact_type
   ,torg.contact_value
from #temporganisations torg
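
For what it's worth, the kind of reservation I have in mind would look something like this, if organisation_id were generated from a SEQUENCE rather than IDENTITY (the OrganisationSeq name is hypothetical):

-- Hypothetical: assumes organisation_id defaults to NEXT VALUE FOR dbo.OrganisationSeq
DECLARE @first sql_variant;

-- atomically reserve a contiguous block of 1000 sequence values
EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.OrganisationSeq',
     @range_size        = 1000,
     @range_first_value = @first OUTPUT;

DECLARE @base bigint = CAST(@first AS bigint);

-- number the shredded rows and add the base offset, so the same
-- organisation_id can be stamped on the child rows as well
SELECT organisation_id = @base + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1,
       name            = Col.value('.', 'nvarchar(20)')
INTO   #orgids
FROM   @xml.nodes('ns1:OrgRefData/Organisations/Organisation') AS Orgs(Col);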
  • What actually is your question? You seem to be explaining what you are doing, and that it works? – Dale K May 09 '21 at 20:06
  • At the moment, 300k records are processed in almost 2 hours. I'm trying to find a way to speed it up. At the moment I'm doing a while loop to get the data from the temp table that has 300k rows and calling 5 stored procedures to insert into each table. I was hoping that instead of having an identity on the temp table I could sort of reserve the ID for the organisation_id, so I can use that to do a select and a batch insert on each of the tables instead of the while loop. I'll update the post to try to explain what I mean. – Reydan Gatchalian May 09 '21 at 20:22
  • Not sure if my observation is correct. When I did it in batches (approach #2), the processing in the while loop was a little bit faster (around 200 records per second) than when I used OPENXML (approach #3); since the table has 300k rows, it's probably taking a long time in the while loop (around 2-3 records per second). – Reydan Gatchalian May 09 '21 at 20:43
  • How about you scrap all of that and use something like `SqlBulkCopy` in C#, or the `bcp` command line, to bulk-insert the whole lot? It will probably be an order of magnitude faster. If you are doing an insert into tables, why convert it to XML or JSON and back again? – Charlieface May 09 '21 at 20:44
  • @Charlieface I have done the SqlBulkCopy for organisation information only and it's fast. The only trouble is the other tables where I need to link the organisation: I can create a bulk copy for Contact, but I need the organisation_id. I think it's more expensive to loop within the API and call the stored procedures one at a time than to process by batch. That is why I need to convert to JSON. Note that the raw data is XML. – Reydan Gatchalian May 09 '21 at 20:52
  • So then use Table-Valued Parameters; they are basically read-only table variables, and passing them through is as fast as BulkCopy/bcp. Then you insert into the parent table with an `output` clause which captures the new IDs into another table variable, and join that to the child-table insert to get the FKs. Are you using C# primarily? – Charlieface May 09 '21 at 20:56
  • Also, how would you identify a particular `OrganizationId`? Does it have any kind of unique identifying information (apart from the ID you want to generate)? – Charlieface May 09 '21 at 21:03
  • Organisation_Id is the SQL Server generated ID. There are 5 fields to check for the correct organisation, which makes it a little bit more difficult. Hence, in my update above, I was hoping I could hijack and reserve the organisation_id and put it in the temp table as per my example. – Reydan Gatchalian May 09 '21 at 21:31

2 Answers


You can try the following conceptual example.

SQL

-- DDL and sample data population, start
USE tempdb;
GO

DROP TABLE IF EXISTS #city;
DROP TABLE IF EXISTS #state;

-- parent table
CREATE TABLE #state  (
   stateID INT IDENTITY PRIMARY KEY, 
   stateName VARCHAR(30), 
   abbr CHAR(2), 
   capital VARCHAR(30)
);
-- child table (1-to-many)
CREATE TABLE #city (
   cityID INT IDENTITY, 
   stateID INT NOT NULL FOREIGN KEY REFERENCES #state(stateID), 
   city VARCHAR(30), 
   [population] INT,
   PRIMARY KEY (cityID, stateID, city)
);
-- mapping table to preserve IDENTITY ids
DECLARE @idmapping TABLE (GeneratedID INT PRIMARY KEY,
    NaturalID VARCHAR(20) NOT NULL UNIQUE);

DECLARE @xml XML =
N'<root>
   <state>
      <StateName>Florida</StateName>
      <Abbr>FL</Abbr>
      <Capital>Tallahassee</Capital>
      <cities>
         <city>
            <city>Miami</city>
            <population>470194</population>
         </city>
         <city>
            <city>Orlando</city>
            <population>285713</population>
         </city>
      </cities>
   </state>
   <state>
      <StateName>Texas</StateName>
      <Abbr>TX</Abbr>
      <Capital>Austin</Capital>
      <cities>
         <city>
            <city>Houston</city>
            <population>2100263</population>
         </city>
         <city>
            <city>Dallas</city>
            <population>5560892</population>
         </city>
      </cities>
   </state>
</root>';
-- DDL and sample data population, end

;WITH rs AS 
(
    SELECT stateName   = p.value('(StateName/text())[1]', 'VARCHAR(30)'),
           abbr         = p.value('(Abbr/text())[1]', 'CHAR(2)'),
           capital      = p.value('(Capital/text())[1]', 'VARCHAR(30)')
    FROM   @xml.nodes('/root/state') AS t(p)
 )
 MERGE #state AS o
 USING rs ON 1 = 0
 WHEN NOT MATCHED THEN
    INSERT(stateName, abbr, capital)  
       VALUES(rs.stateName, rs.Abbr, rs.Capital)
 OUTPUT inserted.stateID, rs.stateName 
   INTO @idmapping (GeneratedID, NaturalID);

;WITH Details AS 
(
    SELECT NaturalID   = p.value('(StateName/text())[1]', 'VARCHAR(30)'),
           city         = c.value('(city/text())[1]', 'VARCHAR(30)'),
           [population]   = c.value('(population/text())[1]', 'INT')
    FROM   @xml.nodes('/root/state') AS A(p)   -- parent
      CROSS APPLY A.p.nodes('cities/city') AS B(c) -- child
) 
INSERT #city (stateID, city, [Population])
SELECT m.GeneratedID, d.city, d.[Population]
FROM   Details AS d
   INNER JOIN @idmapping AS m ON d.NaturalID = m.NaturalID;

-- test
SELECT * FROM #state;
SELECT * FROM @idmapping;
SELECT * FROM #city;
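
The reason for the odd-looking MERGE ... ON 1 = 0 is that a plain INSERT ... OUTPUT can only reference columns of the inserted pseudo-table, while the OUTPUT clause of a MERGE can also reference source columns such as rs.stateName; that is what allows the generated stateID to be captured next to its natural key in @idmapping.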
Yitzhak Khabinsky

I would suggest that you shred the XML client-side and switch over to doing some kind of bulk copy; this will generally perform much better.

At the moment, you cannot do a normal bcp or SqlBulkCopy, because you also need the foreign key. You need a way to uniquely identify an Organisation within the batch, and you say that is difficult owing to the number of columns needed for it.

Instead, you need to generate some kind of unique ID client-side; an incrementing integer will do. You then assign this ID to the child objects as you shred the XML into DataTables / IEnumerables / CSV files.

You have two options:

  • The easiest, in many respects, is to not use IDENTITY for OrganisationId and just insert your generated ID directly. This means you can leverage standard SqlBulkCopy procedures.

The downside is that you lose the benefit of automatic IDENTITY assignment, but you could instead use the SqlBulkCopyOptions.KeepIdentity option, which applies only to this insert, and carry on with IDENTITY for other inserts. You would need to estimate a correct batch of IDs that won't clash; see the sketch below for realigning the seed afterwards.

A variation on this is to use GUIDs, which are always unique. I don't really recommend this option.
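
If you do keep integer IDs and insert a batch of them explicitly, you can afterwards ask SQL Server to realign the IDENTITY counter; a minimal sketch (the table name is assumed):

-- if the current identity value is lower than the maximum value in the
-- column, this resets it to that maximum, so later IDENTITY inserts won't clash
DBCC CHECKIDENT ('dbo.Organisation');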


  • If you don't want to do this, then it gets quite a bit more complex.

You need to define equivalent Table Types for each of the tables. Each has a column for the temporary primary key of the Organisation:

CREATE TYPE OrganisationType AS TABLE
    (TempOrganisationID int PRIMARY KEY,
     SomeData varchar(100) /* ...remaining Organisation columns... */);

Pass the shredded XML through as Table-Valued Parameters; you would have @Organisations, @Contacts, etc.
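
On the SQL Server side that means a procedure along these lines (the name is illustrative, and ContactType is assumed to be defined analogously to OrganisationType, each carrying TempOrganisationID):

-- Illustrative signature; ContactType is assumed to mirror OrganisationType
CREATE PROCEDURE dbo.ImportOrganisations
    @Organisations OrganisationType READONLY,
    @Contacts      ContactType      READONLY
AS
BEGIN
    SET NOCOUNT ON;
    -- body: the MERGE and child-table INSERTs shown below
END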

Then you would have SQL along the following lines:

-- This stores the real IDs
DECLARE @OrganisationIDs TABLE
    (TempOrganisationID int PRIMARY KEY, OrganisationId int NOT NULL);

-- We need a hack to get OUTPUT to work with non-inserted columns, so we use a weird MERGE
MERGE INTO Organisation t
USING @Organisations s
  ON 1 = 0   -- never match
WHEN NOT MATCHED THEN
  INSERT (SomeData, ...)
  VALUES (s.SomeData, ...)
OUTPUT
    s.TempOrganisationID, inserted.OrganisationID
INTO @OrganisationIDs
    (TempOrganisationID, OrganisationID);

-- We now have each TempOrganisationID matched up with a real OrganisationID
-- Now we can insert the child tables

INSERT Contact
    (OrganisationID, [Type], [Value]...)
SELECT o.OrganisationID, c.[Type], c.[Value]
FROM @Contacts c
JOIN @OrganisationIDs o ON o.TempOrganisationID = c.TempOrganisationID;

-- and so on for all the child tables
  • Instead of saving the IDs to a table variable, you could stream the OUTPUT back to the client, have the client join the IDs to the child tables, and then BulkCopy them back again as part of the child tables. This makes the SQL simpler; however, you still need the MERGE, and you risk complicating the client code significantly.
Charlieface