
I have an application that is hitting an API. It does a query for all IDs in the object, and then has to query each item one at a time per ID. I'm doing this in a Parallel.For loop and adding each item's data to a row in a DataTable. Then I use SqlBulkCopy to send the DataTable to a SQL Server table.

If I do this without using Parallel.For, it works great. However, with Parallel.For, this line:

workrow["id"] = Guid.NewGuid();

is generating duplicate Guids. It's happening often, and the data fails to load into the SQL Server table because the id column in SQL is a primary key and doesn't allow duplicates. I tried locking:

                    lock (lockobject)
                    {
                        workrow["id"] = Guid.NewGuid();
                    }

This didn't help.
I tried not assigning an id to that field in hopes that SQL would generate it (the column does have a newid() default). That fails, saying it can't insert a null. I can't seem to just remove the id field from the DataTable, because then the columns don't align when I do the SqlBulkCopy.

Can someone help me here? I either need to figure out how to get Guid.NewGuid() to STOP producing duplicates, OR I need a way to not pass in the id (always the first field in the DataTable) so that SQL will generate it.

Here is the code I use to generate one of the tables:

        public static DataTable MakeWorkflowTable()
        {
            DataTable Workflow = new DataTable("Workflow");
            Workflow.Columns.Add("id", typeof(Guid));
            Workflow.Columns.Add("OrgInfoID", typeof(Guid));
            Workflow.Columns.Add("Name", typeof(string));
            Workflow.Columns.Add("Active", typeof(string));
            Workflow.Columns.Add("Description", typeof(string));
            Workflow.Columns.Add("Object", typeof(string));
            Workflow.Columns.Add("Formula", typeof(string));
            Workflow.Columns.Add("ManageableState", typeof(string));
            Workflow.Columns.Add("NameSpacePrefix", typeof(string));
            Workflow.Columns.Add("TDACount", typeof(int));
            Workflow.Columns.Add("TriggerType", typeof(string));
            Workflow.Columns.Add("CreatedDate", typeof(DateTime));
            Workflow.Columns.Add("CreatedBy", typeof(string));
            Workflow.Columns.Add("LastModifiedDate", typeof(DateTime));
            Workflow.Columns.Add("LastModifiedBy", typeof(string));
            return Workflow;
        }

Here is the code I use to send it to the SQL server:

        public static void SendDTtoDB(ref DataTable dt, ref SqlConnection cnn, string TableName)
        {
            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(cnn))
            {
                bulkCopy.DestinationTableName = TableName;
                try
                {
                    bulkCopy.WriteToServer(dt);
                    dt.Clear();
                }
                catch (Exception e)
                {
                    logger.Warn("SendDTtoDB {TableName}: ORGID: {ORGID} : {Message}", TableName, dt.Rows[0]["OrgInfoID"], e.Message);
                    if (e.Message.Contains("PRIMARY KEY"))
                    {
                        foreach(DataRow row in dt.Rows)
                        {
                            logger.Warn("ID: {id}", row["id"]);
                        }
                    }
                }
            }

        }

As you can see in the catch statement, I set it to write out the IDs to the log so I could see them for myself and, sure enough, there is a duplicate there. So frustrating! I really don't want to take out the Parallel.For and single-thread it if I don't have to.

Per request, here is the code with the Parallel.For:

                if (qr.totalSize > 0)
                {
                    object lockobject = new object();
                    Parallel.For(0, qr.records.Length, i =>
                    {
                        ToolingService.CustomTab1 vr = new ToolingService.CustomTab1();

                        vr = (ToolingService.CustomTab1)qr.records[i];
                        string mdSOQL = "Select FullName, description, ManageableState, MasterLabel, NamespacePrefix, Type, Url, CreatedDate, CreatedBy.Name, "
                            + "LastModifiedDate, LastModifiedBy.Name From CustomTab where id='" + vr.Id + "'";
                        ToolingService.QueryResult mdqr = new ToolingService.QueryResult();
                        ToolingService.CustomTab1 vrmd = new ToolingService.CustomTab1();
                        mdqr = ts.query(mdSOQL);
                        vrmd = (ToolingService.CustomTab1)mdqr.records[0];

                        DataRow workrow = CustomTabs.NewRow();
                        lock (lockobject)
                        {
                            workrow["id"] = Guid.NewGuid();
                        }
                        workrow["OrgInfoID"] = _orgDBID;
                        workrow["FullName"] = vrmd.FullName;
                        workrow["Description"] = vrmd.Description ?? Convert.DBNull;
                        workrow["ManageableState"] = vrmd.ManageableState;
                        workrow["MasterLabel"] = vrmd.MasterLabel ?? Convert.DBNull;
                        workrow["NameSpacePrefix"] = vrmd.NamespacePrefix ?? Convert.DBNull;
                        workrow["Type"] = vrmd.Type ?? Convert.DBNull;
                        workrow["URL"] = vrmd.Url ?? Convert.DBNull;
                        workrow["CreatedDate"] = vrmd.CreatedDate ?? Convert.DBNull;
                        if (vrmd.CreatedBy == null)
                        {
                            workrow["CreatedBy"] = Convert.DBNull;
                        }
                        else
                        {
                            workrow["CreatedBy"] = vrmd.CreatedBy.Name;
                        }
                        workrow["LastModifiedDate"] = vrmd.LastModifiedDate ?? Convert.DBNull;
                        if (vrmd.LastModifiedBy == null)
                        {
                            workrow["LastModifiedBy"] = Convert.DBNull;
                        }
                        else
                        {
                            workrow["LastModifiedBy"] = vrmd.LastModifiedBy.Name;
                        }
                        lock (CustomTabs)
                        {
                            CustomTabs.Rows.Add(workrow);
                        }

                    });
                    OrgTables.SendDTtoDB(ref CustomTabs, ref _cnn, "OrgCustomTabs");
                }
Mike Jones
    Can you show us the parallel.for? Guid.NewGuid should not be producing duplicates; it's likely something you're doing, but we can't inspect that code. –  Feb 12 '20 at 22:36
  • Updated with the code with the parallel.for loop. – Mike Jones Feb 12 '20 at 22:41
  • 9 times out of 10 if you are using `Parallel.For` for a database, you are doing something wrong to start with. Firstly, IO workloads are not suited for those TPL methods. Secondly, Database access is inherently not thread safe. Thirdly, this most likely should be done in a single batch query – TheGeneral Feb 12 '20 at 22:41
  • It's done in a loop because I have to query the API over and over to get each item in the object. I'm doing the Parallel.For loop so I can speed up the process. If I need to get 450 items or more one at a time, I'd like to multithread that for speed purposes. The database access isn't being done in the loop, just building the datatable is because once I get the data back from the API I need to store it. – Mike Jones Feb 12 '20 at 22:58
  • I notice you are creating the `DataRow` (`DataRow workrow = CustomTabs.NewRow();`) outside of the `lock`, yet according to [Parallel.ForEach and DataTable - Isn't DataTable.NewRow() a thread safe “read” operation?](https://stackoverflow.com/q/56224441/150605) `NewRow()` is not thread-safe. So perhaps you're not actually creating the same `Guid` twice but a single `Guid` value is getting assigned to multiple rows. – Lance U. Matthews Feb 12 '20 at 23:10
  • If you want to speed up bulk inserting, what you want to do is abandon `DataTable` altogether. It is by far the least efficient way to get data to `SqlBulkCopy`. `SqlBulkCopy` is a streaming interface anyway, so gathering data in parallel gains you exactly nothing unless producing the data itself really is the bottleneck. Look at streaming through the overload that takes an `IDataReader`. Do not fetch the source rows one by one in parallel, use an `IN` or a table-valued parameter. – Jeroen Mostert Feb 12 '20 at 23:13
  • "so that SQL will generate the id" : you could use SqlBulkCopyColumnMapping to map the source to target. Target columns not mapped will get their defaults https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlbulkcopycolumnmapping – lptr Feb 12 '20 at 23:31
  • Please note, if the guid is generated in the database, you could still get corrupt data in your DataTable when using multiple threads and improper usage of locks ;) – Julian Feb 12 '20 at 23:38
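lptr's column-mapping suggestion would look roughly like this. This is a sketch of a variant of the question's `SendDTtoDB`, assuming the destination table is `Workflow` and its `id` column has the `newid()` default described in the question. Once any explicit mappings are added, `SqlBulkCopy` copies only the mapped columns, so the unmapped `id` falls back to its default on the server:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class BulkSender
{
    public static void SendWithoutId(DataTable dt, SqlConnection cnn, string tableName)
    {
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(cnn))
        {
            bulkCopy.DestinationTableName = tableName;

            // Map every source column EXCEPT id. Because id is never mapped,
            // SQL Server applies the column's default (newid()) on insert.
            foreach (DataColumn col in dt.Columns)
            {
                if (col.ColumnName != "id")
                    bulkCopy.ColumnMappings.Add(col.ColumnName, col.ColumnName);
            }

            bulkCopy.WriteToServer(dt);
        }
    }
}
```

This sidesteps the column-alignment problem without removing the id column from the DataTable.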

3 Answers

12

I have seen this issue before. There is no problem with Guid.NewGuid(); the real problem is that DataTable isn't thread-safe!

DataTable is simply not designed or intended for concurrent usage (in particular where there is any form of mutation involved).

See Thread safety for DataTable

Also related: c# DataGridView DataTable internal index corrupted in parallel loop

It's done in a loop because I have to query the API over and over to get each item in the object. I'm doing the Parallel.For loop so I can speed up the process. If I need to get 450 items or more one at a time, I'd like to multithread that for speed purposes. The database access isn't being done in the loop, just building the datatable is because once I get the data back from the API I need to store it.

You could create a type and add instances to a ConcurrentBag<T> or ConcurrentQueue<T> (or another concurrent collection, see MS Docs) from multiple threads - those are thread-safe :)

Then, afterwards, you could build the DataTable on a single thread. Or maybe skip the whole DataTable if possible for your use case.
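A minimal sketch of that approach (the `WorkflowRecord` type, column set, and counts here are made up for illustration): gather the per-item results in a ConcurrentBag<T> inside the parallel phase, then build the DataTable on one thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Data;
using System.Threading.Tasks;

public class WorkflowRecord
{
    // Guid.NewGuid() itself is thread-safe; only the DataTable isn't.
    public Guid Id { get; } = Guid.NewGuid();
    public string Name { get; set; }
}

public static class Loader
{
    public static DataTable BuildTable(int itemCount)
    {
        // Parallel phase: only the thread-safe bag is touched by multiple threads.
        var bag = new ConcurrentBag<WorkflowRecord>();
        Parallel.For(0, itemCount, i =>
        {
            // ... call the API for item i here ...
            bag.Add(new WorkflowRecord { Name = "Item " + i });
        });

        // Single-threaded phase: the DataTable is only ever touched by this thread.
        var table = new DataTable("Workflow");
        table.Columns.Add("id", typeof(Guid));
        table.Columns.Add("Name", typeof(string));
        foreach (var rec in bag)
            table.Rows.Add(rec.Id, rec.Name);
        return table;
    }
}
```

No locks are needed in user code: the parallel work never mutates shared, non-thread-safe state.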

Julian
  • From what I've read, as long as you use a lock when adding the row to the datatable, you should be fine. Is that not correct? – Mike Jones Feb 12 '20 at 23:03
  • I guess you need a lock on the whole table then. But I never try to recommend locks when there are thread-safe alternatives. – Julian Feb 12 '20 at 23:07
  • Updated the answer. – Julian Feb 12 '20 at 23:07
  • Thank you so much for your time and knowledge! I'm off to Youtube and Google to learn about ConcurrentBag's. :) I never give up so I'll figure them out eventually! – Mike Jones Feb 13 '20 at 01:11
3

The thing is that having to use a lock inside the Parallel.ForEach on a DataTable sort of defeats the purpose of using Parallel.ForEach in the first place; however, I am surprised that you don't get exceptions when you call DataRow workrow = CustomTabs.NewRow(); because in my test I got an index-corrupted exception. I had to actually wrap the call to NewRow inside a lock. Something like this:

Parallel.ForEach(data, x =>
{
    DataRow row = null;
    lock (lockRow)
    {
        row = dt.NewRow();
        row["Guid"] = Guid.NewGuid();
    }

    // ... populate the rest of the row here ...

    lock (lockObj)
    {
        dt.Rows.Add(row);
    }
});

Where lockObj and lockRow are two separate static objects, instantiated as:

static object lockObj = new object();
static object lockRow = new object();

And that worked for me, adding 1 million rows to the DataTable and making sure that all the Guids were unique.

With all the above said, I would strongly recommend writing the code as suggested by Julian, or creating a class that implements IDataReader (which you can use with SqlBulkCopy) and uploading the data using that.
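For the IDataReader route, one low-effort option is the FastMember NuGet package (an assumption on my part that adding a dependency is acceptable; the `WorkflowRecord` type and column list below are illustrative). Its `ObjectReader` exposes a sequence of plain objects as an IDataReader that SqlBulkCopy can stream from, with no DataTable at all:

```csharp
using System;
using System.Collections.Concurrent;
using System.Data.SqlClient;
using FastMember;   // NuGet: FastMember

public class WorkflowRecord
{
    public Guid OrgInfoID { get; set; }
    public string Name { get; set; }
}

public static class BulkLoader
{
    public static void Send(ConcurrentBag<WorkflowRecord> records, SqlConnection cnn)
    {
        using (var bulkCopy = new SqlBulkCopy(cnn))
        // ObjectReader streams the POCOs as an IDataReader, so SqlBulkCopy
        // pulls rows directly instead of materializing a DataTable first.
        using (var reader = ObjectReader.Create(records, "OrgInfoID", "Name"))
        {
            bulkCopy.DestinationTableName = "Workflow";
            bulkCopy.WriteToServer(reader);
        }
    }
}
```

The `records` bag can be filled from the Parallel.For, since ConcurrentBag is thread-safe.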

Icarus
  • Nice that you've got it working! I agree that a "lock inside the Parallel.ForEach in a DataTable sort of defeats the purpose of using the Parallel.ForEach". Therefore I proposed a thread-safe way without locks. (Well, of course the concurrent collections aren't guaranteed lock-free, but they minimize locking.) – Julian Feb 12 '20 at 23:31
  • This is exactly what was happening! After posting this I found this article [link](https://programmersunlimited.wordpress.com/2012/01/26/datatables-thread-safe/) and figured this part out. I've made that change and it seems to have resolved the problem. While ultimately Julian's answer may be the best in the long run, I'm not that skilled yet and looking at the docs, it's greek to me. I never stop digging and learning so I will figure it out but this gets me back up and running for now. Thanks! – Mike Jones Feb 13 '20 at 01:06
-1

I used a Guid generator that I lock every time I access it:

lock (guidGenerator)
{
  entity.Id = guidGenerator.NewGuid();
}

It worked fine for me. It's a different approach, though.