
I am having trouble updating my entities with Parallel.ForEach. The program works fine when I update the entities with a plain foreach, but if I use Parallel.ForEach it gives me the error: "ArgumentException: An item with the same key has already been added". I have no idea why this happens - shouldn't it be thread safe? Why am I getting this error, and how can I resolve it?

The program itself gets some data from one database and copies it to another. If a datarow with the same guid exists in the second database (see below) and the status is unchanged, the matching datarow must be updated. If there is a match but the status has changed, the modifications must be ignored. Finally, if there is no match in the second database, the datarow must be inserted there (synchronizing the two databases). I just want to speed the process up somehow, which is why I first thought of parallel processing.

(I am using Autofac as an IoC container for dependency injection, if that matters.)

Here is the code snippet which tries to update:

    /* @param reports: data from the first database */
    public string SynchronizeData(List<Reports> reports, int statusid)
    {
        // reportdataindatabase - the second database data, List() actually selects all, see next code snippet

        List<Reports> reportdataindatabase = unitOfWorkTAFeedBack.ReportsRepository.List().ToList();

        int allcount = reports.Count;
        int insertedcount = 0;
        int updatedcount = 0;
        int ignoredcount = 0;

        // DOES NOT WORK, GIVES THE ERROR
        Parallel.ForEach(reports, r =>
        {
            var guid = reportdataindatabase.FirstOrDefault(x => x.AssignmentGUID == r.AssignmentGUID);

            if (guid == null)
            {
                unitOfWorkTAFeedBack.ReportsRepository.Add(r); // an insert on the repository
                insertedcount++;
            }
            else
            {
                if (guid.StatusId == statusid)
                {
                    r.ReportsID = guid.ReportsID;
                    unitOfWorkTAFeedBack.ReportsRepository.Update(r); // update on the repo
                    updatedcount++;
                }
                else
                {
                    ignoredcount++;
                }
            }
        });




 /* WORKS PERFECTLY BUT RELATIVELY SLOW - takes 80 seconds to update 1287 records
        foreach (Reports r in reports)
        {
            var guid = reportdataindatabase.FirstOrDefault(x => x.AssignmentGUID == r.AssignmentGUID); // find match between the two databases

            if (guid == null)
            {
                unitOfWorkTAFeedBack.ReportsRepository.Add(r); // no match, insert
                insertedcount++;
            }
            else
            {
                if (guid.StatusId == statusid)
                {
                    r.ReportsID = guid.ReportsID;
                    unitOfWorkTAFeedBack.ReportsRepository.Update(r); 
                    updatedcount++;
                }
                else
                {
                    ignoredcount++;
                }

            }

        } */

        unitOfWorkTAFeedBack.Commit(); // this only calls SaveChanges() on DbContext object

        int allprocessed = insertedcount + updatedcount + ignoredcount;

        string result = "Synchronization finished.  " + allprocessed + " reports processed out of " + allcount + ", " 
            + insertedcount + " has been inserted, " + updatedcount + " has been updated and " 
            + ignoredcount + " has been ignored. \n Press a button to dismiss this window."  ;

        return result;

    }

The program breaks in the Update method of this repository class (with Parallel.ForEach; there is no problem with the standard foreach):

public class EntityFrameworkReportsRepository : IReportsRepository
{

    private readonly TAFeedBackContext tAFeedBackContext;

    public EntityFrameworkReportsRepository(TAFeedBackContext tAFeedBackContext)
    {
        this.tAFeedBackContext = tAFeedBackContext;
    }

    public void Add(Reports r)
    {
        tAFeedBackContext.Reports.Add(r);
    }

    public void Delete(int Id)
    {
        var obj = tAFeedBackContext.Reports.Find(Id);
        tAFeedBackContext.Reports.Remove(obj);
    }

    public Reports Get(int Id)
    {
        var obj = tAFeedBackContext.Reports.Find(Id);
        return obj;
    }

    public IQueryable<Reports> List()
    {
        return tAFeedBackContext.Reports.AsNoTracking();
    }

    public void Update(Reports r)
    {
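        // Note: under Parallel.ForEach several threads hit this one shared DbContext at once.
        // The change tracker is not thread-safe, so concurrent Entry/Attach calls can corrupt
        // its internal key dictionary - the likely source of the ArgumentException
        // "An item with the same key has already been added".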
        var entry = tAFeedBackContext.Entry(r); // The Program Breaks At This Point!
        if (entry.State == EntityState.Detached)
        {
            tAFeedBackContext.Reports.Attach(r);
            tAFeedBackContext.Entry(r).State = EntityState.Modified;
        }
        else
        {
            tAFeedBackContext.Entry(r).CurrentValues.SetValues(r);
        }
    }


}
  • Trying to execute N updates in parallel is far worse than executing a batch of N updates. You can't fix bad data access code by running more of it; this just adds concurrency conflicts on top of the already bad performance – Panagiotis Kanavos Jan 13 '21 at 10:39
  • As for the "repository", that's the actual cause of all the trouble. Using a "generic repository" on top of a high-level ORM is an ugly *anti*pattern. A DbSet is already a Repository, a DbContext is already a Unit-of-Work. Changes are only saved to the database when `SaveChanges` is called. – Panagiotis Kanavos Jan 13 '21 at 10:41
  • `unitOfWorkTAFeedBack.ReportsRepository.Add` I doubt that is thread-safe. – mjwills Jan 13 '21 at 10:42
  • As for using `Parallel.ForEach` here, it adds *serious* overhead and delays. Instead of using one query to read 100 items by ID, it tries to execute 100 concurrent single-item queries. Even if that worked, it would mean 100 times the connection overhead, 100 times the IO on the server, and contention for access to disks and memory. – Panagiotis Kanavos Jan 13 '21 at 10:44
  • Why are you trying to load all items concurrently instead of using one query? Did you encounter a real problem? Don't know how to use multiple IDs? `myContext.Reports.Where(rep=>ids.Contains(rep.ID))` will be translated to `WHERE ID IN (@id1, @id2, @id3.....)` – Panagiotis Kanavos Jan 13 '21 at 10:46
  • `The program itself get some data from a database and copy it to another one.` if that's all, you don't need an ORM at all. You'd get orders of magnitude better performance (not 10x, 100-1000x better) if you used a simple DataReader to read the source data and copy it to the target with SqlBulkCopy, or the equivalent for your database. ORMs aren't meant for ETL. There are no business objects or behavior in data movement and transformation; the domain objects are Table, Row, Field and Value, not Reports or Feedback. [A sketch of this approach follows the comment thread.] – Panagiotis Kanavos Jan 13 '21 at 10:50
  • You can use e.g. Dapper to execute a query in a single line and return a DbDataReader you can pass to SqlBulkCopy – Panagiotis Kanavos Jan 13 '21 at 10:51
  • You could also try storing the entities to update in a separate list and then updating them outside of the parallel loop. – misticos Jan 13 '21 at 11:05
  • I do not see why this would be an antipattern. What if I have to change from Entity Framework to something else? Switch back to ADO.NET, or something newer comes along? The DbSet class is found in the Entities dll, as is DbContext. But the IRepositories can remain unchanged - I am working against these abstractions, I do not rely on concrete classes. EntityFrameworkRepository is just one implementation of IRepository. What is the problem with that? – Newbie1001 Jan 13 '21 at 11:45
  • You've got a point. I could have gathered the ids from the first list, found the matches in the second and updated them. I will check it out and let you know the results. – Newbie1001 Jan 13 '21 at 11:54
  • "The program itself get some data from a database and copy it to another one" That is not true, I just write that for the sake of simplicity. The truth is, the first database I do not have direct access - the initial data comes from Microsoft Project 2016's Server generated database, and I only have access to it through it's OData API. The generated response contains the field values, which is the copy of my Reports (Reports is a DTO) fields. And the second database is the copy of the MPS database with some extra fields I have to use. – Newbie1001 Jan 13 '21 at 12:05
  • @Newbie1001 update the question and make it *crystal clear* that you use OData and that you read from Project Server 2016. I used to be a SharePoint MVP. I've installed Project Server numerous times. I'm using OData in Blazor no less and have to dig into the actual source to troubleshoot bugs. And I had **no idea** that what you posted had anything to do with OData. – Panagiotis Kanavos Jan 14 '21 at 08:25
  • `And the second database is the copy of the MPS database with some extra fields I have to use.` what does that mean? There's not just a single database, and some of them aren't supposed to be used by other applications. You most definitely **don't** need to copy any of the databases, not even the reporting database. Project Server has mechanisms not only to add extra fields, but to include them in reports and cubes. It can also connect to *external* databases to retrieve data. Cloning an MPS database is just **WRONG**. How are you going to display that data in PS's reports now? – Panagiotis Kanavos Jan 14 '21 at 08:29
  • Even if you had to clone the database - which you don't, because PS already allows you to include other tables and databases - you could use SQL Server's replication or even a simple data copying job. EF isn't an ETL tool. And even if you don't have access to the database, you don't need EF to *insert* the objects you just got. All you need is SqlBulkCopy and a way to convert the objects you got into a DataReader. That's what [ObjectReader](https://github.com/mgravell/fast-member#ever-needed-an-idatareader) does. Write the data into a staging table, then update your existing data – Panagiotis Kanavos Jan 14 '21 at 08:34
  • Of course the real solution is to get rid of the clone. What is the *actual* problem you tried to solve and why did you try to clone the data instead of configuring Project Server? Why not the OLAP cubes? Or read from the reporting database? – Panagiotis Kanavos Jan 14 '21 at 08:38
  • Or actually include the OData as a source to your reports? That's what OData is for. To allow integration with other systems, especially reporting systems. Excel can read from OData directly (and PS). So does Power BI. – Panagiotis Kanavos Jan 14 '21 at 08:40
  • @PanagiotisKanavos - to keep it simple, let's say Project Server proved to be slow at generating reports and does not contain the right mechanism for the task approval or rejection process (or it proved too difficult to use - I am not sure, I am not a PS expert), which its users requested. They want to use it in a mobile application later, so the idea behind this is: I get the data from OData and copy the fields of "Assignments" (Projects – Newbie1001 Jan 14 '21 at 08:56
  • Then, through a web application, they will access it on Android devices and make some changes in this database, and then this web application will update Project Server's database through its API at certain times (a batch update, because opening a connection and updating the fields one by one takes too much time, and projects must be checked out through the process - if someone checks a project out, others cannot make modifications to it). – Newbie1001 Jan 14 '21 at 09:05
  • To give you a clear view of the problem: there are thousands of tasks (one project can contain 15000-20000 tasks), and imagine that only one person can make modifications at a time - that is a horrible workload. So they came up with the idea to store the modifications in a database (which is a copy of Project Server's) so that everybody related to the project can access this data (which is an illusion, since they do not access Project Server, only the copy of the Project Server db), and upload them all together, let's say, every 30 minutes. – Newbie1001 Jan 14 '21 at 09:14
  • So the problem is really complicated; that is why I did not bring it all up in the first place. :) – Newbie1001 Jan 14 '21 at 09:15
  • @Newbie then you have to learn how PS works. You can't just start writing code without understanding what's involved. What if you can access SharePoint through its own mobile app? (You can.) Or access PWA through a mobile browser? That's also available. You want to create an app that performs certain operations only? You don't need a cloned database then; you can read the data and make the changes you want through PS's web services – Panagiotis Kanavos Jan 14 '21 at 09:17
  • `the problem is really complicated` it was made more complicated by trying to bypass Project Server instead of using the existing integration methods. What you want to do is nothing new; thousands of companies already had to do similar things, so the functionality was added to PS. Even if you really need to write a program to copy data (you don't - you *can* access some of the databases), all you need is SqlBulkCopy. The duplicate shows how to do that – Panagiotis Kanavos Jan 14 '21 at 09:20
  • Read the duplicate question. About 10 lines do what you tried to do, and they actually do it *faster*. SqlBulkCopy will send the data in a continuous stream, using the same minimally-logged mechanism used by `bcp` or `BULK INSERT`. By writing the data into a staging table and UPDATE-ing the target table, no data needs to be read from the database. A single big UPDATE between two joined tables on the server won't use any bandwidth and won't have to wait for the data to get to the server. It avoids the concurrency conflicts caused by executing multiple UPDATE operations in parallel – Panagiotis Kanavos Jan 14 '21 at 09:32
  • @PanagiotisKanavos Thanks for your help and your time. I will check it out asap. – Newbie1001 Jan 14 '21 at 10:03
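
A minimal sketch of the DataReader → SqlBulkCopy → staging-table approach suggested in the comments above, assuming SQL Server as the target, a `Reports_Staging` table that mirrors `Reports`, and FastMember's `ObjectReader` to expose the in-memory list as a `DbDataReader`. Every table, column and method name here is an assumption for illustration, not the poster's actual schema:

    // Sketch only - Reports_Staging, the column list and the connection string are assumed.
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using FastMember; // https://github.com/mgravell/fast-member

    public static class ReportsBulkSync
    {
        public static void Synchronize(List<Reports> reports, string connectionString, int statusid)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                // 0. Start from an empty staging table (assumed to mirror Reports' columns).
                new SqlCommand("TRUNCATE TABLE Reports_Staging;", connection).ExecuteNonQuery();

                // 1. Stream every source row to the server in one continuous bulk stream.
                using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "Reports_Staging" })
                using (var reader = ObjectReader.Create(reports, "AssignmentGUID", "StatusId"))
                {
                    bulk.ColumnMappings.Add("AssignmentGUID", "AssignmentGUID");
                    bulk.ColumnMappings.Add("StatusId", "StatusId");
                    bulk.WriteToServer(reader);
                }

                // 2. One set-based statement performs every update and insert on the server:
                //    no per-row round trips, no change tracker, no concurrency conflicts.
                var sync = new SqlCommand(@"
                    UPDATE t SET t.StatusId = s.StatusId   -- plus the other synchronized fields
                    FROM Reports t
                    JOIN Reports_Staging s ON s.AssignmentGUID = t.AssignmentGUID
                    WHERE t.StatusId = @statusid;          -- only rows whose status is unchanged

                    INSERT INTO Reports (AssignmentGUID, StatusId)
                    SELECT s.AssignmentGUID, s.StatusId
                    FROM Reports_Staging s
                    WHERE NOT EXISTS (SELECT 1 FROM Reports t
                                      WHERE t.AssignmentGUID = s.AssignmentGUID);", connection);
                sync.Parameters.AddWithValue("@statusid", statusid);
                sync.ExecuteNonQuery();
            }
        }
    }

With this shape the ORM, the repository and `Parallel.ForEach` disappear entirely; the only client-side cost is streaming the rows to the server once.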

1 Answer


Please bear in mind it is hard to give a complete answer, as there are things I need clarity on … but the comments should help with building a picture.

Parallel.ForEach(reports, r => // Parallel.ForEach is not the answer..
{
    // reportdataindatabase is built before the loop, so ok here
    // do you really want FirstOrDefault vs SingleOrDefault?
    var guid = reportdataindatabase.FirstOrDefault(x => x.AssignmentGUID == r.AssignmentGUID);

    if (guid == null)
    {
        // this is done on the context, not the DB; unresolved (not yet executed)
        unitOfWorkTAFeedBack.ReportsRepository.Add(r); // an insert on the repository
        //insertedcount++; you would need a lock or Interlocked here (see the sketch below)
    }
    else
    {
        if (guid.StatusId == statusid)
        {
            r.ReportsID = guid.ReportsID;
            // this is done on the context, not the DB; unresolved (not yet executed)
            unitOfWorkTAFeedBack.ReportsRepository.Update(r); // update on the repo
            //updatedcount++; you would need a lock or Interlocked here
        }
        else
        {
            //ignoredcount++; you would need a lock or Interlocked here
        }
    }
});

The issue here: reportdataindatabase can contain the same key twice, and the context is only updated after the fact, i.e. when it gets here:

unitOfWorkTAFeedBack.Commit();

Update may have been called twice for the same entity, and the Commit above is where the work actually happens... doing the add/update above in parallel won't save you any real time, as that part is quick.
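
If you do keep a parallel loop, keep it for the in-memory matching only, as one of the comments suggests: collect the work in thread-safe collections, count with `Interlocked`, and only touch the context from a single thread afterwards. A minimal sketch under that assumption - it does not make DbContext thread-safe, it just keeps every context call out of the parallel section:

    // requires: using System.Collections.Concurrent; using System.Linq;
    //           using System.Threading; using System.Threading.Tasks;
    var toInsert = new ConcurrentBag<Reports>();
    var toUpdate = new ConcurrentBag<Reports>();
    int ignoredcount = 0;

    Parallel.ForEach(reports, r =>
    {
        // pure in-memory matching - safe to run in parallel
        var match = reportdataindatabase.FirstOrDefault(x => x.AssignmentGUID == r.AssignmentGUID);
        if (match == null)
        {
            toInsert.Add(r);
        }
        else if (match.StatusId == statusid)
        {
            r.ReportsID = match.ReportsID;
            toUpdate.Add(r);
        }
        else
        {
            Interlocked.Increment(ref ignoredcount); // thread-safe counter
        }
    });

    // single-threaded from here on - DbContext is not thread-safe
    foreach (var r in toInsert) unitOfWorkTAFeedBack.ReportsRepository.Add(r);    // insertedcount = toInsert.Count
    foreach (var r in toUpdate) unitOfWorkTAFeedBack.ReportsRepository.Update(r); // updatedcount = toUpdate.Count
    unitOfWorkTAFeedBack.Commit();

Whether this helps at all depends on where the 80 seconds actually go; if most of the time is spent inside Commit/SaveChanges, the set-based approach from the comments will beat any amount of parallelism.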

// takes 80 seconds to update 1287 records... does seem long...
// List<Reports> reportdataindatabase = unitOfWorkTAFeedBack.ReportsRepository.List().ToList();

// PS: add how reports are retrieved.. you want something like:

TAFeedBackContext db = new TAFeedBackContext();
var remoteReports = DatafromAnotherPlace; // include how this was retrieved
var localReports = db.Reports.ToList(); // these are tracked.. (by default)
foreach (var item in remoteReports)
{
    // I assume more than one match is invalid.
    var localEntity = localReports.SingleOrDefault(x => x.AssignmentGUID == item.AssignmentGUID);
    if (localEntity == null)
    {
        // add, as it doesn't exist
        db.Reports.Add(new Report() { /* set fields */ });
    }
    else
    {
        if (localEntity.StatusId == statusid) // only update if status is the passed-in status.
        {
            // why are you modifying the remote entity?
            item.ReportsID = localEntity.ReportsID;

            // update remote entity? I get the impression it's from a different context;
            // if not then cool, but you need to show how reports is retrieved
        }
        else
        {

        }
    }
}

db.SaveChanges();
  • The remote reports data comes from an OData API. Microsoft Project Server 2016 uses a SharePoint-based approach and stores its project data in its own database. However, I have been given network credentials to this API and I have to work with it. Getting the data is easy; the only problem is that this API is set to my own language (Hungarian - the data fields are mapped to Hungarian), therefore I have to map it back to English. That is fine; I do the mapping and the data retrieval fast. – Newbie1001 Jan 13 '21 at 14:41
  • I need to set r.ReportsId in the remote. Without it, it won't work. The reason is, in my database (the second, the one "in the context") the key field is ReportsID instead of AssignmentGUID, because in Project Server (from which I get the data) the key field is AssignmentGUID and ReportsID does not exist. That is why I search by AssignmentGUID, then determine the ReportsID of the object and set it on the original which matches the AssignmentGUID. – Newbie1001 Jan 13 '21 at 15:04
  • @Newbie1001 the part which is confusing is how you are able to talk to the DB where Report comes from on the same context as "reportdataindatabase"; somehow you are mixing access... i.e. in your example you loop over reports, but this entity type comes from the OData API, and then you bind to a local context (unitOfWorkTAFeedBack). Why use the OData API if you can access it through a dbcontext then? – Seabizkit Jan 14 '21 at 08:07
  • @Newbie1001 update the question and make it *crystal clear* that the data comes from Project Server, not a database, and that you use OData. The question is meaningless in its current form. It's not even about EF - you're using OData, not Entity Framework at all – Panagiotis Kanavos Jan 14 '21 at 08:13
  • No matter what, `Parallel.ForEach` is a bug in itself here and *slows things down*. It's not even needed: OData allows using multiple IDs in a query [see the sketch at the end of the thread]. Even if you execute 100 SELECTs in parallel, Project Server will still have to execute 100 SELECTs against its database. DbContext is *NOT* thread-safe, for any operation - even for in-memory modifications. And putting a "repository" over DbContext that only renames EF's own methods is wasteful at best. – Panagiotis Kanavos Jan 14 '21 at 08:22
  • @Seabizkit now that the OP has explained the data comes from Project Server through OData, all of the code is pointless. A simple bulk insert is all that's needed, assuming copying is needed in the first place – Panagiotis Kanavos Jan 14 '21 at 08:41
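
To illustrate the "multiple IDs in one query" point above: instead of one OData request (or one in-memory scan) per report, a single `$filter` can cover a whole batch. A hypothetical sketch - `serverUrl`, the `Assignments` entity set and the `AssignmentId` field name are assumptions based on the discussion, not verified against this Project Server instance:

    // Hypothetical: build one $filter that matches any of the ids in a single round trip.
    var ids = reports.Select(r => r.AssignmentGUID).ToList();

    // OData v3 guid literal syntax; chain the clauses with `or`
    string filter = string.Join(" or ", ids.Select(id => $"AssignmentId eq guid'{id}'"));
    string url = $"{serverUrl}/_api/ProjectData/Assignments?$filter=" + Uri.EscapeDataString(filter);

    // Very large batches may exceed URL length limits - split the ids into chunks if needed.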