
I have a background service in C# which ingests 3,600 XML files (5 GB in total) into a SQL Server database. The ingestion takes around 16 hours to complete. I use Hangfire to create 3 jobs/threads, and each job ingests one folder: Folder A, B and C.

The problem is that Folder C is much heavier than the others. My idea was to split the files in Folder C into two folders, C1 and C2, so I now have 4 jobs/threads: Folder A, B, C1 and C2. But the C1 and C2 jobs hit the database errors below, which I believe is because they are both accessing the same table.

An exception occurred in the database while saving changes for context type 'xxxContext'. System.InvalidOperationException: A second operation started on this context before a previous operation completed. This is usually caused by different threads using the same instance of DbContext

and another time with this error:

An exception occurred in the database while saving changes for context type 'xxxContext'. System.InvalidOperationException: Collection was modified; enumeration operation may not execute.

and the error from Hangfire is below:

Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:IIngestService.IngestPersonXML' resource.

When I use Parallel.ForEach, I also get this error:

System.InvalidOperationException: 'Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.'

I only need to insert into the database. No update or delete operations are needed. Is there any workaround for this?

Steve
  • Each thread should have its own DbContext, and also recreate the DbContext on every file (see the first sketch after these comments). For speed, you really have to use third-party extensions, for example https://github.com/linq2db/linq2db.EntityFrameworkCore - `context.BulkCopy(aLotOfitems)` – Svyatoslav Danyliv Jan 27 '21 at 13:30
  • @SvyatoslavDanyliv quite the opposite - neither multiple threads nor extensions that essentially treat objects as rows. That's what `context.BulkCopy` does - it converts the objects into a DataReader and feeds them to `SqlBulkCopy`. Using an ORM in this case offers nothing at all, and adds significant overhead. It would be a lot faster to just send data to SqlBulkCopy directly – Panagiotis Kanavos Jan 27 '21 at 21:29
  • @PanagiotisKanavos, that is exactly what it does: it sends data to SqlBulkCopy directly, if the BulkCopy options are set correctly. For other databases, which do not support such a low-level bulk insert, it will use another strategy. – Svyatoslav Danyliv Jan 27 '21 at 21:53
  • There is a detailed explanation of what is under the hood: https://linq2db.github.io/articles/sql/Bulk-Copy.html – Svyatoslav Danyliv Jan 27 '21 at 21:59
  • @SvyatoslavDanyliv why go through EF at all then? *Nothing* is gained by involving an ORM. All that SqlBulkCopy needs is an IDataReader, something that can be created out of any collection. Nothing is gained by copying the objects deserialized from XML into a DbContext, only to pull them out again and feed them to SqlBulkCopy. Why not avoid the duplicate objects and the overhead? – Panagiotis Kanavos Jan 27 '21 at 22:11
  • @SvyatoslavDanyliv ETL is one of the areas where ORMs are a hindrance, not a help. There are no objects and business logic involved there - the domain objects are Record, Cell, Source, Target, Transformation. The best strategy is to create a pipeline of steps that process a stream of records without caching everything in memory – Panagiotis Kanavos Jan 27 '21 at 22:15
  • @PanagiotisKanavos, I don't know, it is modern, popular, well documented ;) I have never used EF because of our library, and I never make compromises. Queries and everything else should be as fast as possible. – Svyatoslav Danyliv Jan 27 '21 at 22:16
  • @PanagiotisKanavos, actually with `linq2db`, doing ETL is like playing with LEGO. – Svyatoslav Danyliv Jan 27 '21 at 22:19
  • @Steve how do you deserialize the XML files? How complex is the schema? Which database are you targeting? EF Core and any ORM are unsuitable in this case, so none of the strategies you considered is going to help. Most databases have a way to import XML data. Many databases have ways to bulk-import data remotely - SQL Server, MySQL and PostgreSQL .NET providers offer this functionality already – Panagiotis Kanavos Jan 27 '21 at 22:19
  • @PanagiotisKanavos, I use the `XmlSerializer` class, which has a `Deserialize` method. The schema is deeply nested. I wrote some for loops to "denormalize" it for the database table. I am using a SQL Server 2019 database. – Steve Jan 27 '21 at 22:22
  • @Steve SQL Server can already import XML files through `OPENROWSET` [as shown here](https://www.mssqltips.com/sqlservertip/5707/simple-way-to-import-xml-data-into-sql-server-with-tsql/). That's probably the easiest and fastest option. If the objects are already in memory, you can use [FastMember's ObjectReader](https://github.com/mgravell/fast-member#ever-needed-an-idatareader) to create an IDataReader wrapper you can feed to SqlBulkCopy directly. – Panagiotis Kanavos Jan 27 '21 at 22:29
  • @Steve you can also use XDocument instead of XmlSerializer to parse each document and transform it using LINQ operators. To speed things up, *after* you remove the ORM and work with SqlBulkCopy directly, don't rush to use multiple concurrent connections. SqlBulkCopy sends the data as fast as possible, especially if you tell it to use a table lock, and uses minimal logging while inserting the data. What you can do is execute the parsing, transformation and bulk import as different steps in a pipeline, using different threads (see the second sketch after these comments). This way you could be loading one file, parsing another and inserting a third. – Panagiotis Kanavos Jan 27 '21 at 22:31
  • @PanagiotisKanavos, do not forget that the XML file has to be visible to SQL Server ;) Never mind, keep finding solutions again and again, spending time, creating DataReaders - in that time the Python scripters have already imported everything. – Svyatoslav Danyliv Jan 27 '21 at 22:35
  • @PanagiotisKanavos, the reason I use `XmlSerializer` is because I use the option `Paste Special` > `Paste XML as Classes`, which auto-generates all the models for that XML. After deserializing it with `XmlSerializer`, I get a collection to use later in my for loops to denormalize it. Maybe if I could insert everything using `OPENROWSET`, I could denormalize it afterwards. I think that approach seems easier? – Steve Jan 27 '21 at 22:37
  • @PanagiotisKanavos, okay I will try out that way you suggested. Thanks a lot. – Steve Jan 27 '21 at 22:41
  • @Steve, nice, you have chosen the hardest way. Enjoy it for several days ;) – Svyatoslav Danyliv Jan 27 '21 at 22:42
  • @SvyatoslavDanyliv, I will check out your library also. Thanks to you too. – Steve Jan 27 '21 at 22:44
  • @Steve, do that first. It is much easier than playing with a database-specific API. You have an enumeration of objects - just put it into `BulkCopy`. – Svyatoslav Danyliv Jan 27 '21 at 22:48
  • @SvyatoslavDanyliv, I will try that out first. Yeah, we might change to another database in the future :D – Steve Jan 27 '21 at 22:50
  • @SvyatoslavDanyliv, I tried to use `BulkCopy`. It executes with no error, but the data does not show up in the database. This is my new [post](https://stackoverflow.com/questions/65932701/linq2db-bulkcopyasync-without-error-but-no-data-inserted-in-the-database) – Steve Jan 28 '21 at 07:30
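
Following the per-file DbContext advice above, here is a minimal sketch of what each Hangfire job could look like, assuming an EF Core `IDbContextFactory<XxxContext>` is registered (e.g. via `services.AddDbContextFactory<XxxContext>(...)`); `XxxContext` stands in for the real context from the error message, and `Person` and `ParseFile` are made-up placeholders for the flattened rows and the existing XmlSerializer parsing:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class IngestService
{
    // Assumes services.AddDbContextFactory<XxxContext>(...) at startup.
    private readonly IDbContextFactory<XxxContext> _contextFactory;

    public IngestService(IDbContextFactory<XxxContext> contextFactory)
        => _contextFactory = contextFactory;

    // One Hangfire job per folder (A, B, C1, C2) can call this safely,
    // because no DbContext instance is ever shared between jobs/threads.
    public async Task IngestPersonXML(string folder)
    {
        foreach (var file in Directory.EnumerateFiles(folder, "*.xml"))
        {
            // Fresh, short-lived context per file.
            await using var context = _contextFactory.CreateDbContext();

            var persons = ParseFile(file);
            context.Set<Person>().AddRange(persons);
            await context.SaveChangesAsync();
        }
    }

    // Hypothetical placeholder for the existing XmlSerializer-based parsing/flattening.
    private IEnumerable<Person> ParseFile(string path) => throw new NotImplementedException();
}

// Hypothetical flattened row; the real classes come from "Paste XML as Classes".
public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}
```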
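
And a minimal sketch of the pipeline idea from the comments - one task parses files while another inserts the previously parsed batch - reusing the hypothetical `Person` type above; `ParseFile` and `BulkInsert` are placeholders (the insert step could be the SqlBulkCopy sketch under the answer below):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class IngestPipeline
{
    public static async Task RunAsync(string folder, string connectionString)
    {
        // Bounded channel gives backpressure, so parsing can't run far ahead of inserting.
        var batches = Channel.CreateBounded<List<Person>>(capacity: 4);

        // Stage 1: parse/flatten one file at a time and hand the rows over.
        var parsing = Task.Run(async () =>
        {
            foreach (var file in Directory.EnumerateFiles(folder, "*.xml"))
                await batches.Writer.WriteAsync(ParseFile(file));
            batches.Writer.Complete();
        });

        // Stage 2: bulk-insert each batch while the next file is being parsed.
        var inserting = Task.Run(async () =>
        {
            await foreach (var rows in batches.Reader.ReadAllAsync())
                BulkInsert(connectionString, rows);
        });

        await Task.WhenAll(parsing, inserting);
    }

    // Hypothetical stages: XmlSerializer-based parsing, and a bulk insert
    // (e.g. SqlBulkCopy, as sketched under the answer below).
    private static List<Person> ParseFile(string path) => throw new NotImplementedException();
    private static void BulkInsert(string connectionString, List<Person> rows) => throw new NotImplementedException();
}
```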

1 Answer


EF is not meant for this sort of operation. Use SqlBulkCopy for that.
There are libraries that provide it seamlessly for EF, but you can write your own implementation as well (a rough sketch is below) - it's not that complex.
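
For illustration, such a hand-rolled implementation could wrap the already-deserialized objects in FastMember's `ObjectReader` (suggested in the comments above) and feed them to `SqlBulkCopy`; the target table and column names are assumptions, and `Person` is the hypothetical type from the earlier sketch:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;   // or Microsoft.Data.SqlClient
using FastMember;

public static class BulkInserter
{
    // Streams the collection straight into SQL Server, no ORM in the middle.
    public static void BulkInsert(string connectionString, IEnumerable<Person> rows)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // A table lock helps SqlBulkCopy insert with minimal logging (as mentioned in the comments).
        using var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null)
        {
            DestinationTableName = "dbo.Person",   // assumed target table
            BatchSize = 5000
        };

        // ObjectReader exposes the collection as an IDataReader; the listed members
        // must line up with the destination columns (names are illustrative).
        using var reader = ObjectReader.Create(rows, "Id", "FirstName", "LastName");
        bulkCopy.WriteToServer(reader);
    }
}
```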

I really don't get this part:

I only need to insert into db. Do update or delete operation needed. Is there any workaround for this?

So do you need to update or not? Well, if you need to update a bunch of rows, insert them with bulk copy into a temp table and then just do a join update (sketched below).
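
A rough sketch of that staging pattern, again with made-up table/column names and reusing the hypothetical `Person` type and FastMember `ObjectReader` from the sketches above:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;   // or Microsoft.Data.SqlClient
using FastMember;

public static class BulkUpdater
{
    public static void BulkUpdate(string connectionString, IEnumerable<Person> rows)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // 1. Stage the incoming rows in a temp table (lives as long as this connection).
        using (var create = new SqlCommand(
            "CREATE TABLE #PersonStage (Id int, FirstName nvarchar(100), LastName nvarchar(100));",
            connection))
        {
            create.ExecuteNonQuery();
        }

        using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "#PersonStage" })
        using (var reader = ObjectReader.Create(rows, "Id", "FirstName", "LastName"))
        {
            bulkCopy.WriteToServer(reader);
        }

        // 2. One set-based join update instead of row-by-row updates.
        using var update = new SqlCommand(@"
            UPDATE p
            SET    p.FirstName = s.FirstName,
                   p.LastName  = s.LastName
            FROM   dbo.Person p
            JOIN   #PersonStage s ON s.Id = p.Id;", connection);
        update.ExecuteNonQuery();
    }
}
```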

maxlego