
I'm running into this kind of problem for the first time. I'm running an application with some hosted services in a Linux Docker container (ASP.NET Core, C# 8.0). The first hosted service performs a large data ingestion into my database. To explain the service: it starts, downloads a 2 GB file from the web containing a large number of JSON documents, then deserializes each JSON document and adds it to my database one by one. It starts off well, but after about 500,000 records it slows down and never finishes the ingestion. If I run

    docker logs -f <container>

I get this message:

(screenshot of the container log output)

With `docker stats`, I can see my container using all the resources available. Thanks a lot for any help!

This is the hosted service:

    private void DoWork(object state)
    {
        using (IServiceScope scope = _services.CreateScope())
        {
            executionCount++;

            IVUManager _vuManager = scope.ServiceProvider.GetRequiredService<IVUManager>();
            _vuManager.BulkLoaderVuMassive();
        }
    }

This is the part of the code that processes the JSON files:

    foreach (string filename in filesnumber)
    {
        using (StreamReader r = new StreamReader(filename))
        {
            // Parsing JSON
            aInt++;
            string jsonFull = r.ReadToEnd();
            JsonDocument jsonDoc = JsonDocument.Parse(jsonFull);
            JsonElement jsonParsed = jsonDoc.RootElement;

            ASoVu newVU = _importStrategy.VU();

            _repository.MassiveImportNVD(_context, newVU);

            if (aInt == 9999)
            {
                _repository.Save(_context);
            }
        }
    }

This is the part of the code that performs the insert into the database:

    public class VURepository : IVURepository
    {
        private readonly ILogger _logger;

        public VURepository(ILogger<VURepository> logger)
        {
            _logger = logger;
        }

        public void MassiveImport(VMContext _context, ASoVu newVU)
        {
            _context.Vu.Add(newVU);
        }
    }
  • Can you show the code that inserts into the database, and the code that registers the database connections with the DI container? – Stefan Jan 09 '20 at 08:10
  • You didn't post any code. 500K is tiny data - databases work with *TBs* of data and billions of records. It's not the database or the container; you don't even need multiple threads to parse and import files efficiently. If I had to guess I'd say it's a combination of inefficient code and bad table design: parsing the entire file in memory instead of using streams, for example, or perhaps row-by-agonizing-row (RBAR) inserts instead of using the database's bulk import mechanism. Perhaps bad concurrency - having 100 threads trying to insert data doesn't increase speed by 100 due to collisions – Panagiotis Kanavos Jan 09 '20 at 08:12
  • BTW using a *web server* to import data is extremely inefficient too. Web servers are built to serve web requests and use threadpool threads to serve those requests. If you post a ton of files that need 10 minutes each to process, you block all the threads without actually gaining anything. That's why long-running tasks should always execute as BackgroundService instances – Panagiotis Kanavos Jan 09 '20 at 08:16
  • I have similar cases. What I do is separate uploading from importing - quickly save the files into temporary storage in the request and post a request to a [BackgroundService](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/host/hosted-services?view=aspnetcore-3.1&tabs=visual-studio)-derived class. That class uses an [ActionBlock](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.actionblock-1?view=netcore-3.1) internally to process files using a limited number of threads (quite often just 1). – Panagiotis Kanavos Jan 09 '20 at 08:20
  • Parsing/importing depends on the scenario. One way, especially for smaller files (MBs), would be to parse the data using Json.NET's deserializer and bulk import it into SQL Server using SqlBulkCopy or whatever the database product provides. For larger files, deserializing everything harms performance and JsonReader should be used to read elements in a SAX-like manner. Again, the records are consumed by SqlBulkCopy – Panagiotis Kanavos Jan 09 '20 at 08:23
  • A simpler and possibly *faster* solution would be to export the files to CSV/flat files as an intermediate step and use the database's bulk insert mechanisms (eg BULK INSERT or bcp in SQL Server) to import them. – Panagiotis Kanavos Jan 09 '20 at 08:34
  • PS: `Repository` and `MassiveImport` are contradictory terms. You don't need (and actually shouldn't even think of using) repositories in bulk import scenarios. There are no objects, entities or relations in that domain. It's rows, transformations, sources, and targets. ORMs don't use bulk import mechanisms, so their INSERT statements are already far slower. *The posted code*, though, caches the entire data set multiple times - as a full string, as a parsed JsonDocument, as individual entities. For a 2GB file, that's at least 6GB of wasted RAM – Panagiotis Kanavos Jan 09 '20 at 08:37
  • It looks like you're using System.Text.Json instead of Json.NET. That's bad and good - bad because it's not a general-purpose deserializer (it can't even handle non-UTF8 text), good because when used *properly* it avoids allocations. Properly in this case means using a [Utf8JsonReader directly](https://docs.microsoft.com/en-us/dotnet/standard/serialization/system-text-json-how-to?view=netcore-3.1#use-utf8jsonreader) to read directly from a `ReadOnlySequence` produced from the *file* stream directly. That requires complex coding though to get a sequence from a stream through pipelines – Panagiotis Kanavos Jan 09 '20 at 08:55
  • `JsonDocument` is disposable. You need to dispose of it via a `using var jsonDoc = JsonDocument.Parse(jsonFull);` statement. You don't seem to be using `jsonParsed` at all, but if you were you would need to restrict use of it to within the `using` scope of the document, or clone it. – dbc Jan 09 '20 at 09:35
  • I'm sorry guys, I'm not as experienced as you, so I didn't understand most of your tips... @dbc I'm using jsonParsed for other things, and it's already in a using... @Panagiotis Kanavos how can I optimize it? I'm really going crazy with this issue... – Rob None Jan 09 '20 at 10:30
  • Is there no way to give more resources to the application (or to the container)? I forgot to mention that this process runs only once, when the container starts, and then never again. – Rob None Jan 09 '20 at 10:32
  • @RobNone - can you share an example of the JSON, and an example of how you are processing it? Do you have a large number of files, is each file large, or both? A [mcve] would increase the chances we could help you. – dbc Jan 09 '20 at 16:38
  • Regarding: `I'm using jsonParsed for other thing, and it's already in a using` To clarify, the `JsonDocument` you get from `JsonDocument.Parse` needs to be in a using statement (and hence disposed) since that is what is `IDisposable`. Your `jsonParsed` is a `JsonElement` which doesn't need to be in a using/disposed. – ahsonkhan Jan 11 '20 at 03:50
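
To make the comments above more concrete, here is a minimal sketch of the BackgroundService + ActionBlock pattern Panagiotis Kanavos describes. `IVUManager` and `BulkLoaderVuMassive` come from the question; the `ImportWorker` class, the `Enqueue` method and the single-threaded block options are assumptions:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;
    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Hosting;

    // Hypothetical background worker: file paths are posted to the block and
    // processed one at a time, away from the request threads.
    public class ImportWorker : BackgroundService
    {
        private readonly IServiceProvider _services;
        private readonly ActionBlock<string> _queue;

        public ImportWorker(IServiceProvider services)
        {
            _services = services;
            _queue = new ActionBlock<string>(ImportFile,
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
        }

        // Called by the code that receives the upload: enqueue and return immediately.
        public void Enqueue(string path) => _queue.Post(path);

        private void ImportFile(string path)
        {
            using IServiceScope scope = _services.CreateScope();
            IVUManager manager = scope.ServiceProvider.GetRequiredService<IVUManager>();
            manager.BulkLoaderVuMassive(); // assumed to import the file at 'path'
        }

        protected override Task ExecuteAsync(CancellationToken stoppingToken)
        {
            // Nothing to poll; the ActionBlock does the work as items arrive.
            stoppingToken.Register(() => _queue.Complete());
            return Task.CompletedTask;
        }
    }

The worker would be registered as a hosted service (and as a singleton, if other classes need to call `Enqueue`), so the import runs in the background instead of blocking web requests.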
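
Along the same lines, a sketch of the "deserialize one element at a time and feed SqlBulkCopy" approach from the comments, assuming SQL Server, Json.NET, one large JSON array per file, and a hypothetical `VuDto` class and `dbo.Vu` table whose columns match:

    using System.Data;
    using System.IO;
    using Microsoft.Data.SqlClient;
    using Newtonsoft.Json;

    // Hypothetical shape of one record in the JSON file.
    public class VuDto
    {
        public string Id { get; set; }
        public string Description { get; set; }
    }

    public static class StreamingImporter
    {
        // Reads the file as a stream, deserializes one array element at a time
        // and bulk-inserts in batches, so the 2 GB file is never held in memory.
        public static void Import(string path, string connectionString)
        {
            var batch = new DataTable();
            batch.Columns.Add("Id", typeof(string));
            batch.Columns.Add("Description", typeof(string));

            using var connection = new SqlConnection(connectionString);
            connection.Open();
            using var bulk = new SqlBulkCopy(connection)
            {
                DestinationTableName = "dbo.Vu", // assumed table name
                BatchSize = 10_000
            };

            var serializer = new JsonSerializer();
            using var file = File.OpenText(path);
            using var json = new JsonTextReader(file);

            while (json.Read())
            {
                if (json.TokenType != JsonToken.StartObject)
                    continue;

                // Deserialize only the current element of the array.
                VuDto item = serializer.Deserialize<VuDto>(json);
                batch.Rows.Add(item.Id, item.Description);

                if (batch.Rows.Count >= 10_000)
                {
                    bulk.WriteToServer(batch);
                    batch.Clear();
                }
            }

            if (batch.Rows.Count > 0)
                bulk.WriteToServer(batch);
        }
    }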
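
Finally, a rough sketch of the `Utf8JsonReader` approach mentioned in the comments, reading a `ReadOnlySequence<byte>` produced from the file stream via System.IO.Pipelines. It only counts object tokens and assumes the file is UTF-8 without a BOM; a real importer would extract values inside the read loop and batch them for `SqlBulkCopy`:

    using System;
    using System.Buffers;
    using System.IO;
    using System.IO.Pipelines;
    using System.Text.Json;
    using System.Threading.Tasks;

    public static class Utf8StreamReaderSketch
    {
        public static async Task<long> CountObjectsAsync(string path)
        {
            await using FileStream stream = File.OpenRead(path);
            PipeReader pipe = PipeReader.Create(stream);
            var state = new JsonReaderState();
            long objects = 0;

            while (true)
            {
                ReadResult result = await pipe.ReadAsync();
                ReadOnlySequence<byte> buffer = result.Buffer;

                // Utf8JsonReader is a ref struct, so parsing happens in a
                // synchronous helper that reports how far it got.
                SequencePosition consumed = Process(buffer, result.IsCompleted, ref state, ref objects);
                pipe.AdvanceTo(consumed, buffer.End);

                if (result.IsCompleted)
                    break;
            }

            await pipe.CompleteAsync();
            return objects;
        }

        private static SequencePosition Process(ReadOnlySequence<byte> buffer, bool isFinalBlock,
                                                ref JsonReaderState state, ref long objects)
        {
            var reader = new Utf8JsonReader(buffer, isFinalBlock, state);
            while (reader.Read())
            {
                if (reader.TokenType == JsonTokenType.StartObject)
                    objects++;
            }

            state = reader.CurrentState;
            return reader.Position; // end of the last complete token
        }
    }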

0 Answers