
I have been tasked with matching 1.7 million records against some results which have been passed to me in a CSV file.

A little bit of background to the code below: I have two lists.

Certs, which contains 5 properties, with ID being the equivalent of a PK.

Orders, which contains a list of IDs which should be contained in the Certs list.

I need to match the two and do something with those Cert objects which are found.

            foreach (Classes.CertOrder.IDS OrderUnitID in Order.AllIDs)
            {
                var Cert = (from C in Certs where C.ID.ToUpper() == OrderUnitID.ID.ToUpper() select C).FirstOrDefault();
                if (Cert != null)
                {
                    Output.Add(Cert);
                    OrderUnitID.fulfilled = true;
                }
            }

This code works, but it's super slow (to be expected, I guess, with the number of records). Is there any way I can speed this up?

Edit to add: I would love to be able to load the data into a SQL server to run the queries; however, the data is such that it is not allowed to leave the workstation on which the file is being processed, or even to touch the disk in unencrypted form.

In combination with the helpful answer below, I have changed my output to be list-based and pre-sorted both lists by ID, and now the processing takes seconds rather than hours! Thanks, Stack Overflow!
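(For anyone curious, the pre-sorted approach mentioned in the edit can be sketched as a classic merge join over two ID-sorted lists. The `Cert`/`OrderId` types below are simplified stand-ins, not the asker's actual classes:)

```csharp
using System;
using System.Collections.Generic;

class Cert { public string ID; }
class OrderId { public string ID; public bool fulfilled; }

static class MergeMatch
{
    // Sort both lists once, then walk them in lockstep:
    // O(n log n + m log m) instead of the O(n * m) nested scan
    // in the original loop.
    public static List<Cert> Match(List<Cert> certs, List<OrderId> orderIds)
    {
        Comparison<string> cmp = (a, b) =>
            string.Compare(a, b, StringComparison.OrdinalIgnoreCase);
        certs.Sort((a, b) => cmp(a.ID, b.ID));
        orderIds.Sort((a, b) => cmp(a.ID, b.ID));

        var output = new List<Cert>();
        int i = 0, j = 0;
        while (i < certs.Count && j < orderIds.Count)
        {
            int c = cmp(certs[i].ID, orderIds[j].ID);
            if (c < 0) i++;           // cert ID too small, advance certs
            else if (c > 0) j++;      // order ID too small, advance orders
            else
            {
                output.Add(certs[i]); // matched: record and advance both
                orderIds[j].fulfilled = true;
                i++; j++;
            }
        }
        return output;
    }
}
```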

  • *Is there any way i can speed this up* — yes, many. Which one is best? Well, you're going to have to do some research. `Task.Run()` could be one option, or `Parallel.ForEach`; will this be "faster"? Maybe, you'll have to test it. Of course, if you loaded this CSV into something like SQL, it could crunch this data a lot faster. A `Dictionary` might speed up the lookup, but you'd have to weigh this against the overhead of creating the Dictionary initially. Basically this is too broad – Liam Nov 30 '18 at 11:14
  • If you expect not so many matches, then cache all Cert IDs, do a search on the IDs, and if you have a match then make a query to pull the Cert object by ID. – Renatas M. Nov 30 '18 at 11:17
  • There should only be one match in Certs per ID in Order.AllIDs; however, the Certs list contains the results of many orders – Jon Bridger Nov 30 '18 at 11:19
  • Are there duplicates in `Certs`? I mean, items with the same `ID`? – Dennis Nov 30 '18 at 11:21
  • There are no duplicates in either of the ID lists. – Jon Bridger Nov 30 '18 at 11:22
  • Fetch the Cert IDs before the loop. On a match, fetch the Cert object. You will do 2 queries; right now you are doing Order.AllIDs.Count queries. This might save processing time. – Renatas M. Nov 30 '18 at 11:26
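The two-pass idea from the comments can be sketched with a `HashSet` of the order IDs, so the big Certs list is scanned once with O(1) membership checks (the `Cert`/`OrderId` types here are illustrative stand-ins, not the asker's actual classes):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Cert { public string ID; }
class OrderId { public string ID; public bool fulfilled; }

static class HashSetMatch
{
    public static List<Cert> Match(List<Cert> certs, List<OrderId> orderIds)
    {
        // Pass 1: cache every wanted ID, case-insensitively.
        var wanted = new HashSet<string>(
            orderIds.Select(o => o.ID), StringComparer.OrdinalIgnoreCase);

        // Pass 2: one scan over the big list, O(1) lookup per cert.
        return certs.Where(c => wanted.Contains(c.ID)).ToList();
    }
}
```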

3 Answers


Build a dictionary from Certs:

var certsMapping = Certs
    .ToDictionary(c => c.ID.ToUpper());

foreach (Classes.CertOrder.IDS OrderUnitID in Order.AllIDs)
{
    if (certsMapping.TryGetValue(OrderUnitID.ID.ToUpper(), out var cert))
    {
        Output.Add(cert);
        OrderUnitID.fulfilled = true;
    }
}
Dennis
  • Thanks Dennis, the dictionary is certainly helping a lot; I'm now up to processing thousands per minute rather than hundreds. It's still going to take a while to churn through 1.7 mil records, but at least now it's going to finish before I retire. – Jon Bridger Nov 30 '18 at 11:42
  • What list has 1.7M records? Post both lists sizes, please. – Dennis Nov 30 '18 at 11:45
  • Certs contains 1.7 mil; the AllIDs list contains between 90k and 350k IDs depending on the order object being processed. Would the additional processing of removing a Cert from the Certs list be worth the time? E.g., as a Cert is matched and processed, remove it from the list, as it won't be used or needed again. – Jon Bridger Nov 30 '18 at 11:50
  • Hmm. Dictionary lookup works *very* fast. There's no need to remove items from the dictionary - it won't perform faster. What is `Output`? Is setting `OrderUnitID.fulfilled` just setting a `bool` (without side effects)? I mean, just iterating through a `List` (if `Order.AllIDs` is a list) and a dictionary lookup shouldn't take much time. – Dennis Nov 30 '18 at 12:00
  • Sorry, Output.add was just an example line; the actual line contains security-sensitive info. It basically adds some properties of the Cert object and a master key to a string which will later be output to a CSV. Order.AllIDs is a list of IDS objects which contain the ID as a string and a fulfilled bool. – Jon Bridger Nov 30 '18 at 12:09
  • @JonBridger: well, it looks like the next optimization is to make `Output.add(cert)` faster. You can't avoid the loop over `Order.AllIDs`, and the dictionary lookup is fast; `Output.add(cert)` is what remains. – Dennis Nov 30 '18 at 12:12
  • Thanks for your help and input Dennis it’s much appreciated. – Jon Bridger Nov 30 '18 at 12:25

Why are database lookups fast? One of the reasons is indexes.

You can use i4o to create an index for in-memory lists.

And then use a parallel foreach loop to speed things up.
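A rough sketch of the parallel idea, assuming a case-insensitive dictionary has already been built as in the accepted answer (the `Cert`/`OrderId` types are illustrative stand-ins):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Cert { public string ID; }
class OrderId { public string ID; public bool fulfilled; }

static class ParallelMatch
{
    public static List<Cert> Match(List<Cert> certs, List<OrderId> orderIds)
    {
        // Build the lookup once; the comparer avoids ToUpper() allocations.
        var map = certs.ToDictionary(c => c.ID, StringComparer.OrdinalIgnoreCase);

        // ConcurrentBag because List<T>.Add is not thread-safe.
        var output = new ConcurrentBag<Cert>();
        Parallel.ForEach(orderIds, o =>
        {
            if (map.TryGetValue(o.ID, out var cert))
            {
                output.Add(cert);
                o.fulfilled = true; // each OrderId is touched by exactly one iteration
            }
        });
        return output.ToList();
    }
}
```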

Mihir Dave

Extending the accepted answer a bit with a few other options: `OrdinalIgnoreCase` gives you the best single-threaded performance, while parallelizing gives the best overall performance.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Item { public string Id { get; set; } }

class Program
{
    private static Random rng = new Random();
    private static string characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    static void Main(string[] args)
    {
        var list = Enumerable.Range(1, 2_700_000)
            .Select(x => string.Join("", Enumerable.Range(5, rng.Next(20)).Select(y => characters[rng.Next(0, characters.Length)])))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .Select(x => new {Order = rng.Next(), Item = new Item {Id = x }})
            .OrderBy(x => x.Order)
            .Select(x => x.Item)
            .ToList();

        Console.WriteLine("Master List Size: {0}", list.Count);

        var matches = list.Take(350_000).Select(x => x.Id).ToList();

        Console.WriteLine("Matches List Size: {0}", matches.Count);

        var dict = list.ToDictionary(x => x.Id, x => x, StringComparer.CurrentCultureIgnoreCase);

        var results = new List<Item>();

        var sw = new Stopwatch();

        Console.WriteLine("CurrentCultureIgnoreCase Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();

                foreach (var m in matches)
                    if (dict.TryGetValue(m, out var item))
                        results.Add(item);

                sw.Stop();

                var t = sw.ElapsedMilliseconds;

                sw.Reset();

                return t;
            }).Average());


        dict = list.ToDictionary(x => x.Id.ToUpper(), x => x);

        Console.WriteLine("ToUpper() Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();

                foreach (var m in matches)
                    if (dict.TryGetValue(m.ToUpper(), out var item))
                        results.Add(item);

                sw.Stop();

                var t = sw.ElapsedMilliseconds;

                sw.Reset();

                return t;
            }).Average());


        dict = list.ToDictionary(x => x.Id, x => x, StringComparer.OrdinalIgnoreCase);

        Console.WriteLine("OrdinalIgnoreCase Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();

                foreach (var m in matches)
                    if (dict.TryGetValue(m, out var item))
                        results.Add(item);

                sw.Stop();

                var t = sw.ElapsedMilliseconds;

                sw.Reset();

                return t;
            }).Average());
        var cDict = new ConcurrentDictionary<string, Item>(dict);
        var cResults = new ConcurrentBag<Item>();

        Console.WriteLine("Parallel Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();

                Parallel.ForEach(matches, new ParallelOptions { MaxDegreeOfParallelism = 20 }, m =>
                {
                    if (cDict.TryGetValue(m, out var item))
                        cResults.Add(item);
                });

                sw.Stop();

                var t = sw.ElapsedMilliseconds;

                sw.Reset();

                return t;
            }).Average());
    }
}

Results

 Master List Size: 2158882
 Matches List Size: 350000
 CurrentCultureIgnoreCase Elapsed Time (avg): 298.2
 ToUpper() Elapsed Time (avg): 179.6
 OrdinalIgnoreCase Elapsed Time (avg): 163.9
 Parallel Elapsed Time (avg): 74.6
GaveUp