3

I am writing a small program that takes a .csv file as input with about 45k rows. I am trying to compare the contents of this file with the contents of a table in a database (SQL Server through Dynamics CRM using Xrm.Sdk, if it makes a difference).

In my current program (which takes about 25 minutes to compare; the file and database are identical here, both 45k rows with no differences), I have all existing records from the database in a DataCollection&lt;Entity&gt;, which inherits from Collection&lt;T&gt; and implements IEnumerable&lt;T&gt;.

In my code below I am filtering using the Where method and then branching based on the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach than this? I am by no means a LINQ expert.

foreach (var record in inputDataLines)
{
    var fields = record.Split(',');

    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                                      && Convert.ToDecimal(r["field_2"]) == bps
                                      && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                                      && Convert.ToDecimal(r["field_4"]) == percentile
                                      && Convert.ToDecimal(r["field_5"]) == age);

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}

EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.

Ben
  • Looks like you're doing 45k * 45k comparisons, that does mount up. Still, 25 minutes seems too long. Can you confirm that existingRecords is in-memory, not needing any file (db) access? – H H Oct 17 '17 at 11:16
  • @HenkHolterman yes I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop. – Ben Oct 17 '17 at 11:22
  • In that case, @HenkHolterman, the assumption in my answer was wrong. – Ipsit Gaur Oct 17 '17 at 11:22
  • @IpsitGaur - yes, that's why I asked. – H H Oct 17 '17 at 11:25

3 Answers

5

Himbrombeere is right: you should execute the query first and put the result into a collection before you call Any, Count, AddRange, or any other method that will execute the query again. In your code it's possible that the query is executed five times in every loop iteration.

Watch out for the term deferred execution in the documentation. If a method is implemented that way, it means the method can be used to construct a LINQ query (so you can chain it with other methods, and at the end you have a query). Only methods that don't use deferred execution, like Count, Any, and ToList (or a plain foreach), will actually execute it. If you don't want the whole query to be executed every time you access it, it's better to store the result in a collection (e.g. with ToList).
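A minimal sketch of the difference, using hypothetical data and a counter inside the predicate to make the repeated execution visible:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var evaluations = 0;
        var numbers = new List<int> { 1, 2, 3, 4 };

        // Deferred: Where only builds the query, nothing runs yet.
        var query = numbers.Where(n => { evaluations++; return n % 2 == 0; });

        // Each of these enumerates the source again.
        var count = query.Count();   // runs the predicate on all 4 elements
        var any = query.Any();       // runs it again (until the first match)
        Console.WriteLine(evaluations > 4);  // True: the query ran more than once

        // Materialized: the predicate runs exactly once per element.
        evaluations = 0;
        var list = numbers.Where(n => { evaluations++; return n % 2 == 0; }).ToList();
        var c2 = list.Count;                 // no re-evaluation: Count is a property
        Console.WriteLine(evaluations);      // 4
    }
}
```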

However, you could use a different approach that should be much more efficient: a Lookup&lt;TKey, TElement&gt;, which is similar to a dictionary that maps each key to multiple values, and which can be used with an anonymous type as the key:

var lookup = existingRecords.Entities.ToLookup(r => new
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct = Convert.ToDecimal(r["field_3"]),
    // percentile and age must be parsed as int here so this anonymous
    // type matches the key type constructed in the loop below
    percentile = Convert.ToInt32(r["field_4"]),
    age = Convert.ToInt32(r["field_5"])
});

Now you can access this lookup in the loop very efficiently.

foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = lookup[new {fund, bps, withdrawalPct, percentile, age}].ToList();

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count == 1)
    {
        if (Convert.ToDecimal(matchingRows[0]["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows[0]);
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}

Note that this works even if a key does not exist in the lookup: the indexer returns an empty sequence instead of throwing.
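That missing-key behavior can be shown with a small, hypothetical example; a Dictionary&lt;TKey, TValue&gt; indexer would throw for an absent key, but a lookup just yields nothing:

```csharp
using System;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        var words = new[] { "apple", "avocado", "banana" };
        var byFirstLetter = words.ToLookup(w => w[0]);

        // Existing key: returns the matching group.
        Console.WriteLine(byFirstLetter['a'].Count());  // 2

        // Missing key: no exception, just an empty sequence.
        Console.WriteLine(byFirstLetter['z'].Count());  // 0
    }
}
```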

Tim Schmelter
  • Nitpick, but won't compile because anonymous type in `lookup` is not the same you pass to its indexer. – Evk Oct 17 '17 at 11:53
  • Using the lookup type as shown drastically decreased processing time: the posted code was taking about 25 minutes; with the lookup it's down to 20 seconds. – Ben Oct 17 '17 at 13:22
4

Add a ToList after your Where call to force an immediate execution of the query:

var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                                  && Convert.ToDecimal(r["field_2"]) == bps
                                  && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                                  && Convert.ToDecimal(r["field_4"]) == percentile
                                  && Convert.ToDecimal(r["field_5"]) == age)
                    .ToList();

The Where doesn't actually execute your query; it just prepares it, and the actual execution happens later, in a deferred way. In your case that happens when calling Count, which itself iterates the entire collection of items. If the first condition (Count() == 0) fails, the second one (Count() == 1) is checked, leading to a second iteration of the complete collection. If there is exactly one match, calling matchingRows.First() executes the query a third time.

When you force an immediate execution, the query runs only once, and the collection is iterated only once, which decreases your overall time.

MakePeaceGreatAgain
  • Yes, this will reduce executing it from originally 6 times to once per input line. Still the algorithm is O(N * M), i.e. slow. – Ivan Stoev Oct 17 '17 at 11:11
  • I did not know this... I've added this and it has drastically increased the performance from the previous 25 minutes to 9 minutes currently. Is this the best performance I can expect to achieve with this algorithm? – Ben Oct 17 '17 at 11:27
  • @Ben it _may_ be faster to fetch all rows from database into memory once and perform search there (in memory, via Dictionary or similar). – Evk Oct 17 '17 at 11:33
  • @Ben: look at the documentation of the LINQ methods on MSDN. You'll find the term _deferred_ in every method that builds a query rather than executing it. Those methods can be chained without each one executing immediately, but they have a "drawback": if you never materialize the query, it is re-executed every time you _use_ it (with methods that don't use deferred execution, like `Count`, `Any`, `ToList`, or a simple `foreach` loop). That's what happens here. – Tim Schmelter Oct 17 '17 at 11:33
  • @Ben: OP has mentioned that all records are already in memory. But that doesn't change anything because he is executing the (in-memory) query up to 5 times in every iteration of the loop. – Tim Schmelter Oct 17 '17 at 11:35
  • @TimSchmelter then putting all stuff into Dictionary (keyed by string `r["field_1"].ToString() + r["field_2"]` etc) might help. – Evk Oct 17 '17 at 11:36
0

Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).

This is a (1) clean your data, (2) query/join your data, (3) process your data approach.

Do this:

(1)

var inputs =
    inputDataLines
        .Select(record =>
        {
            var fields = record.Split(',');
            return new
            {
                fund = fields[0],
                bps = Convert.ToDecimal(fields[1]),
                withdrawalPct = Convert.ToDecimal(fields[2]),
                percentile = Convert.ToInt32(fields[3]),
                age = Convert.ToInt32(fields[4]),
                bombOutTerm = Convert.ToDecimal(fields[5]),
                record
            };
        })
        .ToArray();

var entities =
    existingRecords
        .Entities
        .Select(entity => new
        {
            fund = entity["field_1"].ToString(),
            bps = Convert.ToDecimal(entity["field_2"]),
            withdrawalPct = Convert.ToDecimal(entity["field_3"]),
            percentile = Convert.ToInt32(entity["field_4"]),
            age = Convert.ToInt32(entity["field_5"]),
            bombOutTerm = Convert.ToDecimal(entity["field_6"]),
            entity
        })
        .ToArray()
        .GroupBy(x => new
        {
            x.fund,
            x.bps,
            x.withdrawalPct,
            x.percentile,
            x.age
        }, x => new
        {
            x.bombOutTerm,
            x.entity,
        });

(2)

var query =
    from i in inputs
    join e in entities on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key into matches
    select new { input = i, matchingRows = matches.SelectMany(m => m).ToList() };

Note that this is a group join (join … into): inputs with no matching entities still appear in the result with an empty matchingRows list. A plain inner join would silently drop them, and the Count() == 0 branch below would never run.

(3)

foreach (var x in query)
{
    entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));

    if (x.matchingRows.Count() == 0)
    {
        rowsToAdd.Add(x.input.record);
    }
    else if (x.matchingRows.Count() == 1)
    {
        if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
        {
            rowsToUpdate.Add(x.input.record);
            entitiesToUpdate.Add(x.matchingRows.First().entity);
        }
    }
    else
    {
        entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
        rowsToAdd.Add(x.input.record);
    }
}

I would suspect that this will be among the fastest approaches presented.

Enigmativity