0

Consider I have a class

class Employee
{
    public string Id { get; set; }
    public string Type { get; set; }
    public string Identifier { get; set; }
    public object Resume { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime EndDate { get; set; }
}
List<Employee> employees = LoadEmployees(); //Around 2.5 million to 3 millions employees
employees = employees
                .Where(x => x.Identifier != null)
                .OrderBy(x => x.Identifier)
                .ToArray();

I have a requirement where I want to load and sort around 2.5 million employees in memory but the Linq query gets stuck on the OrderBy clause. Any pointers on this? I have created this Employee class just to simplify my problem.

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
Lizzy
  • 101
  • 7
  • What is wrong with what you have now? – Igor Mar 04 '19 at 21:54
  • My query gets stuck on var employees= employees.OrderBy(x => x.Identifier) .Where(x => x.Identifier != null) .ToArray(); – Lizzy Mar 04 '19 at 21:55
  • Also what is `LoadEmployees` and what is the return type? Can you sort them at the source (if LoadEmployees calls a database to retrieve the records for example). Also can you filter at the source and apply the `Identifier != null` condition at the source to the returned records? – Igor Mar 04 '19 at 21:55
  • 2
    Put the `.Where()` before the `.OrderBy()`. Also, change the `Employees` type to `IEnumerable` and remove as many `.ToArray()` calls as you can, especially within the `LoadEmployees()` method. – Joel Coehoorn Mar 04 '19 at 21:58
  • I am not getting the employees from a database. I am parsing it from a file – Lizzy Mar 04 '19 at 21:58
  • 1
    At the very least you could switch the `OrderBy` and the `Where`. – Igor Mar 04 '19 at 21:58
  • What do you intend to do with the resulting array? – Igor Mar 04 '19 at 21:59
  • That should be `ToList`, not `ToArray` if you change the code that way. Or you can use a new type. – Igor Mar 04 '19 at 22:03
  • It is also recommended to specify a `StringComparer`, in this case probably `OrdinalIgnoreCase`. That will make the sorting more efficient on string. – Igor Mar 04 '19 at 22:03
  • I want to apply some logic and then save it – Lizzy Mar 04 '19 at 22:04
  • In that case do not call `ToList` or `ToArray`. Instead do your logic on the returned `IEnumerable` and save it back to a/the `Stream`. That could also speed everything up. `var filteredEmployees = employees.Where(x => x.Identifier != null).OrderBy(x => x.Identifier, StringComparer.OrdinalIgnoreCase);` – Igor Mar 04 '19 at 22:05
  • Linq methods are notoriously slow. Try implementing a custom algorithm. – miara Mar 04 '19 at 22:10
  • 3
    @JeromeBaek I would really doubt this is the case... – meJustAndrew Mar 04 '19 at 22:12
  • Try sorting the list in place, using the List.Sort() method. That's an O(log N) algorithm, where I'd reckon LINQ OrderBy is an O(N log N) algorithm. – Jesse de Wit Mar 04 '19 at 22:58
  • @JessedeWit Why in the world would you expect LINQ to use an N^2 sorting algorithm? It doesn't, because there's no good reason to do something so silly. It uses a quicksort. – Servy Mar 04 '19 at 23:02
  • N log N, I meant, With LINQ, it'll have to iterate the remaining records to find the lowest for each record. While sorting in place is more efficient, because you can keep moving the next item in the list to the left until it falls in the right place. – Jesse de Wit Mar 04 '19 at 23:05
  • I take back everything I said: https://stackoverflow.com/a/1832713/3883866 – Jesse de Wit Mar 04 '19 at 23:13
  • Lizzy, you can take another look at the answer, since it was updated with the best performance enhancements that I was able to make and see if it helps you. – meJustAndrew Mar 05 '19 at 08:29
  • @JessedeWit Moving the next item to the left until it falls in place is a *very* inefficient sorting algorithm. No production code should ever use that approach. Yes, the LINQ method needs to materialize the whole collection in order to compute the first value. While something to be aware of, that does't make it's sort any less efficient than one of an already materialized collection. – Servy Mar 05 '19 at 14:26
  • Adding StringComparer.OrdinalIgnoreCase definately helped. – Lizzy Mar 05 '19 at 20:11

1 Answers1

3

I would use the .Where(x => x.Identifier != null) clause first, since it filters some data first and then do the OrderBy. Given the fact that you have only ~2.5 million records and that they are only basic types like string and DateTime, then you should not have any problems with the memory in this case.

Edit:

I have just ran your code as a sample and indeed it is a matter of seconds (like over 15 seconds on my machine which does not have a very powerful CPU, but still, it does not get stuck):

List<Employee> employees = new List<Employee>();
for(int i=0;i<2500000;i++)
{
    employees.Add(new Employee
    {
        Id = Guid.NewGuid().ToString(),
        Identifier = Guid.NewGuid().ToString(),
        Type = i.ToString(),
        StartDate = DateTime.MinValue,
        EndDate = DateTime.Now
    });
}

var newEmployees = employees
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();

As a second edit, I have just ran some tests, and it seems that an implementation using Parallel Linq can be in some cases faster with about 1.5 seconds than the serial implementation:

var newEmployees1 = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();

And these are the best numbers that I got:

7599 //serial implementation
5752 //parallel linq

But the parallel tests could variate from one machine to another so I suggest making some tests yourself and if you still find a problem about this, then maybe edit the question/post another one.

Using the hint that @Igor proposed in the comment below, the parallel implementation with StringComparer.OrdinalIgnoreCase is about three times faster than the simple parallel implementation. The final (fastest) code looks like this:

var employees = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier, StringComparer.OrdinalIgnoreCase)
    .ToArray();
meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
  • 2
    *hint* - If you do the string comparison using `StringComparer.OrdinalIgnoreCase`? (`.OrderBy(x => x.Identifier, StringComparer.OrdinalIgnoreCase)`) you will decrease the time dramatically. – Igor Mar 04 '19 at 22:25
  • @Igor it indeed runs way faster using `StringComparer.OrdinalIgnoreCase` in the `OrderBy`. Thanks, I have updated the answer! – meJustAndrew Mar 05 '19 at 08:24
  • Thanks for all the comments , I am going to incorporate all these changes – Lizzy Mar 05 '19 at 20:08