4

My specific requirement is that I have an IEnumerable<IEnumerable<string>>, and I want to "take" all items in the outer enumeration except any "empty" trailing items, where "empty" means all strings are null/empty or the inner enumeration is empty. Note that I want to keep any empty items that appear before the last non-empty one. For example:

Item 1: a, b, c
Item 2: (nothing)
Item 3: a, f, g, h
Item 4: (nothing)
Item 5: (nothing)

I would like to keep items 1–3 but trim items 4 and 5.

In a more general sense, I have an enumeration of items, where I want to trim any trailing items that satisfy a condition, that appear behind the last item that does not satisfy the condition.

For the sake of choosing an appropriate solution, I might add that the outer enumeration will usually contain a few hundred up to a few hundred thousand items, while the inner enumerations contain only a few items each. There will probably be only a couple of empty items that I need to trim.

My current solution puts all outer items in a list (after transforming them with .Select(...)), and then in a loop keep removing the last item if it's empty, until a non-empty item is found.

Kjell Rilbe
  • 1,331
  • 14
  • 39
  • Can you share your current solution? Also can you add an example on [.netfiddle](https://dotnetfiddle.net/)? – aloisdg Mar 30 '17 at 16:42
  • 1
    The typical solution for this type of issue is to do something like `Reverse().SkipWhile(x => !x.Any()).Reverse()` – juharr Mar 30 '17 at 16:45
  • Proper solution is to use actual types instead of embedded IEnumerables, which could perform better than O(n) if done correctly. In this case, probably the most efficient would be to Reverse() and iterate, tracking the count, until you find a non-empty, then immediately break, then Take() the length - the count. Probably not much more efficient than your current solution, tho. –  Mar 30 '17 at 16:45
  • Not the most efficient solution: you can reverse your collection/enumerable, skip while elements are empty, then reverse it again. – Anatoliy Pidlubnyy Mar 30 '17 at 16:45
  • It can be done with `Reverse().SkipWhile().Reverse()`, but it would be quite inefficient. You'd better stay with your approach or create custom iterator method, rather than standard LINQ. – Ivan Stoev Mar 30 '17 at 16:46
  • 2
    @Will, so `s.Take(s.Reverse().TakeWhile(e => !e.Any()).Count())`. Would be interesting to see how that performs compared to the reverse, skip while, reverse approach. – juharr Mar 30 '17 at 16:49
  • @juharr Yeah, that's better for sure. Still the `Reverse` has to buffer the whole sequence. – Ivan Stoev Mar 30 '17 at 16:56
  • @juharr I haven't looked at how Reverse is implemented, but I suspect it's pretty well optimized. Doing a double reverse with a TakeWhile in between, that seems like you'd lose any possible optimizations. –  Mar 30 '17 at 17:19

2 Answers2

4

There is no standard efficient LINQ solution. I would go with a custom extension "LINQ like" method like this:

public static class EnumerableExtensions
{
    public static IEnumerable<T> SkipLastWhile<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var skipBuffer = new List<T>();
        foreach (var item in source)
        {
            if (predicate(item))
                skipBuffer.Add(item);
            else
            {
                if (skipBuffer.Count > 0)
                {
                    foreach (var skipped in skipBuffer)
                        yield return skipped;
                    skipBuffer.Clear();
                }
                yield return item;
            }
        }
    }
}

It requires additional space for buffering the longest item sequence satisfying the skip predicate while the LINQ Reverse method has to buffer the whole input sequence.

The usage will be:

var result = input.SkipLastWhile(e => !e.Any());
Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
  • Oddly enough, this performs even worse than the double-reverse solution: http://share.linqpad.net/xvutib.linq – StriplingWarrior Mar 30 '17 at 17:21
  • @StriplingWarrior I can't believe that. Also I can't see your test, if you are using `List` as input, the `Reverse` is probably optimized to do a reverse `for` loop with `yiled return`. Try with real enumerable, like `Enumerable.Range`. – Ivan Stoev Mar 30 '17 at 17:25
  • I actually thought the same thing, so I created an enumerable based on Enumerable.Range().SelectMany(). The test is meant to run in LINQPad. – StriplingWarrior Mar 30 '17 at 18:36
  • 1
    @StriplingWarrior I did a real test with exe compiled in Release mode, with code warm up, Prefer 32 bit off etc. The above beats both LINQ variants up to 2 times and of course uses much less memory. Which means the LINQ is not so bad :) Still, from code reusability standpoint I would go with custom extension method and implement it either with LINQ or custom, and optimize when needed. – Ivan Stoev Mar 30 '17 at 18:40
  • Yeah, it's likely that by running in LINQPad instead of an exe I'm getting skewed results. – StriplingWarrior Mar 30 '17 at 18:54
  • A [dotnetfiddle](https://dotnetfiddle.net/XUlUS4) shows similar results, though. Would you mind sharing your benchmark code? – StriplingWarrior Mar 30 '17 at 19:09
  • @StriplingWarrior Here is my (quite quick and dirty) code https://dotnetfiddle.net/FkUp3j. When running from exe with the build setting mentioned, on my machine it shows `A:00:00:00.0344138 B:00:00:00.0612396 C:00:00:00.0478791`. You can notice that I've chosen a sequence which is averagely bad for the above algorithm. If I use `var data = Enumerable.Range(1, 1000000).Select(_ => Enumerable.Repeat("A", 10)) .Concat(Enumerable.Repeat(Enumerable.Empty(), 5));` (the best case), the times are `A:00:00:00.0606338 B:00:00:00.3524680 C:00:00:00.2097964` – Ivan Stoev Mar 30 '17 at 19:27
  • 1
    That's interesting, because in the fiddle you posted it takes longer. There must be some interesting optimizations when you compile in Release mode and run locally. On a totally different topic, dotnetfiddle needs to apply some anti-profanity filters to their hashing algorithm. :-) – StriplingWarrior Mar 30 '17 at 21:56
  • 1
    @IvanStoev: Thanks! I like this approach because it's reusable and should perform well in most situations. The function I need this in can receive its data from "anywhere", so I can't assume the underlying collection is a `List<...>`, which should make a solution that avoids reverse more likely to be generically efficient. Interesting that performance test results are so different depending on where/how the tests are executed. – Kjell Rilbe Mar 31 '17 at 04:28
3

How about this?

var trimmedItems = items.Reverse().SkipWhile(e => !e.Any()).Reverse();

If you have very large sets of data, this will require more memory than some other solutions you could come up with, but it's pretty easy to read and follow.

juharr's suggestion is only slightly more complicated, and performs much better if you have large numbers of items:

var trimmedItems = items.Take(items.Reverse().TakeWhile(e => !e.Any()).Count());

Here's the benchmarking code I'm using. It's meant to run in LINQPad, but you can change the result.Dump(); call to output the results to the console or something if you prefer. Also, I'm using an IEnumerable<string> instead of IEnumerable<IEnumerable<string>> just for simplicity, but that shouldn't impact the performance of the algorithm:

/* This is a benchmarking template I use in LINQPad when I want to do a
 * quick performance test. Just give it a couple of actions to test and
 * it will give you a pretty good idea of how long they take compared
 * to one another. It's not perfect: You can expect a 3% error margin
 * under ideal circumstances. But if you're not going to improve
 * performance by more than 3%, you probably don't care anyway.*/
void Main()
{
    // Enter setup code here
    var items = new[] { "a, b, c",
    "",
    "a, f, g, h",
    "",
    ""}.AsEnumerable();
    var manyitems = Enumerable.Range(1, 10000).SelectMany(i => items);

    var actions = new[]
    {
        new TimedAction("Control", () =>
        {
            // ToList() is the one thing that all of these have to do.
            manyitems.ToList();
        }),
        new TimedAction("Reverse().SkipWhile().Reverse()", () =>
        {
            manyitems.Reverse().SkipWhile(e => !e.Any()).Reverse().ToList();
        }),
        new TimedAction("Take(Reverse().TakeWhile().Count())", () =>
        {
            manyitems.Take(manyitems.Reverse().TakeWhile(e => !e.Any()).Count()).ToList();
        }),
        new TimedAction("SkipLastWhile", () =>
        {
            manyitems.SkipLastWhile(e => !e.Any()).ToList();
        }),
        // Add tests as desired
    };
    const int TimesToRun = 100; // Tweak this as necessary
    TimeActions(TimesToRun, actions);
}

public static class EnumerableExtensions
{
    public static IEnumerable<T> SkipLastWhile<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var skipBuffer = new List<T>();
        foreach (var item in source)
        {
            if (predicate(item))
                skipBuffer.Add(item);
            else
            {
                foreach (var skipped in skipBuffer)
                    yield return skipped;
                skipBuffer.Clear();
                yield return item;
            }
        }
    }
}

#region timer helper methods
// Define other methods and classes here
public void TimeActions(int iterations, params TimedAction[] actions)
{
    Stopwatch s = new Stopwatch();
    int length = actions.Length;
    var results = new ActionResult[actions.Length];
    // Perform the actions in their initial order.
    for (int i = 0; i < length; i++)
    {
        var action = actions[i];
        var result = results[i] = new ActionResult { Message = action.Message };
        // Do a dry run to get things ramped up/cached
        result.DryRun1 = s.Time(action.Action, 10);
        result.FullRun1 = s.Time(action.Action, iterations);
    }
    // Perform the actions in reverse order.
    for (int i = length - 1; i >= 0; i--)
    {
        var action = actions[i];
        var result = results[i];
        // Do a dry run to get things ramped up/cached
        result.DryRun2 = s.Time(action.Action, 10);
        result.FullRun2 = s.Time(action.Action, iterations);
    }
    results.Dump();
}

public class ActionResult
{
    public string Message { get; set; }
    public double DryRun1 { get; set; }
    public double DryRun2 { get; set; }
    public double FullRun1 { get; set; }
    public double FullRun2 { get; set; }
}

public class TimedAction
{
    public TimedAction(string message, Action action)
    {
        Message = message;
        Action = action;
    }
    public string Message { get; private set; }
    public Action Action { get; private set; }
}

public static class StopwatchExtensions
{
    public static double Time(this Stopwatch sw, Action action, int iterations)
    {
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds;
    }
}
#endregion

Results:

Benchmark Results

The benchmark results are more profound if your IEnumerable is backed by a List, because then LINQ can make some additional optimizations on Reverse():

var manyitems = Enumerable.Range(1, 10000).SelectMany(i => items).ToList().AsEnumerable();

Benchmark Results with List-backed IEnumerable

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
  • You should include your performance code and findings in the answer. – juharr Mar 30 '17 at 18:20
  • @juharr: Good idea. How's this? – StriplingWarrior Mar 30 '17 at 18:48
  • 1
    @StriplingWarrior: Thanks! I like this approach, with the `.Count()`, because it's short and easy to read/understand, but chose @IvanStoev's solution as my answer because of the factors I commented there. All in all I'm overwhelmed with the quick and complete response I got to this question. Many many thanks to you all! – Kjell Rilbe Mar 31 '17 at 04:30