16

All string.Split overloads seem to return an array of strings (string[]).

I'm wondering whether there is a lazy variant that returns an IEnumerable<string>, so that for large strings (or an infinite-length IEnumerable<char>), when one is only interested in the first few subsequences, one saves computational effort as well as memory. It could also be useful if the string is produced by a device or program (network, terminal, pipes) and the entire string is therefore not necessarily fully available immediately, so that one can already process the first occurrences.

Is there such a method in the .NET Framework?

Willem Van Onsem

7 Answers

6

You could easily write one:

using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    public static IEnumerable<string> Split(this string toSplit, params char[] splits)
    {
        if (string.IsNullOrEmpty(toSplit))
            yield break;

        StringBuilder sb = new StringBuilder();

        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }
}

Clearly, I haven't tested it for parity with string.Split, but I believe it should work just about the same.
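One caveat worth illustrating: string already has an instance method Split(params char[]), and C# prefers an applicable instance method over an extension method of the same name, so text.Split(',') would still call the eager built-in version. Below is a minimal sketch under a different, assumed name (SplitLazy is hypothetical, not a framework method). It also always emits the final piece, even when it is empty, which is an assumption made here to stay closer to string.Split.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LazyStringExtensions
{
    // SplitLazy is an assumed name, chosen so the call does not bind to string.Split.
    public static IEnumerable<string> SplitLazy(this string toSplit, params char[] splits)
    {
        var sb = new StringBuilder();

        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                // A separator ends the current piece (which may be empty).
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        // Always emit the last piece, even when it is empty.
        yield return sb.ToString();
    }
}
```

With this in place, "a,b,c".SplitLazy(',').First() stops reading the input right after the first separator instead of splitting the whole string.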

As Servy notes, this doesn't split on strings. That's not as simple, and not as efficient, but it's basically the same pattern.

public static IEnumerable<string> Split(this string toSplit, string[] separators)
{
    if (string.IsNullOrEmpty(toSplit))
        yield break;

    StringBuilder sb = new StringBuilder();
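    foreach (var c in toSplit)
    {
        // Append first, then test, so the character following a separator is not lost.
        sb.Append(c);
        var s = sb.ToString();
        // Ignore null/empty separators; a separator is complete once the buffer ends with it.
        var sep = separators.FirstOrDefault(i => !string.IsNullOrEmpty(i) && s.EndsWith(i, StringComparison.Ordinal));
        if (sep != null)
        {
            // Return the buffered text without the trailing separator.
            yield return s.Substring(0, s.Length - sep.Length);
            sb.Clear();
        }
    }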

    if (sb.Length > 0)
        yield return sb.ToString();
}
Steven Evers
  • This only splits on characters, not strings. – Servy Jan 27 '15 at 20:06
  • Note that when given an empty string, this will produce an empty enumerable, whereas `string.Split` returns a `string[1] { "" }` (an array with an empty string). – Ghost4Man Aug 07 '19 at 15:28
4

There is no such thing built-in. Regex.Matches is lazy if I interpret the decompiled code correctly. Maybe you can make use of that.
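As a rough illustration of the Regex.Matches idea: the documentation for Regex.Matches describes the returned MatchCollection as lazily populated when it is enumerated (accessing Count forces full evaluation). The sketch below assumes a hypothetical helper named Tokens that takes a pattern describing the tokens themselves rather than the separators, for example "[^,]+" for comma-separated values. Note that, unlike string.Split, it yields no empty entries.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyRegexTokens
{
    // Hypothetical helper: yields each match of tokenPattern as it is found.
    public static IEnumerable<string> Tokens(string input, string tokenPattern)
    {
        // Enumerating the MatchCollection (rather than reading Count)
        // lets the regex engine search for the next match on demand.
        foreach (Match m in Regex.Matches(input, tokenPattern))
        {
            yield return m.Value;
        }
    }
}
```

For instance, LazyRegexTokens.Tokens(text, "[^,]+") combined with LINQ's Take(2) should only scan as far as the second token.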

Or, you simply write your own split function.

Actually, you could imagine most string functions generalized to arbitrary sequences, often even sequences of T, not just char. The BCL does not emphasize that generalization at all. There is no Enumerable.Subsequence, for example.
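To make the generalization concrete, here is a minimal sketch of a split over any IEnumerable<T>. The name SplitWhere and its predicate-based signature are assumptions made for illustration; nothing like it ships in the BCL. It pulls items from the source one at a time, so it should also work on unbounded sequences as long as separators keep occurring.

```csharp
using System;
using System.Collections.Generic;

public static class SequenceSplitExtensions
{
    // Hypothetical helper: lazily splits a sequence into chunks
    // at every element for which isSeparator returns true.
    public static IEnumerable<List<T>> SplitWhere<T>(
        this IEnumerable<T> source, Func<T, bool> isSeparator)
    {
        var chunk = new List<T>();

        foreach (var item in source)
        {
            if (isSeparator(item))
            {
                // Emit the finished chunk (possibly empty) and start a new one.
                yield return chunk;
                chunk = new List<T>();
            }
            else
            {
                chunk.Add(item);
            }
        }

        // Emit the trailing chunk, so an empty source yields one empty chunk.
        yield return chunk;
    }
}
```

Since string implements IEnumerable<char>, a string can be split with "a,b,c".SplitWhere(c => c == ',').Select(chunk => new string(chunk.ToArray())), which needs System.Linq for Select.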

usr
  • I wish .NET had included an "immutable array of T" type; `String` could then simply be shorthand for "immutable array of char". I know there are many times I would have used "immutable array of Byte" or "immutable array of Int32" if they existed, and would expect generalization would be useful in many other cases as well. – supercat Jan 27 '15 at 20:01
  • @supercat: true, that's how Haskell handles strings. It enables generalizing a lot of string methods to lists... – Willem Van Onsem Jan 27 '15 at 20:04
  • @supercat There's `IReadOnlyList`. A `string` could be an `IReadOnlyList`, it's just that it wasn't around in .NET 1.0. – Servy Jan 27 '15 at 20:05
  • @Servy: but if I recall correctly, in .NET 1.0 a string wasn't an `IEnumerable` either. So one can slightly modify the design I guess? – Willem Van Onsem Jan 27 '15 at 20:06
  • @Servy a string can't have virtual methods. That would allow for arbitrary change of semantics. That's a very brittle model for such a fundamental type. Also, under the old CAS security model that would open up all kinds of holes. – usr Jan 27 '15 at 20:17
  • @usr `string` is `sealed`. Even if it had virtual methods, you couldn't inherit from them. You could also avoid having virtual methods by explicitly implementing the interface. You could also do other things like create an implicit conversion to that type, in which you returned an `internal` wrapper around the `char[]` that *did* implement that interface, without having `string` implement the interface. – Servy Jan 27 '15 at 20:19
  • @Servy I was talking about potentially using `IReadOnlyList` instead of string. A hypothetical scenario. I thought we were talking about that. BCL code could never accept an `IReadOnlyList` instead of a sealed string. – usr Jan 27 '15 at 20:21
  • @usr I thought you just meant having `string` implement `IReadOnlyList` so that you could treat it as a list when you wanted to. – Servy Jan 27 '15 at 20:23
  • @Servy: The `IReadOnlyList` interface is rather anemic, and provides neither a promise of immutability nor an efficient means of exporting a range of items to an array. Code which receives a `String` can safely assume its contents won't change, but there's no nice equivalent for a sequence of `Byte` or a sequence of `Int32`. – supercat Jan 27 '15 at 20:49
4

Nothing built-in, but feel free to rip my Tokenize method:

/// <summary>
/// Splits a string into tokens.
/// </summary>
/// <param name="s">The string to split.</param>
/// <param name="isSeparator">
/// A function testing if a code point at a position
/// in the input string is a separator.
/// </param>
/// <returns>A sequence of tokens.</returns>
IEnumerable<string> Tokenize(string s, Func<string, int, bool> isSeparator = null)
{
    if (isSeparator == null) isSeparator = (str, i) => !char.IsLetterOrDigit(str, i);

    int startPos = -1;

    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        if (!isSeparator(s, i))
        {
            if (startPos == -1) startPos = i;
        }
        else if (startPos != -1)
        {
            yield return s.Substring(startPos, i - startPos);
            startPos = -1;
        }
    }

    if (startPos != -1)
    {
        yield return s.Substring(startPos);
    }
}
Cory Nelson
1

There is no built-in method to do this as far as I know. But that doesn't mean you can't write one. Here is a sample to give you an idea:

public static IEnumerable<string> SplitLazy(this string str, params char[] separators)
{
    List<char> temp = new List<char>();
    foreach (var c in str)
    {
        if (separators.Contains(c))
        {
            // Emit the buffered chunk; consecutive separators produce no empty entries.
            if (temp.Count > 0)
            {
                yield return new string(temp.ToArray());
                temp.Clear();
            }
        }
        else
        {
            temp.Add(c);
        }
    }
    if(temp.Any()) { yield return new string(temp.ToArray()); }
}

Of course, this doesn't handle all cases and can be improved.

Selman Genç
1

I wrote this variant, which also supports StringSplitOptions and count. It behaves the same as string.Split in all test cases I tried. The nameof operator is specific to C# 6 and can be replaced by "count".

public static class StringExtensions
{
    /// <summary>
    /// Splits a string into substrings that are based on the characters in an array. 
    /// </summary>
    /// <param name="value">The string to split.</param>
    /// <param name="options"><see cref="StringSplitOptions.RemoveEmptyEntries"/> to omit empty array elements from the array returned; or <see cref="StringSplitOptions.None"/> to include empty array elements in the array returned.</param>
    /// <param name="count">The maximum number of substrings to return.</param>
    /// <param name="separator">A character array that delimits the substrings in this string, an empty array that contains no delimiters, or null. </param>
    /// <returns></returns>
    /// <remarks>
    /// Delimiter characters are not included in the elements of the returned array. 
    /// If this instance does not contain any of the characters in separator the returned sequence consists of a single element that contains this instance.
    /// If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the <see cref="Char.IsWhiteSpace"/> method.
    /// </remarks>
    public static IEnumerable<string> SplitLazy(this string value, int count = int.MaxValue, StringSplitOptions options = StringSplitOptions.None, params char[] separator)
    {
        if (count <= 0)
        {
            if (count < 0) throw new ArgumentOutOfRangeException(nameof(count), "Count cannot be less than zero.");
            yield break;
        }

        Func<char, bool> predicate = char.IsWhiteSpace;
        if (separator != null && separator.Length != 0)
            predicate = (c) => separator.Contains(c);

        if (string.IsNullOrEmpty(value) || count == 1 || !value.Any(predicate))
        {
            yield return value;
            yield break;
        }

        bool removeEmptyEntries = (options & StringSplitOptions.RemoveEmptyEntries) != 0;
        int ct = 0;
        var sb = new StringBuilder();
        for (int i = 0; i < value.Length; ++i)
        {
            char c = value[i];
            if (!predicate(c))
            {
                sb.Append(c);
            }
            else
            {
                if (sb.Length != 0)
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else
                {
                    if (removeEmptyEntries)
                        continue;
                    yield return string.Empty;
                }

                if (++ct >= count - 1)
                {
                    if (removeEmptyEntries)
                        while (++i < value.Length && predicate(value[i]));
                    else
                        ++i;
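                    if (i < value.Length)
                    {
                        // Return everything that is left as the final element.
                        sb.Append(value, i, value.Length - i);
                        yield return sb.ToString();
                    }
                    else if (!removeEmptyEntries)
                    {
                        // The input ended on a separator: string.Split yields a trailing empty entry.
                        yield return string.Empty;
                    }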
                    yield break;
                }
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
        else if (!removeEmptyEntries && predicate(value[value.Length - 1]))
            yield return string.Empty;
    }

    public static IEnumerable<string> SplitLazy(this string value, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, StringSplitOptions.None, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, StringSplitOptions options, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, options, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, int count, params char[] separator)
    {
        return value.SplitLazy(count, StringSplitOptions.None, separator);
    }
}
Rafael
0

I wanted the functionality of Regex.Split, but in a lazily evaluated form. The code below just runs through all Matches in the input string, and produces the same results as Regex.Split:

public static IEnumerable<string> Split(string input, string pattern, RegexOptions options = RegexOptions.None)
{
    // Always compile - we expect many executions
    var regex = new Regex(pattern, options | RegexOptions.Compiled);

    int currentSplitStart = 0;
    var match = regex.Match(input);

    while (match.Success)
    {
        yield return input.Substring(currentSplitStart, match.Index - currentSplitStart);

        currentSplitStart = match.Index + match.Length;
        match = match.NextMatch();
    }

    yield return input.Substring(currentSplitStart);
}

Note that using this with the pattern parameter @"\s" will give you the same results as string.Split().
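Because the method above constructs a compiled Regex on every call, callers that split many inputs with the same pattern may prefer to build the Regex once and reuse it. The following is a sketch of that variation, not the original method: the class name, the Regex-taking overload, and SplitOnWhitespace are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyRegexSplit
{
    // Built once and reused; the pattern is an assumed example (single whitespace characters).
    private static readonly Regex Whitespace = new Regex(@"\s", RegexOptions.Compiled);

    // Same matching loop as above, but the caller supplies a ready-made Regex.
    public static IEnumerable<string> Split(string input, Regex regex)
    {
        int start = 0;

        for (Match m = regex.Match(input); m.Success; m = m.NextMatch())
        {
            // Text between the previous separator and this one.
            yield return input.Substring(start, m.Index - start);
            start = m.Index + m.Length;
        }

        // Remainder after the last separator.
        yield return input.Substring(start);
    }

    public static IEnumerable<string> SplitOnWhitespace(string input)
    {
        return Split(input, Whitespace);
    }
}
```

The trade-off is that the pattern and options are fixed where the Regex is constructed, so each distinct pattern needs its own cached instance.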

Simon MᶜKenzie
  • Just a note for readers: when using this code in production, move the Regex definition out of the method scope. Otherwise regex compilation will occur on every Split execution. – AlfeG Jan 27 '21 at 08:45
0

A lazy split that does not create temporary strings.

Each chunk is copied directly from the source using mscorlib's String.Substring.

public static IEnumerable<string> LazySplit(this string source, StringSplitOptions stringSplitOptions, params string[] separators)
{
    var sourceLen = source.Length;

    bool IsSeparator(int index, string separator)
    {
        var separatorLen = separator.Length;

        if (sourceLen < index + separatorLen)
        {
            return false;
        }

        for (var i = 0; i < separatorLen; i++)
        {
            if (source[index + i] != separator[i])
            {
                return false;
            }
        }

        return true;
    }

    var indexOfStartChunk = 0;

    for (var i = 0; i < source.Length; i++)
    {
        foreach (var separator in separators)
        {
            if (IsSeparator(i, separator))
            {
                var chunkLen = i - indexOfStartChunk;

                // Skip empty chunks only when RemoveEmptyEntries is requested.
                if (chunkLen > 0 || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
                {
                    yield return source.Substring(indexOfStartChunk, chunkLen);
                }

                i += separator.Length;
                indexOfStartChunk = i--;
                break;
            }
        }
    }

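    // Emit the trailing chunk, including the whole string when no separator was found.
    if (indexOfStartChunk < sourceLen || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
    {
        yield return source.Substring(indexOfStartChunk);
    }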
}
Ivan