4

This is for C# on .NET 3.5.

I have an `ICollection<byte>` that I'm trying to split into separate collections, where the delimiter is itself a sequence of bytes.

For example

ICollection<byte> input = new byte[] { 234, 12, 12, 23, 11, 32, 23, 11, 123, 32 };
ICollection<byte> delimiter = new byte[] { 23, 11 };
List<ICollection<byte>> result = input.splitBy(delimiter);

would result in

result.item(0) = {234, 12, 12};
result.item(1) = {32};
result.item(2) = {123, 32};
Corey S
  • @Ben: possibly, but not necessarily. I've had to do similar things in the real world. – AllenG Jun 20 '11 at 19:31
  • Here's a novel idea. At least for the case of `byte`s, you could maybe convert it to an ASCII string and use `Split()` to do the splitting. Not sure if it will work for all cases but sounds good in theory. – Jeff Mercado Jun 20 '11 at 19:33
  • @AllenG true, but at least he could write down some of his thoughts about this small task couldn't he? ;) – ba__friend Jun 20 '11 at 19:37
  • @ba__friend: sure. But I think I've dropped one or two of these "Here's my question. Context? What context?" so I'm willing to cut some slack. – AllenG Jun 20 '11 at 19:51
  • @Ben: smells like _teen spirit_. FTFY! :-) – Doctor Jones Jun 20 '11 at 19:52
  • Let's wait maybe he'll add something ;) – ba__friend Jun 20 '11 at 19:57
  • @Ben @AllenG @ba_friend It's not homework, it's for a project I'm working on. The byte information is an encoded string concatenated with the byte data of a file. @Jeff I thought about doing that, but I did not want to convert to a string and back. – Corey S Jun 21 '11 at 09:38

5 Answers

4
private static IEnumerable<IEnumerable<T>> Split<T>
    (IEnumerable<T> source, ICollection<T> delimiter)
{
    // window represents the last [delimiter length] elements in the sequence,
    // buffer is the elements waiting to be output when delimiter is hit

    var window = new Queue<T>();
    var buffer = new List<T>();

    foreach (T element in source)
    {
        buffer.Add(element);
        window.Enqueue(element);
        if (window.Count > delimiter.Count)
            window.Dequeue();

        if (window.SequenceEqual(delimiter))
        {
            // number of non-delimiter elements in the buffer
            int nElements = buffer.Count - window.Count;
            if (nElements > 0)
                yield return buffer.Take(nElements).ToArray();

            window.Clear();
            buffer.Clear();
        }
    }

    if (buffer.Any())
        yield return buffer;
}
mqp
  • Wouldn't `Queue` be better than `LinkedList`? And why don't you use `foreach`? – svick Jun 20 '11 at 19:49
  • This yields the same buffer for all sequences so `Split(source, delimiter).ToList()` would return a list of the same sequence, the last one parsed. – Rick Sladkey Jun 20 '11 at 19:52
  • @svick: Those are good points, I think both would be accurate criticisms. I'll update my answer in a moment. @Rick: Oops, error in transcription. I'll put my ToArray back. – mqp Jun 20 '11 at 19:52
  • hmm... SequenceEqual() O(n), Clear() O(n), Take(), O(nrOfItem), ToArray() O(n) right? – Magnus Jun 20 '11 at 20:01
  • Sure, but all of those are O(n) on the length of the delimiter. One would expect that the common case for this is to be searching through a long sequence for a relatively small delimiter, so I'm not very concerned about the complexity of those things. I would bet a lot of money that `SequenceEqual` is consuming the great majority of the time for long searches here, and if you wanted to speed it up, you'd be best off making it smarter so that it didn't have to compare the sequence upon finding each new element. However, a solution like that would be longer and more complicated. – mqp Jun 20 '11 at 20:45
  • @mquander maybe, maybe not. I wouldn't assume anything. It can be done better. – Magnus Jun 20 '11 at 20:49
  • Neat solution, but there are some potential edge case issues. What if your source was identical to the delimiter sequence? Shouldn't you get two empty sequences in the result? – Rob Jun 20 '11 at 21:00
  • @Magnus: I don't think there's any maybe not, I think my comment was an accurate representation of the performance. @Rob: I don't really think so, personally -- I would expect an empty sequence of sequences coming back. I agree that someone should think about what they want in the edge cases before writing this. – mqp Jun 21 '11 at 01:14
  • `"e".Split('e')` yields two empty strings. In the same way, I'd expect this case (where `source` equals `delimiter`) to yield two empty sequences. I'd guess (didn't bother to check) that you can just remove the final `if (buffer.Any())` condition, and always return the buffer, empty or not. – Rob Jun 21 '11 at 08:57
  • @Rob: That would yield one empty sequence, but not two. I like it this way, since I think it corresponds to the most useful behavior of `string.Split` (that is, `RemoveEmptyEntries`) but I also think there would be a lot of merit to extending it to behave exactly like `string.Split`, as Jeff Mercado did in his answer. – mqp Jun 21 '11 at 13:24
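The `string.Split` behaviour debated in these comments is easy to check directly. The snippet below is only a small sketch; the class and method names (`SplitSemantics`, `KeepEmpty`, `DropEmpty`) are made up for illustration and are not part of any answer here.

```csharp
using System;

public static class SplitSemantics
{
    // "e" split on 'e' with empty entries kept: one empty string on each
    // side of the single separator, so the array has two elements.
    public static string[] KeepEmpty()
    {
        return "e".Split(new[] { 'e' }, StringSplitOptions.None);
    }

    // Same input with RemoveEmptyEntries: both empty strings are dropped,
    // so the array is empty.
    public static string[] DropEmpty()
    {
        return "e".Split(new[] { 'e' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

`KeepEmpty()` returns two empty strings and `DropEmpty()` returns an empty array, which corresponds to the two behaviours being compared: keeping empty sequences around a delimiter, or skipping them.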
2

An optimal solution would not use `SequenceEqual()` to check each subrange. Otherwise you could end up iterating over the full length of the delimiter for every item in the sequence, which could hurt performance, especially for long delimiters. The match can instead be tracked as the source sequence is enumerated.

Here's what I'd write, but there's always room for improvement. I aimed for semantics similar to `String.Split()`.

public enum SequenceSplitOptions { None, RemoveEmptyEntries }
public static IEnumerable<IList<T>> SequenceSplit<T>(
    this IEnumerable<T> source,
    IEnumerable<T> separator)
{
    return SequenceSplit(source, separator, SequenceSplitOptions.None);
}
public static IEnumerable<IList<T>> SequenceSplit<T>(
    this IEnumerable<T> source,
    IEnumerable<T> separator,
    SequenceSplitOptions options)
{
    if (source == null)
        throw new ArgumentNullException("source");
    if (options != SequenceSplitOptions.None
     && options != SequenceSplitOptions.RemoveEmptyEntries)
        throw new ArgumentException("Illegal option: " + (int)options);
    if (separator == null)
    {
        yield return source.ToList();
        yield break;
    }

    var sep = separator as IList<T> ?? separator.ToList();
    if (sep.Count == 0)
    {
        yield return source.ToList();
        yield break;
    }

    var buffer = new List<T>();
    var candidate = new List<T>(sep.Count);
    var sindex = 0;
    foreach (var item in source)
    {
        candidate.Add(item);
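        if (!EqualityComparer<T>.Default.Equals(item, sep[sindex]))
        {   // candidate is no longer a prefix of the delimiter: move elements
            // to the buffer until the remainder is a (possibly empty) prefix,
            // so that {23, 23, 11} still matches the delimiter {23, 11}
            do
            {
                buffer.Add(candidate[0]);
                candidate.RemoveAt(0);
            } while (!candidate.SequenceEqual(sep.Take(candidate.Count)));
            sindex = candidate.Count;
        }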
        else if (++sindex >= sep.Count)
        {   // candidate is the delimiter
            if (options == SequenceSplitOptions.None || buffer.Count > 0)
                yield return buffer.ToList();
            buffer.Clear();
            candidate.Clear();
            sindex = 0;
        }
    }
    if (candidate.Count > 0)
        buffer.AddRange(candidate);
    if (options == SequenceSplitOptions.None || buffer.Count > 0)
        yield return buffer;
}
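For long delimiters with repeated prefixes, match progress can also be tracked with a Knuth-Morris-Pratt failure table, so that no part of the input is ever re-scanned. What follows is a minimal sketch of that idea, not the code above: the names `KmpSplitSketch`, `BuildFailureTable` and `Split` are hypothetical, the delimiter is assumed to be non-empty, and empty segments are always kept (like `StringSplitOptions.None`).

```csharp
using System;
using System.Collections.Generic;

public static class KmpSplitSketch
{
    // fail[i] = length of the longest proper prefix of sep[0..i]
    // that is also a suffix of sep[0..i].
    private static int[] BuildFailureTable<T>(IList<T> sep, IEqualityComparer<T> cmp)
    {
        var fail = new int[sep.Count];
        int k = 0;
        for (int i = 1; i < sep.Count; i++)
        {
            while (k > 0 && !cmp.Equals(sep[i], sep[k]))
                k = fail[k - 1];
            if (cmp.Equals(sep[i], sep[k]))
                k++;
            fail[i] = k;
        }
        return fail;
    }

    // Because this is an iterator, the argument checks run on first enumeration.
    public static IEnumerable<IList<T>> Split<T>(IEnumerable<T> source, IList<T> sep)
    {
        if (source == null)
            throw new ArgumentNullException("source");
        if (sep == null || sep.Count == 0)
            throw new ArgumentException("Separator must be non-empty.", "sep");

        IEqualityComparer<T> cmp = EqualityComparer<T>.Default;
        int[] fail = BuildFailureTable(sep, cmp);
        var buffer = new List<T>();
        int matched = 0; // the last 'matched' items of buffer equal sep[0..matched)

        foreach (T item in source)
        {
            buffer.Add(item);
            // On a mismatch, fall back to the longest shorter prefix that still fits.
            while (matched > 0 && !cmp.Equals(item, sep[matched]))
                matched = fail[matched - 1];
            if (cmp.Equals(item, sep[matched]))
                matched++;

            if (matched == sep.Count)
            {
                // The tail of buffer is the separator: drop it and emit the rest.
                buffer.RemoveRange(buffer.Count - sep.Count, sep.Count);
                yield return buffer;
                buffer = new List<T>(); // fresh list, so earlier results are not mutated
                matched = 0;
            }
        }
        yield return buffer; // whatever follows the last separator (possibly empty)
    }
}
```

With the question's input and the delimiter `{ 23, 11 }`, this yields `{ 234, 12, 12 }`, `{ 32 }` and `{ 123, 32 }`. An input of `{ 23, 23, 11 }` yields `{ 23 }` followed by an empty segment, because the second `23` is recognised as the start of a new match.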
Jeff Mercado
1
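public IEnumerable<IEnumerable<T>> SplitByCollection<T>(IEnumerable<T> source, 
                                                        IEnumerable<T> delimiter)
{
    // Materialize both sequences once so neither is re-enumerated from its origin.
    var sourceArray = source.ToArray();
    var delimiterArray = delimiter.ToArray();
    var delimiterCount = delimiterArray.Length;

    // An empty delimiter never splits anything.
    if (delimiterCount == 0)
    {
        yield return sourceArray;
        yield break;
    }

    int lastIndex = 0;

    for (int i = 0; i < sourceArray.Length; i++)
    {
        if (delimiterArray.SequenceEqual(sourceArray.Skip(i).Take(delimiterCount)))
        {
            yield return sourceArray.Skip(lastIndex).Take(i - lastIndex);

            // Resume right after the delimiter; the -1 offsets the loop's i++,
            // so back-to-back delimiters are not skipped.
            lastIndex = i + delimiterCount;
            i = lastIndex - 1;
        }
    }

    if (lastIndex < sourceArray.Length)
        yield return sourceArray.Skip(lastIndex);
}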

Calling it ...

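var result = SplitByCollection(input, delimiter);

foreach (var element in result)
{
    // .NET 3.5 only has string.Join(string, string[]), so convert explicitly.
    Console.WriteLine(string.Join(", ", element.Select(b => b.ToString()).ToArray()));
}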

returns

234, 12, 12
32
123, 32
ulrichb
  • This is surely the right idea, but I really hope you would not seriously suggest this implementation. It's not reasonable to enumerate the source sequence over and over, and the algorithm has terrible complexity because of it. – mqp Jun 20 '11 at 19:32
  • `.Count()` is an O(n) operation – Magnus Jun 20 '11 at 19:33
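On the complexity point raised in these comments: one way to avoid walking the source again for every position is to work on indexable lists and compare in place. This is a rough sketch under that assumption, with hypothetical names (`IndexedSplitSketch`, `MatchesAt`, `Slice`). It requires a non-empty delimiter and keeps empty segments. The worst case is still proportional to source length times delimiter length, but nothing is re-enumerated from the start.

```csharp
using System;
using System.Collections.Generic;

public static class IndexedSplitSketch
{
    public static List<List<T>> Split<T>(IList<T> source, IList<T> delimiter)
    {
        if (source == null)
            throw new ArgumentNullException("source");
        if (delimiter == null || delimiter.Count == 0)
            throw new ArgumentException("Delimiter must be non-empty.", "delimiter");

        IEqualityComparer<T> cmp = EqualityComparer<T>.Default;
        var result = new List<List<T>>();
        int segmentStart = 0;
        int i = 0;

        // Only positions where a whole delimiter still fits can match.
        while (i + delimiter.Count <= source.Count)
        {
            if (MatchesAt(source, i, delimiter, cmp))
            {
                result.Add(Slice(source, segmentStart, i));
                i += delimiter.Count;   // jump past the delimiter
                segmentStart = i;
            }
            else
            {
                i++;
            }
        }

        // Everything after the last delimiter is the final segment.
        result.Add(Slice(source, segmentStart, source.Count));
        return result;
    }

    // True if delimiter occurs in source starting at index 'start'.
    private static bool MatchesAt<T>(IList<T> source, int start,
                                     IList<T> delimiter, IEqualityComparer<T> cmp)
    {
        for (int j = 0; j < delimiter.Count; j++)
        {
            if (!cmp.Equals(source[start + j], delimiter[j]))
                return false;
        }
        return true;
    }

    // Copies source[start..end) into a new list.
    private static List<T> Slice<T>(IList<T> source, int start, int end)
    {
        var segment = new List<T>(end - start);
        for (int k = start; k < end; k++)
            segment.Add(source[k]);
        return segment;
    }
}
```

Called as `IndexedSplitSketch.Split(input, delimiter)` with the arrays from the question, it returns three lists: `{ 234, 12, 12 }`, `{ 32 }` and `{ 123, 32 }`.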
0

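Here is my take on it. Note that it splits on any single byte that appears in the delimiter, not on the delimiter as a whole sequence: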

public static IEnumerable<IList<byte>> Split(IEnumerable<byte> input, IEnumerable<byte> delimiter)
{
    var l = new List<byte>();
    var set = new HashSet<byte>(delimiter);
    foreach (var item in input)
    {
        if(!set.Contains(item))
            l.Add(item);
        else if(l.Count > 0)
        {
            yield return l;
            l = new List<byte>();
        }
    }
    if(l.Count > 0)
        yield return l;
}
Magnus
-1

There are probably better methods, but here's one I've used before. It's fine for relatively small collections:

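byte startDelimit = 23;
byte endDelimit = 11;
List<ICollection<byte>> result = new List<ICollection<byte>>();
int lastMatchingPosition = 0;
var inputAsList = input.ToList();

// Stop one short of the end so that inputAsList[i + 1] is always valid.
for (int i = 0; i < inputAsList.Count - 1; i++)
{
    if (inputAsList[i] == startDelimit && inputAsList[i + 1] == endDelimit)
    {
        // ICollection<byte> is an interface, so use a concrete List<byte>.
        ICollection<byte> temp = new List<byte>();
        for (int j = lastMatchingPosition; j < i; j++)
        {
            temp.Add(inputAsList[j]);
        }
        result.Add(temp);
        lastMatchingPosition = i + 2;
        i++; // skip the second delimiter byte
    }
}

// Whatever follows the last delimiter is the final segment.
if (lastMatchingPosition < inputAsList.Count)
    result.Add(inputAsList.GetRange(lastMatchingPosition, inputAsList.Count - lastMatchingPosition));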

I don't have my IDE open at the moment, so that may not compile as-is, or may have some holes you'll need to plug. But it's where I start when I run into this problem. Again, as I said before, if this is for large collections, it'll be slow, so better solutions may yet exist.

AllenG
  • What if the delimiter is three bytes long? Or ten? – svick Jun 20 '11 at 19:41
  • You'd need to change the logic suitably. Using the collection of bytes as a delimiter, you _could_ enumerate over that vs. your main list each time inputAsList[i] == your first delimiter, but that would slow it down even more. – AllenG Jun 20 '11 at 19:43
  • I think I said that. It does assume a small collection. – AllenG Jun 20 '11 at 19:48
  • I would not define only a size of `2` for a delimiter as small. What if you needed `3`, `4`, `5`? Surely those are small as well. – Jeff Mercado Jun 20 '11 at 19:53
  • Ah. That part. I addressed that in a comment. Since he was only looking for two in the question, that's the assumption I used in my answer. If he'd been looking for more than that then, I probably would have changed it a little. But, no, as it currently exists, it doesn't scale all that well. Thus my disclaimers. – AllenG Jun 20 '11 at 19:56
  • Goes without saying, but you can't construct an object of type `ICollection`. – mqp Jun 20 '11 at 19:59