2

NOTE: I asked the same question here, but it was marked as a duplicate even though it had attracted some crafty, neat solutions. I created this extra (dupe) question, at the suggestion of fellow Stack Overflow members, to make things easier for others who face similar doubts.

What is an efficient way to parse a large delimited string so that I can access just one element of the delimited set without having to store the other substrings?

I specifically do not want to store the rest of the element values, as happens with the Split() method, since that information is irrelevant to the problem at hand. I also want to save memory while doing so.

Problem Statement:
Given the exact delimited position, I need to extract the element contained in that given position in the most efficient way in terms of memory consumed and time taken.

Simple example string: "1,2,3,4,....,21,22,23,24"
Delimiter: ,
Delimited Position: 22
Answer expected: 23

Another example string: "61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;""0x8D15A2913C934DE"";Thursday, 19-Jun-14 22:58:10 GMT;"
Delimiter: ;
Delimited Position: 7
Expected Answer: 23

Andrew Morton
re3el
  • Did you find that using `split()` is time/memory consuming? If so, you should [edit] your question and add time/memory comparisons for the various approaches you have tried. For example, if you used a regex, add the regex expression and the time/memory used; add the `split()` call you used and its time/memory too. Is this task very demanding in your application? – Mauricio Arias Olave Jan 18 '19 at 21:33
  • Before you spend a lot of time optimizing, make sure you're solving a problem that is actually a problem. You have at a minimum two billion bytes of address space available; what fraction of that are you "wasting" with your Split? If the answer is 0.00001%, then you might ask yourself whether it is worth spending even one minute trying to solve this problem. *Collection pressure* is much more likely to be the real problem than *virtual memory usage*. – Eric Lippert Jan 18 '19 at 22:51

5 Answers

2

There are some useful remarks relevant to this problem in the documentation for String.Split, although I wrote the following before discovering that.

One way to do it is to find a delimiter with the String.IndexOf method: you can specify the index to start the search from, so it is possible to skip along the items without examining every character yourself. (The examination of every character still happens behind the scenes, but it's a little bit faster than doing it yourself.)

I made up an extension method by adding a new class file named "ExtensionMethods.cs" to the solution with this content:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        /// <summary>
        /// Get the nth item from a delimited string.
        /// </summary>
        /// <param name="s">The string to retrieve a delimited item from.</param>
        /// <param name="delimiter">The character used as the item delimiter.</param>
        /// <param name="n">Zero-based index of item to return.</param>
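        /// <returns>The nth item or an empty string.</returns>
        /// <remarks>
        /// Note: newer runtimes (.NET Core 2.0 and later) add an instance overload
        /// string.Split(char, int, StringSplitOptions), which takes precedence over an extension
        /// method with this name and signature, so a different method name may be needed there.
        /// </remarks>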
        public static string Split(this string s, char delimiter, int n)
        {

            int pos = s.IndexOf(delimiter);

            if (n == 0 || pos < 0)
            { return (pos >= 0) ? s.Substring(0, pos) : s; }

            int nDelims = 1;

            while (nDelims < n && pos >= 0)
            {
                pos = s.IndexOf(delimiter, pos + 1);
                nDelims++;
            }

            string result = "";

            if (pos >= 0)
            {
                int nextDelim = s.IndexOf(delimiter, pos + 1);
                result = (nextDelim < 0) ? s.Substring(pos + 1) : s.Substring(pos + 1, nextDelim - pos - 1);
            }

            return result;
        }

    }
}

And a small program to test it:

using System;
using System.Diagnostics;
using System.Linq;
using ExtensionMethods;

namespace ConsoleApp1
{

    class Program
    {

        static void Main(string[] args)
        {
            // test data...
            string s = string.Join(";", Enumerable.Range(65, 26).Select(c => (char)c));
            s = s.Insert(3, ";;;");

            string o = "";

            Stopwatch sw = new Stopwatch();

            sw.Start();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';', 21);
            }
            sw.Stop();
            Console.WriteLine("Item directly selected: " + sw.ElapsedMilliseconds);

            sw.Restart();
            for (int i = 1; i <= 1000000; i++) {
                o = s.Split(';')[21];
            }
            sw.Stop();
            Console.WriteLine("Item from split array:  " + sw.ElapsedMilliseconds + "\r\n");


            Console.WriteLine(s);
            Console.WriteLine(o);

            Console.ReadLine();

        }
    }
}

Sample output:

Item directly selected: 1016
Item from split array: 1345

A;B;;;;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
S


Reference: How to: Implement and Call a Custom Extension Method (C# Programming Guide)

Andrew Morton
  • That's some amazing difference I am seeing here :) Thanks for this, could you also maybe add Regex to this list? That would be conclusive enough of the options at hand I assume. I tried with `Regex regex = new Regex(@"^(\w+,){21}(\w+)", RegexOptions.Singleline)` but that did not work as expected. Could you maybe try from here, if you haven't thought about this? Thanks again! – re3el Jan 18 '19 at 21:49
  • 1
    @re3el I had originally started out with a regex solution, but my regex-fu is weak today. Perhaps you'd like to direct a comment to user [lagripe](https://stackoverflow.com/users/8004593/lagripe) in the original version of this question and suggest they create an answer here. – Andrew Morton Jan 18 '19 at 21:53
  • Your solution seems to be running faster than all other solutions posted on this page. Could you possibly check if you are seeing the same? Not completely sure how it could be machine specific but am just unable to get the benchmark results posted by @Simonare from his answer. – re3el Jan 19 '19 at 00:19
2

Try this:

public static string MyExtension(this string s, char delimiter, int n)
{
    // Index of the delimiter that precedes item n; -1 stands for "start of string" when n == 0.
    var begin = n == 0 ? -1 : Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n);
    if (n != 0 && begin == -1)
        return null;
    var end = s.IndexOf(delimiter, begin + 1);
    if (end == -1) end = s.Length;
    //var end = Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n + 1);
    var result = s.Substring(begin + 1, end - begin - 1);

    return result;
}

PS: Library used is Westwind.Utilities


Benchmark Code:

void Main()
{
    string s = string.Join(";", Enumerable.Range(65, 26).Select(c => (char)c));
    s = s.Insert(3, ";;;");

    string o = "";

    Stopwatch sw = new Stopwatch();

    sw.Start();
    for (int i = 1; i <= 1000000; i++) {
        o = s.Split(';', 21);
    }
    sw.Stop();
    Console.WriteLine("Item directly selected: " + sw.ElapsedMilliseconds);

    sw.Restart();
    for (int i = 1; i <= 1000000; i++) {
        o = s.MyExtension(';', 21);
    }
    sw.Stop();
    Console.WriteLine("Item directly selected by MyExtension: " + sw.ElapsedMilliseconds);

    sw.Restart();
    for (int i = 1; i <= 1000000; i++) {
        o = s.Split(';')[21];
    }
    sw.Stop();
    Console.WriteLine("Item from split array:  " + sw.ElapsedMilliseconds + "\r\n");

    Console.WriteLine(s);
    Console.WriteLine(o);
}

public static class MyExtensions
{
    /// <summary>
    /// Get the nth item from a delimited string.
    /// </summary>
    /// <param name="s">The string to retrieve a delimited item from.</param>
    /// <param name="delimiter">The character used as the item delimiter.</param>
    /// <param name="n">Zero-based index of item to return.</param>
    /// <returns>The nth item or an empty string.</returns>
    public static string Split(this string s, char delimiter, int n)
    {

        int pos = s.IndexOf(delimiter);

        if (n == 0 || pos < 0)
        { return (pos >= 0) ? s.Substring(0, pos) : s; }

        int nDelims = 1;

        while (nDelims < n && pos >= 0)
        {
            pos = s.IndexOf(delimiter, pos + 1);
            nDelims++;
        }

        string result = "";

        if (pos >= 0)
        {
            int nextDelim = s.IndexOf(delimiter, pos + 1);
            result = (nextDelim < 0) ? s.Substring(pos + 1) : s.Substring(pos + 1, nextDelim - pos - 1);
        }

        return result;
    }

    public static string MyExtension(this string s, char delimiter, int n)
    {
        // Index of the delimiter that precedes item n; -1 stands for "start of string" when n == 0.
        var begin = n == 0 ? -1 : Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n);
        if (n != 0 && begin == -1)
            return null;
        var end = s.IndexOf(delimiter, begin + 1);
        if (end == -1) end = s.Length;
        //var end = Westwind.Utilities.StringUtils.IndexOfNth(s, delimiter, n + 1);
        var result = s.Substring(begin + 1, end - begin - 1);

        return result;
    }

}

Results:

Item directly selected: 277
Item directly selected by MyExtension: 114
Item from split array:  1297

A;B;;;;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
S

Edit: Thanks to @Kalten, I enhanced the solution further. The benchmark results show a considerable difference.

Derviş Kayımbaşıoğlu
  • I think it's probably OK to give a link to `https://github.com/rickstrahl/westwind.utilities` - if you are connected with it any way (e.g. a contributor), just say so. – Andrew Morton Jan 18 '19 at 21:48
  • 1
    Is there a way to not restart from the beginning of the string for the end index? – Kalten Jan 18 '19 at 21:53
  • I am not but I am going to share benchmark results as well. – Derviş Kayımbaşıoğlu Jan 18 '19 at 21:54
  • @Kalten Thank you, I enhanced my solution further. the elapsed time dropped further (considerable drop has been seen) – Derviş Kayımbaşıoğlu Jan 18 '19 at 22:12
  • You need to check if you are not at the end of the string (last item may not have a right delimiter). And also you can't select the first column :p – Kalten Jan 18 '19 at 22:13
  • Motivation brings success :) Please check my answer – Derviş Kayımbaşıoğlu Jan 18 '19 at 22:24
  • Yep, the last column seems OK now, but not the first. You should look for `begin` first and then: `if n == 0 && begin == -1`, return the whole string; `if n == 0 && begin >= 0`, return the substring from 0 to begin - 1. Something like that. – Kalten Jan 18 '19 at 22:29
  • Now they both work. But I am angry at myself that I didn't put enough effort into it at the beginning. I will probably enhance it more – Derviş Kayımbaşıoğlu Jan 18 '19 at 22:39
  • 1
    @Simonare That utility appears to be a useful discovery. I think that any enhancements (e.g. splitting on a string instead of a character) might go a little bit too far beyond what the question asked for. Of course, if someone wants the gold-plated version... :) – Andrew Morton Jan 18 '19 at 22:43
  • 2
    \o/ For information, the source code of `IndexOfNth` is [here](https://github.com/RickStrahl/Westwind.Utilities/blob/master/Westwind.Utilities/Utilities/StringUtils.cs#L158). Nothing more than a for loop – Kalten Jan 18 '19 at 22:45
  • 1
    @Kalten, Yep, I already dug into it but I didn't want to remove it. The repo looks useful and I believe it may be worth supporting enthusiastic developers :) – Derviş Kayımbaşıoğlu Jan 18 '19 at 22:48
  • 1
    @Simonare Now that makes me wonder what I've done inefficiently in my attempt, as I expected using IndexOf to skip over parts to be faster than just looking at each character in user code. – Andrew Morton Jan 18 '19 at 22:50
  • I am also looking into it, if you find something please let me know :) – Derviş Kayımbaşıoğlu Jan 18 '19 at 23:13
  • @AndrewMorton Take a look at [Cache prefetching](https://en.wikipedia.org/wiki/Cache_prefetching). This often gives better results with a sequential array loop. With branch prediction too, you always need to run benchmarks – Kalten Jan 18 '19 at 23:16
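For readers without the library, the plain sequential loop discussed in these comments can be sketched as follows. This is only an illustration of the idea, not the Westwind.Utilities source; `DelimitedFieldSketch` and `FieldAt` are made-up names, and returning null for a missing item is an assumption borrowed from the answer above.

```csharp
using System;

public static class DelimitedFieldSketch
{
    // Hypothetical helper: returns the zero-based nth item, or null if there is no such item.
    public static string FieldAt(string s, char delimiter, int n)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (n < 0) return null;

        int start = 0; // index of the first character of the current item
        int field = 0; // zero-based number of the current item

        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] != delimiter) continue;

            // A delimiter closes the current item; stop as soon as it is the one we want.
            if (field == n) return s.Substring(start, i - start);
            field++;
            start = i + 1;
        }

        // End of string reached: the last item has no closing delimiter.
        return field == n ? s.Substring(start) : null;
    }
}
```

Only one string is allocated (the returned item), and scanning stops at the delimiter that closes it.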
1

You can use the following regex: ^([^;]*;){21}(.*?); . With it you don't have to generate the whole split list to find your desired position; the regex skips the preceding items, and all that remains is to check whether a match exists or not.

Explanation :

^ --> start of a line.

([^;]*;){Position - 1} --> notice that the symbol ; here is the delimiter; the expression repeats Position - 1 times (Position is 1-based here)

(.*?) --> Non-Greedy .*

DEMO

For more about regular expressions on C# : documentation

In the example below I implemented the two samples to show you how it works.

Match method: documentation (basically it searches only for the first occurrence of the pattern). RegexOptions.Singleline: makes . match every character, including newline, so the input is treated as a single line.

C# Code

Console.WriteLine("First Delimiter : ");
int Position = 22;
char delimiter = ',';
string pattern = @"^([^" + delimiter + "]*" + delimiter + "){" + (Position - 1) + @"}(.*?)" + delimiter;
Regex regex = new Regex(pattern, RegexOptions.Singleline);
// First Example
string Data = @"AAV,zzz,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22ABC,23,24,24";
Match Re = regex.Match(Data);
if (Re.Success) // Groups.Count is fixed by the pattern, so test Success instead
    Console.WriteLine("\tMatch found : " + Re.Groups[2]);


// Second Example
Console.WriteLine("Second Delimiter : ");
Position = 8;
delimiter = ';';
pattern = @"^([^" + delimiter + "]*" + delimiter + "){" + (Position - 1) + @"}(.*?)" + delimiter;
Data = @"61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;""0x8D15A2913C934DE"";Thursday, 19-Jun-14 22:58:10 GMT;";
regex = new Regex(pattern, RegexOptions.Singleline);
Re = regex.Match(Data);
if (Re.Success)
    Console.WriteLine("\tMatch found : " + Re.Groups[2]);

Output :

First Delimiter :

    Match found : 22ABC

Second Delimiter :

    Match found : 23
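
The pattern above needs a delimiter after the wanted item, so the last item of a string without a trailing delimiter is not matched, and a delimiter that is a regex metacharacter would have to be escaped. A possible variant that addresses both is sketched below; `RegexFieldSketch` and `FieldAt` are hypothetical names, the position is 1-based as in the code above, and writing the delimiter as a \uXXXX escape is my own choice rather than part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexFieldSketch
{
    // Hypothetical helper: returns the item at the given 1-based position, or null if absent.
    public static string FieldAt(string data, char delimiter, int position)
    {
        if (position < 1) throw new ArgumentOutOfRangeException(nameof(position));

        // A \uXXXX escape is valid both inside and outside a character class,
        // so metacharacters such as '.', '|' or ']' need no special handling.
        string d = "\\u" + ((int)delimiter).ToString("X4");

        // Skip (position - 1) items together with their delimiters, then capture
        // everything up to the next delimiter or the end of the input.
        string pattern = "^(?:[^" + d + "]*" + d + "){" + (position - 1) + "}([^" + d + "]*)";

        Match m = Regex.Match(data, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because the capture group is `[^delimiter]*` rather than a lazy `.*?`, RegexOptions.Singleline is not needed here.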
lagripe
1

If you want to be sure the code parses the string in only one pass, and only parses what is needed, you can write the routine that iterates over the string yourself.

Since every C# string implements IEnumerable<char>, it is fairly straightforward to devise a method that requires zero string allocations:

static public IEnumerable<char> GetDelimitedField(this IEnumerable<char> source, char delimiter, int index)
{
    foreach (var c in source)
    {
        if (c == delimiter) 
        {
            if (--index < 0) yield break;
        }
        else
        {
            if (index == 0) yield return c;
        }
    }
}

This returns the result as an IEnumerable<char> but it's cheap to convert to a string. It's going to be a much shorter string at this point anyway.

static public string GetDelimitedString(this string source, char delimiter, int index)
{
    var result = source.GetDelimitedField(delimiter, index);
    return new string(result.ToArray());
}

And you can call it like this:

var input ="Zero,One,Two,Three,Four,Five,Six";
var output = input.GetDelimitedString(',',5);
Console.WriteLine(output);

Output:

Five

Example on DotNetFiddle
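
If the target runtime has `ReadOnlySpan<char>` (an assumption: .NET Core 2.1 or later, or the System.Memory package), the same idea can be sketched without the iterator and the intermediate char array. `SpanFieldSketch` and `FieldAt` are made-up names for illustration; like the method above, the sketch yields an empty result when the index is out of range.

```csharp
using System;

public static class SpanFieldSketch
{
    // Hypothetical helper: returns a view over the original characters, so no copy is made.
    public static ReadOnlySpan<char> FieldAt(ReadOnlySpan<char> source, char delimiter, int index)
    {
        if (index < 0) return ReadOnlySpan<char>.Empty;

        // Drop the items that come before the requested one.
        for (int i = 0; i < index; i++)
        {
            int pos = source.IndexOf(delimiter);
            if (pos < 0) return ReadOnlySpan<char>.Empty; // fewer items than requested
            source = source.Slice(pos + 1);
        }

        // The requested item ends at the next delimiter or at the end of the input.
        int end = source.IndexOf(delimiter);
        return end < 0 ? source : source.Slice(0, end);
    }
}
```

A call such as `SpanFieldSketch.FieldAt(input.AsSpan(), ',', 5)` returns a slice of `input`; a string is only allocated if the caller then calls `ToString()` on the result.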

John Wu
-1

Too late to count as an "answer", but this code gives me a run time of about 0.75 seconds with both strings processed 1,000,000 times. The difference this time is that I'm not marshaling an object but using pointers.

And this time I am returning a single new string (String.Substring).

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        string testString1 = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24";
        string testString2 = "61d2e3f6-bcb7-4cd1-a81e-4f8f497f0da2;0;192.100.0.102:4362;2014-02-14;283;0;354;23;0;;;\"0x8D15A2913C934DE\";Thursday, 19-Jun-14 22:58:10 GMT;";

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            Delimit(testString1, ',', 22);
            Delimit(testString2, ';', 6);
        }
        sw.Stop();
        Console.WriteLine($"==>{sw.ElapsedMilliseconds}");
        Console.ReadLine();
    }

    static string Delimit(string stringUnderTest, char delimiter, int skipCount)
    {
        const int SIZE_OF_UNICHAR = 2;

        int i = 0;
        int index = 0;
        char c = Char.MinValue;

        GCHandle handle = GCHandle.Alloc(stringUnderTest, GCHandleType.Pinned);
        try
        {
            IntPtr ptr = handle.AddrOfPinnedObject();
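            // Skip past the first skipCount delimiters. Each read advances one UTF-16 code
            // unit (2 bytes) and, on a little-endian machine, looks only at its low byte.
            // There is no bounds check, so the requested field must exist.
            for (i = 0; i < skipCount; i++)
                while ((char)Marshal.ReadByte(ptr, index += SIZE_OF_UNICHAR) != delimiter) ;
            // Scan on to the delimiter that closes the field (a closing delimiter is assumed).
            i = index;
            while ((c = (char)Marshal.ReadByte(ptr, i += SIZE_OF_UNICHAR)) != delimiter) ;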
        }
        finally
        {
            if (handle.IsAllocated)
                handle.Free();
        }

        return stringUnderTest.Substring((index + SIZE_OF_UNICHAR) >> 1, (i - index - SIZE_OF_UNICHAR) >> 1);
    }
}
Clay Ver Valen
  • So what is the benefit? It takes forever to execute (around 18 to 25 seconds) – Derviş Kayımbaşıoğlu Jan 18 '19 at 23:34
  • Console.WriteLine is very slow. Calling `Delimit` for both strings in a for loop 1,000,000 times gives me a run time of ~35 seconds if I don't call Console.Write or Console.WriteLine. If I only call Console.Write and not Console.WriteLine, the run time increases to ~100 times the earlier run time. As for not creating new Strings: making the function track offset and count and then return a new String via stringUnderTest.Substring results in a process memory of ~12MB while running, and it maxes out at ~9MB when not creating the new String. Not sure how you're getting 25 sec run times. – Clay Ver Valen Jan 19 '19 at 00:42
  • @Simonare - I think you'll find this new solution runs fast enough. – Clay Ver Valen Jan 19 '19 at 02:03