Efficiently split a string in format "{ {}, {}, ...}"

Question

I have a string in the following format.

string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";

private void parsestring(string input)
{
    string[] tokens = input.Split(','); // I thought this would split on the , separating the {}
    foreach (string item in tokens)     // but that doesn't seem to be what it is doing
    {
       Console.WriteLine(item); 
    }
}

My desired output should be something like this below:

112,This is the first day 23/12/2009
132,This is the second day 24/12/2009

But currently, I get the one below:

{112
This is the first day 23/12/2009
{132
This is the second day 24/12/2009

I am very new to C# and any help would be appreciated.

If the format is really that simple, split on `"},{"` for the separator, and then remove the orphan `{` from the first item in the result array, and the orphan `}` from the last item in the result array. — 15ee8f99-57ff-4f92-890c-b56153, Sep 26 '19 at 19:22
You can use the TextFieldParser class even though it's defined in the VisualBasic namespace. https://learn.microsoft.com/en-us/dotnet/api/microsoft.visualbasic.fileio.textfieldparser?view=netframework-4.8 Also, if your inner text will not contain double quotes, then you can replace the curly braces with double quotes, then use the text field parser to ignore the quotes — Jeremy, Sep 26 '19 at 19:27
@EdPlunkett When I use `"},{"` in the Split method, I get the error "cannot convert from string to char". I am not sure what is wrong with it? — Analia, Sep 26 '19 at 19:27
@Analia `string[] tokens = input.Split("},{");` works for me. — 15ee8f99-57ff-4f92-890c-b56153, Sep 26 '19 at 19:28
@EdPlunkett It will preserve the first { and the last } though — Abishek Aditya, Sep 26 '19 at 19:33
@AbishekAditya "...and then remove the orphan `{` from the first item in [t]he result array..." etc., see first comment above. — 15ee8f99-57ff-4f92-890c-b56153, Sep 26 '19 at 19:34
Oops, I missed that. But it still seems to me that breaking down into the complete subparts and removing the brackets using a loop is more readable than adding an auxiliary function that might confuse people during review. Anyway, that is just nitpicking. Your answer works perfectly for this use case — Abishek Aditya, Sep 26 '19 at 19:37
@AbishekAditya Mine was a bit of a cave-man approach, but I felt that it would be more understandable to a novice than regular expressions. A novice could apply the same insight to similar problems without learning any new line-noise languages (I love regexes, but they're what they are). All the better to have both, of course, since the site is meant for all skill levels. rschoenbach's answer is kind of neat too. — 15ee8f99-57ff-4f92-890c-b56153, Sep 26 '19 at 20:19

OwenP · Answer 1 · 2019-09-26T19:54:49.330

Don't fixate on Split() being the solution! This is a simple thing to parse without it. Regex answers are probably also OK, but I imagine in terms of raw efficiency making "a parser" would do the trick.

IEnumerable<string> Parse(string input)
{
    var results = new List<string>();
    int startIndex = 0;            
    int currentIndex = 0;

    while (currentIndex < input.Length)
    {
        var currentChar = input[currentIndex];
        if (currentChar == '{')
        {
            startIndex = currentIndex + 1;
        }
        else if (currentChar == '}')
        {
            int endIndex = currentIndex - 1;
            int length = endIndex - startIndex + 1;
            results.Add(input.Substring(startIndex, length));
        }

        currentIndex++;
    }

    return results;
}

So it's not short on lines. It iterates once, and only performs one allocation per "result". With a little tweaking I could probably make a C# 8 version with Index types that cuts down on allocations. This is probably good enough.

You could spend a whole day figuring out how to understand the regex, but this is as simple as it comes:

Scan every character.
If you find {, note the next character is the start of a result.
If you find }, consider everything from the last noted "start" until the index before this character as "a result".

This won't catch mismatched brackets: for strings like "}}{" it silently returns bogus results instead of reporting an error. You didn't ask for handling those cases, but it's not too hard to improve this logic to catch it and scream about it or recover.

For example, you could reset startIndex to something like -1 when } is found. From there, if you find { when startIndex != -1, you've found "{{". If you find } when startIndex == -1, you've found "}}". And if you exit the loop with startIndex > -1, that's an opening { with no closing }. That leaves the string "}whoops" as an uncovered case, but it could be handled by initializing startIndex to, say, -2 and checking for that specifically. Do that with a regex, and you'll have a headache.
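The validation idea above could be sketched roughly as follows. This is only one possible reading of it: the names BraceParser and ParseStrict are made up for illustration, the choice of FormatException is an assumption, and the sketch initializes startIndex to -1 rather than -2, so a stray leading } is simply reported as an unexpected closing brace.

```csharp
using System;
using System.Collections.Generic;

public static class BraceParser
{
    // Hypothetical strict variant of Parse. startIndex == -1 means
    // "not currently inside a pair of braces".
    public static List<string> ParseStrict(string input)
    {
        var results = new List<string>();
        int startIndex = -1;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '{')
            {
                // A second '{' before any '}' means "{{".
                if (startIndex != -1)
                    throw new FormatException($"Unexpected '{{' at index {i}.");
                startIndex = i + 1;
            }
            else if (c == '}')
            {
                // A '}' with no open '{' means "}}" or a stray leading '}'.
                if (startIndex == -1)
                    throw new FormatException($"Unexpected '}}' at index {i}.");
                results.Add(input.Substring(startIndex, i - startIndex));
                startIndex = -1;
            }
        }

        // Still inside braces at the end: an opening '{' was never closed.
        if (startIndex != -1)
            throw new FormatException("Missing closing '}'.");

        return results;
    }
}
```

As in the original loop, characters outside the braces (such as the comma between groups) are skipped without complaint; rejecting those would need one more check.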

The main reason I suggest this is you said "efficiently". icepickle's solution is nice, but Split() makes one allocation per token, then you perform allocations for each TrimX() call. That's not "efficient". That's "n + 2 allocations".

Fair point on the efficiency, but I only wanted to make a more clear example, as I think the OP needs to start with the basics :) Nice solution, but why not `yield`, your solution seems to be perfect for it ;) — Icepickle, Sep 26 '19 at 20:27
@Icepickle I thought about `yield` when I got to the end and avoided it for the same reason you quoted: I wanted to stick to the basics. `yield return` takes a bit to explain and is kind of janky until you get the hang of it! — OwenP, Sep 27 '19 at 14:36
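For readers curious about the `yield` variant discussed in these comments, a minimal sketch might look like this. The names LazyBraceParser and ParseLazy are invented for illustration; the scan is the same as in the answer, only the intermediate List is gone.

```csharp
using System;
using System.Collections.Generic;

public static class LazyBraceParser
{
    // Hypothetical name. Same single pass as Parse, but each result is
    // handed to the caller as soon as its closing '}' is seen.
    public static IEnumerable<string> ParseLazy(string input)
    {
        int startIndex = 0;
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '{')
                startIndex = i + 1;
            else if (input[i] == '}')
                yield return input.Substring(startIndex, i - startIndex);
        }
    }
}
```

One thing to keep in mind is that an iterator method like this is lazy: nothing is scanned until the caller starts enumerating, for example with `foreach (var item in LazyBraceParser.ParseLazy(instance))`.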

score 6 · Answer 2 · answered Sep 26 '19 at 19:28

Use Regex for this:

// Requires: using System.Linq; using System.Text.RegularExpressions;
string[] tokens = Regex.Split(input, @"}\s*,\s*{")
  .Select(i => i.Replace("{", "").Replace("}", ""))
  .ToArray();

Pattern explanation:

\s* - match zero or more white space characters
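To see what the `\s*` parts buy you, here is a small self-contained sketch of the same split applied to input with irregular spacing between the groups. The class and method names (RegexSplitDemo, SplitGroups) and the sample input are assumptions made for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Hypothetical wrapper around the split shown above.
    public static string[] SplitGroups(string input)
    {
        // Split on "}", optional whitespace, ",", optional whitespace, "{".
        return Regex.Split(input, @"}\s*,\s*{")
            // Remove the brace left on the first and last piece.
            .Select(s => s.Replace("{", "").Replace("}", ""))
            .ToArray();
    }
}
```

With the made-up input `"{1,a} , {2,b},{3,c}"`, SplitGroups should return the three items `1,a`, `2,b` and `3,c`, since both `} , {` and `},{` match the pattern. Note that Replace removes every brace, so this assumes the inner text never contains `{` or `}`.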

Michał Turczyn

Why not use lookahead, he only wants to split on , that come after a } – Abishek Aditya Sep 26 '19 at 19:29
@AbishekAditya It doesn't make any difference :) – Michał Turczyn Sep 26 '19 at 19:29

score 5 · Accepted Answer · answered Sep 26 '19 at 19:34

Well, if you have a method called ParseString, it should return something (and ParseTokens might be a more accurate name for it). With that change, you arrive at the following code:

private static IEnumerable<string> ParseTokens(string input)
{
    return input
        // removes the leading {
        .TrimStart('{')
        // removes the trailing }
        .TrimEnd('}')
        // splits on the different token in the middle
        .Split( new string[] { "},{" }, StringSplitOptions.None );
}

The reason it didn't work for you before is that Split(',') splits on every , in the string, including the ones inside the braces, not just the ones separating the {} groups.

Now if you put this all together, you get something like in this dotnetfiddle

using System;
using System.Collections.Generic;

public class Program
{
    private static IEnumerable<string> ParseTokens(string input)
    {
        return input
            // removes the leading {
            .TrimStart('{')
            // removes the trailing }
            .TrimEnd('}')
            // splits on the different token in the middle
            .Split( new string[] { "},{" }, StringSplitOptions.None );
    }

    public static void Main()
    {
        var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
        foreach (var item in ParseTokens( instance ) ) {
            Console.WriteLine( item );
        }
    }
}

score 1 · Answer 4 · answered Sep 26 '19 at 19:28

Add using System.Text.RegularExpressions; to top of the class

and use the regex split method

string[] tokens = Regex.Split(input, "(?<=}),");

Here, we use a positive lookbehind to split on a , that comes immediately after a }

(Note: (?<=X) matches only at a position immediately preceded by X; the X itself is not consumed, so the } stays in the result.)

Abishek Aditya

@Analia you can remove the { and } using a simple text replace after this – Abishek Aditya Sep 26 '19 at 19:34
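Combining the lookbehind split with the brace removal mentioned in this comment could look something like the sketch below. The names LookbehindSplitDemo and SplitGroups are invented for illustration, and Trim is used here in place of a text replace so that only the outer braces of each piece are removed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindSplitDemo
{
    // Hypothetical helper: split on commas that follow '}', then strip
    // the surrounding braces from each piece.
    public static string[] SplitGroups(string input)
    {
        // (?<=}) is a lookbehind: the comma only matches when the
        // character just before it is '}'.
        return Regex.Split(input, "(?<=}),")
            .Select(s => s.Trim('{', '}'))
            .ToArray();
    }
}
```

For the string from the question, each piece comes out of the split as `{...}` and Trim leaves just the text between the braces.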

score 1 · Answer 5 · answered Sep 26 '19 at 19:32

If you don't want to use regular expressions, the following code will produce your required output.

        string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";

        string[] tokens = instance.Replace("},{", "}{").Split('}', '{');
        foreach (string item in tokens)
        {
            if (string.IsNullOrWhiteSpace(item)) continue;

            Console.WriteLine(item);
        }

        Console.ReadLine();
