70

I'm doing simple string input parsing and I am in need of a string tokenizer. I am new to C# but have programmed Java, and it seems natural that C# should have a string tokenizer. Does it? Where is it? How do I use it?

andrewrk

11 Answers

121

You could use the String.Split method.

using System;

class ExampleClass
{
    public ExampleClass()
    {
        string exampleString = "there is a cat";
        // Split the string on spaces to separate the words.
        string[] words = exampleString.Split(' ');
        // Prints "there", "is", "a", "cat", one per line.
        foreach (string word in words)
        {
            Console.WriteLine(word);
        }
    }
}

For more information, see Sam Allen's article about splitting strings in C# (Performance, Regex).

Tom Hofman
Davy Landman
  • https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.primitives.stringtokenizer?view=dotnet-plat-ext-6.0 – juFo Apr 12 '23 at 12:26
26

I just want to highlight the power of C#'s Split method and give a more detailed comparison, particularly from someone who comes from a Java background.

C#'s Split can split on several delimiters at once, which makes regular expressions less necessary (although if you need regex, use regex by all means!). Java's StringTokenizer can too, since it treats every character of its delimiter string as a separator. Take for example this:

str.Split(new char[] { ' ', '.', '?' })

This splits on three different delimiters and returns an array of tokens. We can also remove empty entries by passing a second parameter to the call above:

str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries)
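To make the difference between the two calls concrete, here is a small sketch that runs both on the same input. The sample sentence and the `SplitDemo` class name are made up for illustration.

```csharp
using System;

static class SplitDemo
{
    static readonly char[] Delimiters = { ' ', '.', '?' };

    // Splits on any of the delimiters; adjacent delimiters yield empty strings.
    public static string[] SplitKeepEmpty(string str) => str.Split(Delimiters);

    // Same split, but empty entries are dropped from the result.
    public static string[] SplitDropEmpty(string str) =>
        str.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

    static void Main()
    {
        string str = "Is it here? Yes. Good";
        Console.WriteLine(string.Join("|", SplitKeepEmpty(str))); // Is|it|here||Yes||Good
        Console.WriteLine(string.Join("|", SplitDropEmpty(str))); // Is|it|here|Yes|Good
    }
}
```

The empty strings in the first result come from "? " and ". ", where two delimiters sit next to each other.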

One thing Java's StringTokenizer does have that I believe C#'s Split is lacking (at least Java 7 has this feature) is the ability to keep the delimiter(s) as tokens. C#'s Split will discard the delimiters. This could be important in, say, some NLP applications, but for more general-purpose applications it might not be a problem.
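If the delimiters do need to survive, one possible workaround (not part of String.Split itself) is Regex.Split with the delimiter wrapped in a capturing group, since captured text is included in the returned array. A minimal sketch, with the `KeepDelimitersDemo` name and the sample input chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class KeepDelimitersDemo
{
    // The parentheses capture the delimiter, and Regex.Split places
    // captured text in the result between the surrounding tokens.
    public static string[] SplitKeepingDelimiters(string str) =>
        Regex.Split(str, @"([ .?])");

    static void Main()
    {
        // Tokens and delimiters alternate in the output.
        Console.WriteLine(string.Join("|", SplitKeepingDelimiters("a b.c")));
    }
}
```

Note that a delimiter at the very start or end of the input produces an empty string there, so some filtering may still be needed.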

Samuel
demongolem
19

The Split method of a string is what you need. In fact, the StringTokenizer class in Java is effectively deprecated (its documentation calls it a legacy class and discourages its use in new code) in favor of Java's String.split method.

Brady Moritz
Tim Jarvis
  • 2
    AFAIK, it's indeed deprecated, but not in favor of the `String#split` method. More or less in favor of the `Scanner` class. – bvdb Jan 20 '17 at 12:50
3

I think the nearest in the .NET Framework is

string.Split()
Steve Morgan
2
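// Requires: using System; using System.Collections.Generic; using System.Linq;
// Lower-case the text, split on spaces, and keep only the letters of each word.
_words = new List<string>(YourText.ToLower().Trim('\n', '\r').Split(' ').
            Select(x => new string(x.Where(Char.IsLetter).ToArray())));

Or

// Same idea, but keep the original case and allow digits as well as letters.
_words = new List<string>(YourText.Trim('\n', '\r').Split(' ').
            Select(x => new string(x.Where(Char.IsLetterOrDigit).ToArray())));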
Skyler
2

The closest equivalent to Java's method is:

Regex.Split(string, pattern);

where

  • string - the text you need to split
  • pattern - the regular expression pattern (passed as a string) that matches the separators
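As a rough illustration of the call above, the sketch below splits on commas or semicolons and swallows any whitespace around them. The pattern, the sample text, and the `RegexSplitDemo` name are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSplitDemo
{
    // Separator: a comma or semicolon plus any surrounding whitespace.
    public static string[] SplitFields(string text) =>
        Regex.Split(text, @"\s*[,;]\s*");

    static void Main()
    {
        // Prints one, two, three and four on separate lines.
        foreach (string field in SplitFields("one, two;three ,  four"))
        {
            Console.WriteLine(field);
        }
    }
}
```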
dimodi
neronovs
2

For complex splitting, you could write a regex that matches the tokens themselves and read them from the resulting MatchCollection.
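One way this might look is to describe what a token is, instead of what separates tokens, and then walk the matches. In this sketch the pattern (a double-quoted run, or a run of non-whitespace characters) and the `MatchTokenizerDemo` name are illustrative assumptions, not a general-purpose tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchTokenizerDemo
{
    // A token is either a double-quoted run or a run of non-whitespace characters.
    static readonly Regex TokenPattern = new Regex("\"[^\"]*\"|\\S+");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        // Regex.Matches returns a MatchCollection; each Match is one token.
        foreach (Match m in TokenPattern.Matches(input))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    static void Main()
    {
        foreach (string token in Tokenize("copy \"my file.txt\" dest"))
        {
            Console.WriteLine(token);
        }
    }
}
```

Quoted tokens keep their surrounding quotes here; trimming them would be a separate step.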

1

Use Regex.Split(string, "#|#");

demongolem
0

Read this: the Split function has an overload that takes an array of separators. http://msdn.microsoft.com/en-us/library/system.stringsplitoptions.aspx

Musa
-1

If you're trying to do something like splitting command line arguments in a .NET Console app, you're going to have issues because .NET is either broken or is trying to be clever (which means it's as good as broken). I needed to be able to split arguments by the space character, preserving any literals that were quoted so they didn't get split in the middle. This is the code I wrote to do the job:

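// Requires: using System.Collections.Generic; using System.Text;
private static List<string> Tokenise(string value, char separator)
{
    List<string> result = new List<string>();
    StringBuilder sb = new StringBuilder();
    bool insideQuote = false;
    foreach (char c in value)
    {
        if (c == '"')
        {
            // Toggle quoted mode; the quote character itself stays in the token.
            insideQuote = !insideQuote;
        }
        if (c == separator && !insideQuote)
        {
            // End of a token. Runs of separators produce empty tokens, which are skipped,
            // so the input does not need its whitespace collapsed beforehand
            // (doing so would also alter the text inside quoted literals).
            if (sb.ToString().Trim().Length > 0)
            {
                result.Add(sb.ToString().Trim());
            }
            sb.Clear();
        }
        else
        {
            sb.Append(c);
        }
    }
    // Flush the final token, if any.
    if (sb.ToString().Trim().Length > 0)
    {
        result.Add(sb.ToString().Trim());
    }

    return result;
}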
Nigel Thomas
-2

If you are using C# 3.0 (.NET Framework 3.5) or later, you could write an extension method on System.String that does the splitting you need. You can then use the syntax:

string.SplitByMyTokens();

More info and a useful example from MS here http://msdn.microsoft.com/en-us/library/bb383977.aspx
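A minimal sketch of what such an extension method might look like follows. Only the name SplitByMyTokens comes from the answer above; the delimiter set and the class names are hypothetical placeholders.

```csharp
using System;

public static class StringTokenExtensions
{
    // Hypothetical delimiter set; replace with whatever your input needs.
    private static readonly char[] MyTokens = { ' ', ',', ';' };

    // The "this" modifier on the first parameter makes this an extension
    // method, so it can be called as if it were a member of string.
    public static string[] SplitByMyTokens(this string value)
    {
        return value.Split(MyTokens, StringSplitOptions.RemoveEmptyEntries);
    }
}

static class ExtensionDemo
{
    static void Main()
    {
        string[] parts = "alpha, beta;gamma".SplitByMyTokens();
        Console.WriteLine(string.Join("|", parts)); // alpha|beta|gamma
    }
}
```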

Paul Shannon
  • 10
    This is a solution to a local problem, not an obvious/general purpose System.String operation. A utility class might be in order, but it would be extension method abuse to use an extension method here. – Sam Harwell Jul 15 '09 at 21:48