How does the String.Split method determine separator precedence when passed multiple multi-character separators?

Question

If you have this code:

"......".Split(new String[]{"...", ".."}, StringSplitOptions.None);

The resulting array elements are:

 1. ""
 2. ""
 3. ""

Now if you reverse the order of the separators,

"......".Split(new String[]{"..", "..."}, StringSplitOptions.None);

The resulting array elements are:

 1. ""
 2. ""
 3. ""
 4. ""

From these 2 examples I feel inclined to conclude that the Split method recursively tokenizes as it goes through each element of the array from left to right.

However, once we throw in separators that contain alphanumeric characters into the equation, it is clear that the above theory is wrong.

  "5.x.7".Split(new String[]{".x", "x."}, StringSplitOptions.None)

results in: 1. "5" 2. ".7"

   "5.x.7".Split(new String[]{"x.", ".x"}, StringSplitOptions.None)

results in: 1. "5" 2. ".7"

This time we obtain the same output, which means that the rule theorized from the first set of examples no longer applies (i.e., if separator precedence were always determined by the position of the separator within the array, then in the last example we would have obtained "5." and "7" instead of "5" and ".7").

As to why I am wasting my time trying to guess how .NET standard APIs work: I want to implement similar functionality for my Java apps, but neither StringTokenizer nor org.apache.commons.lang.StringUtils provides the ability to split a String using multiple multi-character separators (and even if I were to find an API that does provide this ability, it would be hard to know whether it always tokenizes using the same algorithm as the String.Split method).

`String#split` method in Java takes a regex as split criteria. So, you can merge as many split criteria using `pipe(|)`. Further, it would be better if you can post the real problem here, rather than equivalent code in other language. Not all people know multiple languages. — Rohit Jain, Feb 07 '13 at 22:43
@RohitJain: Even so, I would be interested in learning what .NET's algorithm is. — Matthew, Feb 07 '13 at 22:44
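
For reference, here is a minimal Java sketch of the regex approach suggested in the first comment. It is only an illustration: the class name `RegexSplitSketch` and the helper `split` are made up for this example. It assumes that each separator should be matched literally (hence `Pattern.quote`) and that trailing empty strings should be kept, as with `StringSplitOptions.None` (hence the `-1` limit). Java's regex alternation tries the alternatives in order at each position, so on the inputs from the question it gives the same results as the .NET calls; it is not claimed to match .NET in every edge case.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexSplitSketch {
    // Hypothetical helper: joins the separators into one alternation,
    // quoting each so that characters like '.' are matched literally.
    static String[] split(String input, String... separators) {
        String regex = Arrays.stream(separators)
                .filter(s -> s != null && !s.isEmpty()) // skip unusable separators
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (regex.isEmpty()) {
            // Assumption for this sketch: with no usable separator, return the input whole.
            return new String[] { input };
        }
        // A limit of -1 keeps trailing empty strings in the result.
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(split("......", "...", "..").length);          // 3
        System.out.println(split("......", "..", "...").length);          // 4
        System.out.println(Arrays.toString(split("5.x.7", ".x", "x."))); // [5, .7]
        System.out.println(Arrays.toString(split("5.x.7", "x.", ".x"))); // [5, .7]
    }
}
```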

score 7 · Accepted Answer · answered Feb 07 '13 at 23:12


From MSDN:

To avoid ambiguous results when strings in separator have characters in common, the Split operation proceeds from the beginning to the end of the value of the instance, and matches the first element in separator that is equal to a delimiter in the instance. The order in which substrings are encountered in the instance takes precedence over the order of elements in separator.

So, for the first case ".." and "..." are found on the same position and their order in separator is used to determine the used one. For the second case, ".x" is found before "x." and the order of elements in separator does not apply.
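
Since the question is about porting this to Java, the rule quoted above can be sketched as a plain left-to-right scan. This is a sketch based only on the documented behaviour, not the actual .NET implementation; the class name `OrderedSplitSketch` and its `split` method are hypothetical. It assumes `StringSplitOptions.None` semantics (empty pieces are kept) and simply ignores null or empty separators.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedSplitSketch {
    // Scans the input from left to right. At each position the separators
    // are tried in array order and the first one that matches is used, so
    // position in the input takes precedence over position in the array.
    static List<String> split(String input, String... separators) {
        List<String> parts = new ArrayList<>();
        int start = 0; // start index of the piece currently being collected
        int i = 0;     // current scan position
        while (i < input.length()) {
            String matched = null;
            for (String sep : separators) {
                if (sep != null && !sep.isEmpty() && input.startsWith(sep, i)) {
                    matched = sep;
                    break; // array order only breaks ties at the same position
                }
            }
            if (matched != null) {
                parts.add(input.substring(start, i));
                i += matched.length(); // resume scanning after the separator
                start = i;
            } else {
                i++;
            }
        }
        parts.add(input.substring(start)); // trailing piece, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("......", "...", "..").size()); // 3
        System.out.println(split("......", "..", "...").size()); // 4
        System.out.println(split("5.x.7", ".x", "x."));          // [5, .7]
        System.out.println(split("5.x.7", "x.", ".x"));          // [5, .7]
    }
}
```

In the "......" cases both separators match at the same position, so the array order decides. In the "5.x.7" cases only ".x" matches at index 1, and by the time the scan reaches index 3 the "x" has already been consumed, so "x." never gets a chance.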

answered Feb 07 '13 at 23:12

J. Calleja


+1 I was just about to post this. Nothing like going directly to the documentation... – Yuck Feb 07 '13 at 23:13
It is surprising to me that the docs are that rigorous. Wouldn't have expected that, as they are sometimes quite superficial. – usr Feb 07 '13 at 23:15

score 1 · Answer 2 · edited Jan 10 '14 at 19:54


I've had a quick look at this, and it appears that the private method MakeSeparatorList in the string class builds an array of indexes, but it will match the first one it finds.

So, because .x comes before x. in both of your examples, that index is stored.

This is the code I used to test:

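using System.Diagnostics;
using System.Linq;
using System.Reflection;

var s = "5.x.7";

string[] separators = new string[] { "x.", ".x" };
// Output buffers filled in by MakeSeparatorList: match positions and matched separator lengths.
int[] sepList = new int[1024];
int[] lengthList = new int[1024];

// MakeSeparatorList is a private implementation detail; Last(...) is used here to pick the string[] overload.
MethodInfo dynMethod = s.GetType().GetMethods(BindingFlags.NonPublic | BindingFlags.Instance).Last(x => x.Name == "MakeSeparatorList");
dynMethod.Invoke(s, new object[] { separators, sepList, lengthList });

// Pause here to inspect sepList and lengthList in the debugger.
Debugger.Break();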

See this screenshot:

(My screenshot isn't showing? :/)

Notice how the index is 1 (which results in .x) even though .x is the second entry in the array.

edited Jan 10 '14 at 19:54

Mohsen Safari


answered Feb 07 '13 at 23:13

Simon Whitehead


Can I just say... wow that code in the string class is horrible. `num1`, `num4`, pointers with horrible names.. it's so bad. – Simon Whitehead Feb 07 '13 at 23:16
Local names are not persisted in IL. Sometimes Reflector can extract the names from the PDBs, I think, but not always. It has to make up these names. – usr Feb 08 '13 at 11:08
It does quite a good job with most assemblies.. but the BCL libraries are particularly bad – Simon Whitehead Feb 08 '13 at 11:32

score 0 · Answer 3 · answered Feb 07 '13 at 23:29

String.Split splits at the first position where any of the separators matches. A simple example: say you call split("a", "b") and the string contains "appaleisbigapll". The algorithm is simple: it starts with the first character and tries to match either "a" or "b". If it finds one of these, it splits there and continues with the next character. In your example,

"5.x.7" with ".x", "x.": the separators are combined as with an "or" operator, so it finds ".x" first and then checks the remaining ".7". No separator matches there, so it leaves ".7" as it is. Result: "5" and ".7".

The same happens in the second example: it finds ".x", and since the rule is ".x" or "x.", it continues with ".7"; the precedence is not applied here. And for your first set of examples, yes it does the split operation recursively.

"Yes it does the split operation recursively" I would say is incorrect. It only appears that way in the first example because they are in the order that it matches. It doesn't _actually_ happen that way. — Simon Whitehead, Feb 07 '13 at 23:40
