10

My problem is that String.IndexOf returns -1. I would expect it to return 0.

The parameters:

text = C:\\Users\\User\\Desktop\\Sync\\̼ (note the Combining Seagull Below character)

stringToTrim = C:\\Users\\User\\Desktop\\Sync\\

When I check for the index using int index = text.IndexOf(stringToTrim);, the value of index is -1. Using an ordinal string comparison solved the problem:

int index = text.IndexOf(stringToTrim, StringComparison.Ordinal);

Reading online, a lot of Unicode characters (like U+00B5 and U+03BC) map to the same symbol, so it would be a good idea to expand on this and normalize both strings:

int index = text.Normalize(NormalizationForm.FormKD).IndexOf(stringToTrim.Normalize(NormalizationForm.FormKD), StringComparison.Ordinal);

Is this the correct approach to find the index at which one string contains all the sequential characters of another? As I understand it, you normalize when you want to match on symbols, but you don't normalize when you want to compare characters by their encoded values (and therefore treat look-alike symbols as distinct). Is that right? Also, could someone please explain why int index = text.IndexOf(stringToTrim); did not find a match at the start of the string? In other words, what is it actually doing under the covers? I would have expected it to search from the beginning of the string to the end.

Alexandru
  • I copied / pasted this into LinqPad and got "0" back - maybe I don't understand combining characters. – dnord Dec 15 '14 at 20:52
  • @dnord Try this: `"C:\\Users\\User\\Desktop\\Sync\\̼".IndexOf("C:\\Users\\User\\Desktop\\Sync\\");` Make sure to copy this text entirely/exactly from here! – Alexandru Dec 15 '14 at 20:57
  • (Thanks, that worked.) Then I surely agree with the top-rated answer below: either combining characters change the previous character (by combining) or you've found a weird bug that at least Microsoft warned you about. – dnord Dec 15 '14 at 21:00
  • @dnord You may also find this character interesting: http://unicode-table.com/en/search/?q=U%2B202E (right-to-left override). Make no mistake: if you highlight what is shown as blank, paste this character somewhere, and start typing, characters appear to the left instead of to the right, so "like this" becomes "siht ekil" as you type it out. – Alexandru Dec 15 '14 at 21:05
  • @dnord There are also various ways of exploiting this character, but that's a bit off-topic for my question. Still, it's something I love showing people: http://krebsonsecurity.com/2011/09/right-to-left-override-aids-email-attacks/ – Alexandru Dec 15 '14 at 21:53
  • Some characters don't play nice. ̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼ – Alexandru Mar 27 '15 at 16:19
  • Which is why it's a good idea to whitelist Unicode ranges (PS: Unicode is a moving target, so don't blacklist): http://jrgraphix.net/research/unicode_blocks.php – Alexandru Mar 27 '15 at 16:32

2 Answers

6

The behavior makes perfect sense to me. You are using a combining character, which is combined with the preceding character, turning it into a different character, one which won't match the '\\' character you've specified at the end of your search string. That prevents the entire string you're looking for from being found. If you looked for "C:\\Users\\User\\Desktop\\Sync" instead, it would have found it.

Using StringComparison.Ordinal tells .NET to ignore culture-specific and linguistic rules and compare only the numeric values of the characters (their UTF-16 code units). That does what you wanted, so yes…that's what you should do.
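To make the difference concrete, here is a minimal sketch that runs both searches on the strings from the question. The class and helper names (IndexOfDemo, CultureIndex, OrdinalIndex) are made up for illustration. I have left the culture-sensitive result unannotated because it can depend on the runtime and the current culture; the question reports -1.

```csharp
using System;

public static class IndexOfDemo
{
    // Hypothetical helper: same behavior as text.IndexOf(value),
    // which performs a culture-sensitive search using the current culture.
    public static int CultureIndex(string text, string value)
        => text.IndexOf(value, StringComparison.CurrentCulture);

    // Hypothetical helper: compares raw UTF-16 code units only.
    public static int OrdinalIndex(string text, string value)
        => text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        string stringToTrim = "C:\\Users\\User\\Desktop\\Sync\\";
        // Append U+033C COMBINING SEAGULL BELOW, as in the question.
        string text = stringToTrim + "\u033C";

        // Culture-sensitive: the trailing backslash and the combining mark
        // may be treated as one unit, so the prefix may not be found.
        Console.WriteLine(CultureIndex(text, stringToTrim));

        // Ordinal: the code units of the prefix match at position 0.
        Console.WriteLine(OrdinalIndex(text, stringToTrim)); // prints 0
    }
}
```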

The "correct approach" depends entirely on what behavior you want. A lot of string manipulation involves text being presented to or provided by the user and should be done in a culture-aware and Unicode-aware way. Other times, that isn't desirable. It's important to select the right approach for your needs.
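For the normalize-then-ordinal variant from the question, a rough sketch might look like the following. NormalizedOrdinalIndex is a hypothetical helper name, and the sketch assumes a runtime where string.Normalize is available for non-ASCII input. It uses the two code points the question mentions: U+00B5 (micro sign) has a compatibility decomposition to U+03BC (Greek small letter mu), so FormKD makes them equal.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Hypothetical helper: normalize both strings to FormKD,
    // then search by raw code unit values.
    public static int NormalizedOrdinalIndex(string text, string value)
    {
        string normalizedText = text.Normalize(NormalizationForm.FormKD);
        string normalizedValue = value.Normalize(NormalizationForm.FormKD);
        return normalizedText.IndexOf(normalizedValue, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string micro = "\u00B5"; // MICRO SIGN
        string mu = "\u03BC";    // GREEK SMALL LETTER MU

        // Different code points, so a plain ordinal search fails.
        Console.WriteLine(micro.IndexOf(mu, StringComparison.Ordinal)); // prints -1

        // After FormKD normalization both are U+03BC.
        Console.WriteLine(NormalizedOrdinalIndex(micro, mu)); // prints 0
    }
}
```

One thing to keep in mind with this approach: the returned index refers to the normalized string, not the original. Normalization can change a string's length (for example, U+00E9 decomposes into two code units under FormKD), so the index cannot always be used directly with Substring on the original text.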

Peter Duniho
1

Yes, you should use StringComparison.Ordinal to guarantee that culture is ignored when comparing values. This is especially necessary for strings that are considered culture-invariant "by default", which includes file paths.

When not using StringComparison.Ordinal, it is possible to introduce subtle bugs: http://msdn.microsoft.com/en-us/library/dd465121(v=vs.110).aspx

When culturally independent string data, such as XML tags, HTML tags, user names, file paths, and the names of system objects, are interpreted as if they were culture-sensitive, application code can be subject to subtle bugs, poor performance, and, in some cases, security issues.
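Applied to the file path case from the question, a sketch of an ordinal prefix trim could look like this. TrimPrefix is a hypothetical helper, not a framework method, and the file name in the usage is made up for illustration.

```csharp
using System;

public static class PathPrefix
{
    // Hypothetical helper: removes prefix from the start of text
    // when the code units match exactly; otherwise returns text unchanged.
    public static string TrimPrefix(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal)
            ? text.Substring(prefix.Length)
            : text;
    }

    public static void Main()
    {
        string root = "C:\\Users\\User\\Desktop\\Sync\\";

        // Illustrative path under the root from the question.
        Console.WriteLine(TrimPrefix(root + "notes.txt", root)); // prints notes.txt

        // No match at the start, so the input comes back unchanged.
        Console.WriteLine(TrimPrefix("D:\\Other\\notes.txt", root)); // prints D:\Other\notes.txt
    }
}
```

If the paths should be matched without regard to case, StringComparison.OrdinalIgnoreCase can be swapped in; whether that is appropriate depends on the file system.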

A side benefit of StringComparison.Ordinal is better performance: http://msdn.microsoft.com/en-us/library/ms973919.aspx

PiotrWolkowski