Why isn't string.Normalize consistent depending on the context?

Question

I have the following code:

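// NormalizationForm is declared in System.Text.
string input = "ç";
// FormD: canonical decomposition (base letter followed by combining marks).
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();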

I build this code with Visual Studio 2010, targeting .NET 4, on 64-bit Windows 7.

I run it in a unit test project (platform: Any CPU) under three test runners and check the content of chars:

Visual Studio unit tests : chars contains { 231 }.
ReSharper : chars contains { 231 }.
NCrunch : chars contains { 99, 807 }.

I could not find anything in the MSDN documentation that describes differing behaviors.

So why do I get different results? NCrunch's result is the one I expect, and I would expect the same from the other runners.

Edit: I switched back to .NET 3.5 and still have the same issue.

Hmm, I get { 99, 807 } with Visual Studio... This would imply there is something about the configuration of your project... Maybe. — zmilojko, May 10 '12 at 08:11
@zmilojko, thanks for testing. I get the same result as you in a blank new project. I am now comparing the two projects (WinMerge on the csproj files) but have not found any relevant difference yet, which is why I posted this question: to understand which context could induce a different behavior. — remio, May 10 '12 at 10:15
@AakashM, In all cases, `Thread.CurrentThread.CurrentCulture` is `fr-FR`. I also checked `Thread.CurrentThread.CurrentUICulture` which is `en-US` in all cases. — remio, May 11 '12 at 13:03
@MattHickford, I gently move my mouse over the `chars` variable in the debugger, then unfold the `+` sign. — remio, May 11 '12 at 13:05
@AakashM, I used the `ç` character in my example, but I get the same behavior with all of the French accented characters I have tested. — remio, May 11 '12 at 13:36
If I had to guess, I'd say something strange is going on with the build configurations, causing an old version of the code to be run by ReSharper and Visual Studio, but one that NCrunch ignores. For example, a library set to build with the Any CPU configuration, but the GUI set to x86. — Phil Martin, May 12 '12 at 07:51
@PhilMartin, I am also suspicious about something like that. So, I cleaned it all (hopefully), rebuilt it, also tried it on another computer. Several times. Same result. — remio, May 12 '12 at 08:10
@PhilMartin, however, I would be really interested in understanding which parameter makes `string.Normalize` behave differently. — remio, May 12 '12 at 08:17

eis · Accepted Answer · 2012-05-14T15:38:50.657

7

The String.Normalize(NormalizationForm) documentation says that the returned string's

binary representation is in the normalization form specified by the normalizationForm parameter.

which means you are using FormD normalization in every case, so CurrentCulture and similar settings should not matter.
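A quick way to check the claim about culture is to normalize the same string under several cultures and compare the code units. This is only a sketch: the class and method names (`CultureCheck`, `NormalizeUnder`) are made up for illustration, the culture list is arbitrary, and it assumes a runtime where those cultures are available.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Threading;

public static class CultureCheck
{
    // Hypothetical helper: returns the UTF-16 code units of input after
    // FormD normalization, with the given culture set on the current thread.
    public static int[] NormalizeUnder(string cultureName, string input)
    {
        CultureInfo previous = Thread.CurrentThread.CurrentCulture;
        try
        {
            Thread.CurrentThread.CurrentCulture = new CultureInfo(cultureName);
            return input.Normalize(NormalizationForm.FormD)
                        .Select(c => (int)c)
                        .ToArray();
        }
        finally
        {
            // Restore the original culture so the check has no side effects.
            Thread.CurrentThread.CurrentCulture = previous;
        }
    }

    public static void Main()
    {
        // "\u00E7" is "ç" written as an escape, so the source file
        // encoding cannot affect the input.
        foreach (string name in new[] { "fr-FR", "en-US", "tr-TR" })
        {
            int[] units = NormalizeUnder(name, "\u00E7");
            Console.WriteLine(name + ": " + string.Join(", ", units));
        }
    }
}
```

If every line shows the same code units, culture can be ruled out as the cause.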

The only thing I can think of that could change, then, is the "ç" character itself. That character is interpreted according to the character encoding that is either assumed or configured for Visual Studio source files. In short, I think NCrunch assumes a different source file encoding than the other runners.

Based on a quick search of the NCrunch forum, there was a mention of some UTF-8 -> UTF-16 conversion, so I would check that.
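To illustrate the general idea (not the exact values from the question), here is a small sketch of how the same bytes produce different strings depending on the encoding used to read them, which in turn changes what Normalize receives. The class name `EncodingDemo` and its helpers are hypothetical, and ISO-8859-1 is used only as an example of a mismatched encoding; I don't know which encodings the test runners actually assume.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decodes bytes with the named encoding and
    // returns the UTF-16 code units of the resulting string.
    public static int[] Decode(byte[] bytes, string encodingName)
    {
        string text = Encoding.GetEncoding(encodingName).GetString(bytes);
        return text.Select(c => (int)c).ToArray();
    }

    public static void Main()
    {
        // The bytes a UTF-8 source file would contain for "ç": 0xC3 0xA7.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00E7");

        // Read with the right encoding: one character, U+00E7 (231).
        Console.WriteLine(string.Join(", ", Decode(utf8Bytes, "utf-8")));

        // Read with a mismatched single-byte encoding: two unrelated
        // characters, U+00C3 and U+00A7 (195, 167).
        Console.WriteLine(string.Join(", ", Decode(utf8Bytes, "iso-8859-1")));

        // Normalize only sees the decoded string, so its output differs too.
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
        string normalized = misread.Normalize(NormalizationForm.FormD);
        Console.WriteLine(string.Join(", ", normalized.Select(c => (int)c)));
    }
}
```

Dumping the code units of `input` before calling Normalize, in each test runner, would show whether the runners even start from the same string.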

edited May 14 '12 at 15:38

answered May 14 '12 at 11:51


1

Indeed, I was strongly suspecting the encoding of the `ç` character in the source / runtime code. I started playing with the encoding of the source file, with no luck. Then I tried to read the string from an external file, which failed until I forced its encoding to `UTF-8`. Finally, I updated my declaration of `input` to `string input = new string(new[] { (char)231 });`, and... it works! – remio May 14 '12 at 13:13
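A variant of the same workaround, as a sketch: a `\u` escape keeps the source file pure ASCII, so the string no longer depends on how any tool decodes the file. The class and method names (`NormalizeDemo`, `DecomposedCodeUnits`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizeDemo
{
    // Hypothetical helper: returns the UTF-16 code units of s after
    // canonical decomposition (FormD).
    public static int[] DecomposedCodeUnits(string s)
    {
        return s.Normalize(NormalizationForm.FormD)
                .Select(c => (int)c)
                .ToArray();
    }

    public static void Main()
    {
        // Same character as (char)231, written as an escape sequence.
        string input = "\u00E7";

        // U+00E7 decomposes to U+0063 (c) + U+0327 (combining cedilla).
        Console.WriteLine(string.Join(", ", DecomposedCodeUnits(input))); // 99, 807
    }
}
```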
