3

Can anyone explain this behaviour?

var culture = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = culture;
"daab".StartsWith("da"); //false

I know that it can be fixed by specifying StringComparison.InvariantCulture. But I'm just confused by the behavior.

I also know that "aA" and "AA" are not considered the same in a Danish case-insensitive comparison (see http://msdn.microsoft.com/en-us/library/xk2wykcz.aspx), which explains this:

String.Compare("aA", "AA", new CultureInfo("da-DK"), CompareOptions.IgnoreCase) // -1 (not equal)

Is this linked to the behavior of the first code snippet?

Matt Warren
  • It seems like the second `a` gives the first one another context, so `aa` is basically treated as one entity. But I can't tell whether it's a bug or a feature, because I do not know the Danish language. – Nappy Jun 30 '11 at 14:14
  • Right. See the Wikipedia article about the Danish/Norwegian alphabet, especially the "History" part: http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet – Maximilian Mayerl Jun 30 '11 at 14:23
  • I agree. "aa" in Danish ("å" in modern Danish) is a different letter from "a", therefore "daab" doesn't start with "da", just as "dåb" doesn't start with "da". (You'll have to check whether "å" is the same as "aa"; in theory it should be.) – MRAB Jun 30 '11 at 14:28
  • `"daab".StartsWith("då")` also returns false... apparently the Danish language works in mysterious ways, unless it's the .NET Framework ;) – Thomas Levesque Jun 30 '11 at 14:42
  • In Danish ae = æ, oe = ø, aa = å. Æ Ø Å (written here in alphabetical order) are the only three special characters in Danish. ae, oe and aa are remnants from the past and are never used in everyday language, only in proper nouns. More importantly, the letter pairs can also be used as a word, e.g. 'ae' means 'stroke/pat'. And I'm pretty sure they also occur as part of a word where they do not represent æ, ø or å, but I can't remember one of these words right now. I'll try to look for one. – Lars Udengaard Jul 01 '11 at 07:28

3 Answers

6

Here is a test that illustrates the problem. daab and dåb (the same word in old and modern spelling, respectively) mean baptism/christening.

using System.Globalization;
using System.Threading;
using Xunit;

public class can_handle_remnant_of_danish_language
{
    [Fact]
    public void daab_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("då")); // Fails
    }

    [Fact]
    public void daab_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("da")); // Fails
    }

    [Fact]
    public void daab_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("daa")); // Succeeds
    }

    [Fact]
    public void dåb_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("daa")); // Fails
    }

    [Fact]
    public void dåb_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("da")); // Fails
    }

    [Fact]
    public void dåb_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("då")); // Succeeds
    }
}

All of the above tests should succeed according to my understanding of the language, and I'm Danish! I don't have a degree in grammar, though. :-)

Seems like a bug to me.

Lars Udengaard
  • Even more inconsistent: 'foer', which means 'lining' but can also be 'foer' as in 'før', which means 'before'. Given the same setup as above, the test 'foer_start_with_fo' does not fail the way the test 'daab_start_with_da' does. – Lars Udengaard Jul 01 '11 at 08:17
5

As Nappy said, it's a feature of the Danish language, in which "aa" and "å" are still treated as the same letter. Danish has two more letters, æ and ø, but I am not sure whether they can also be written using two letters.

I think that in the second example "aA" is left unchanged while "AA" is converted to "Å". Just to confuse things even more, "Aa" is considered equal to "AA" and "aa" only in a case-insensitive comparison.

Martin Brenden
  • What I personally do not understand is why this is relevant in _STRING_ comparisons, because, for example, `oe` is not considered the same as `ö` in the German `de-DE` culture. So why is it so important in Danish? – Nappy Jun 30 '11 at 14:33
  • Not an expert on Danish or German, but I think `aa` is still very common in Danish. I know that in Norwegian `aa` is used only rarely, in old family and place names – Martin Brenden Jun 30 '11 at 16:19
  • I was recently stumped by this exact curiosity ([http://stackoverflow.com/questions/15547663/aaaa-startswithaaa-returns-false]), and Martin Brenden is correct: many surnames, and even some of the largest cities, still use "aa" (e.g. Aarhus, Aalborg). Without it, a search using "aa" wouldn't return my last name :) – sondergard Mar 21 '13 at 15:24
0

The modern Danish spelling of "baptism", namely dåb, is certainly not considered to start with da by a Danish speaker. If daab is meant as an old-fashioned spelling of dåb, whether it starts with da is a somewhat philosophical question. For (modern) collation purposes, however, it does not: alphabetically, such a daab sorts after disk, not before.
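The collation point can be checked with a culture-aware sort. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes da-DK culture data is available on the machine. The culture-aware order comes from the platform's culture data, so no output is claimed for it here.

```csharp
using System;
using System.Globalization;

public static class DanishSortSketch
{
    // Returns a sorted copy of 'words' using the given comparer.
    public static string[] SortedCopy(string[] words, StringComparer comparer)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, comparer);
        return copy;
    }

    public static void Main()
    {
        var words = new[] { "disk", "daab", "dato" };

        // Culture-aware: uses the Danish collation rules of the platform.
        var danish = StringComparer.Create(CultureInfo.GetCultureInfo("da-DK"), false);
        Console.WriteLine(string.Join(", ", SortedCopy(words, danish)));

        // Ordinal: compares UTF-16 code units, so "daab" sorts before "disk".
        Console.WriteLine(string.Join(", ", SortedCopy(words, StringComparer.Ordinal)));
    }
}
```

Comparing the two printed lines shows whether the Danish comparer moves daab relative to disk on a given runtime.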

However, if your string is not supposed to represent natural language, but is instead some kind of technical code, like hexadecimal digits, surely you do not want to use any culture-specific rules. The solution here is not to use the invariant culture. The invariant culture has (English) rules itself!

Instead, you want to use ordinal comparison.

Ordinal comparison simply compares the strings char by char, without any assumptions of what sequences are "equivalent" in some sense. (Technical remark: Each char is a UTF-16 code unit, not a "character". Ordinal comparison is ignorant of the rules of Unicode normalization.)
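To make the technical remark concrete, here is a small sketch (the helper names are hypothetical, not part of .NET). It shows ordinal comparison treating a decomposed and a precomposed "née" as different, and the two comparing equal once both sides are normalized to the same form first.

```csharp
using System;
using System.Text;

public static class OrdinalSketch
{
    // Ordinal: compares UTF-16 code units one by one.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Normalizes both sides to NFC first, then compares ordinally.
    public static bool NormalizedOrdinalEquals(string a, string b) =>
        string.Equals(a.Normalize(NormalizationForm.FormC),
                      b.Normalize(NormalizationForm.FormC),
                      StringComparison.Ordinal);

    public static void Main()
    {
        string decomposed = "ne\u0301e";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        string precomposed = "n\u00E9e";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

        Console.WriteLine(OrdinalEquals(decomposed, precomposed));            // False
        Console.WriteLine(NormalizedOrdinalEquals(decomposed, precomposed));  // True
        Console.WriteLine("daab".StartsWith("da", StringComparison.Ordinal)); // True
    }
}
```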

I think the confusion arises because, by default, some string methods use a culture-aware comparison, and other string methods use the ordinal comparison.

The following examples all use a culture-aware comparison:

"Straße".StartsWith("Strasse", StringComparison.CurrentCulture)
"Straße".Equals("Strasse", StringComparison.CurrentCulture)
"ne\u0301e".StartsWith("née", StringComparison.CurrentCulture)
"ne\u0301e".Equals("née", StringComparison.CurrentCulture)

"Straße".StartsWith("Strasse")  // CurrentCulture is default for 'StartsWith'!
"ne\u0301e".StartsWith("née")   // CurrentCulture is default for 'StartsWith'!

Each of the above may depend on the .NET version as well! (As an example, the first one gives true if the current culture is the invariant culture and you are under .NET Framework 4.8; but it gives false if the current culture is the invariant culture and you use .NET 6.)

But these examples use ordinal comparison:

"Straße".StartsWith("Strasse", StringComparison.Ordinal)
"Straße".Equals("Strasse", StringComparison.Ordinal)
"ne\u0301e".StartsWith("née", StringComparison.Ordinal)
"ne\u0301e".Equals("née", StringComparison.Ordinal)

"Straße".Equals("Strasse")  // Ordinal is default for 'Equals'!
"ne\u0301e".Equals("née")   // Ordinal is default for 'Equals'!

So remember to check what the default comparison is for the string method you use, and specify the opposite one if needed. (Or always specify the comparison, even when redundant, if you prefer.)
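One way to follow that advice is to route prefix checks through small helpers that always name the comparison. The following is just a sketch with made-up helper names; the culture-aware result depends on the culture data of the runtime, so only the ordinal result is annotated.

```csharp
using System;
using System.Globalization;

public static class PrefixSketch
{
    // For identifiers, keys, hex strings and other non-linguistic text.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    // For natural-language text, with the culture passed in explicitly
    // instead of relying on the current thread's culture.
    public static bool StartsWithCulture(string text, string prefix, CultureInfo culture) =>
        culture.CompareInfo.IsPrefix(text, prefix, CompareOptions.IgnoreCase);

    public static void Main()
    {
        var danish = CultureInfo.GetCultureInfo("da-DK");

        Console.WriteLine(StartsWithOrdinal("daab", "da"));            // True
        Console.WriteLine(StartsWithCulture("Aalborg", "aa", danish)); // depends on culture data
    }
}
```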

Jeppe Stig Nielsen