regular expression to detect ISO language code

Question

I'm trying to detect whether a combo box contains an ISO language code (e.g. en-GB, el-GR, ru-RU etc.), which consists of 2 alphabetical characters, a dash, and 2 more alphabetical characters (in upper case, or it might not matter?).

I was wondering, is there a way I can achieve this using regular expressions?

I'm assuming the expression would look something like this (but I don't have much experience in the subject):

string pattern = @"^\a{2,2}-\a{2,2}";

I can't take credit for this, but see http://www.pelagodesign.com/blog/2009/05/20/iso-8601-date-validation-that-doesnt-suck/. Also try Google before StackOverflow :-) — pixelbadger, Mar 14 '13 at 07:06
Note that what you do is detect something that looks like language code, but not validate whether it is a real language code. — nhahtdh, Mar 14 '13 at 08:20

score 17 · Accepted Answer · edited Aug 21 '17 at 18:21

Something like this should work: ^[a-z]{2}-[A-Z]{2}$.

The ^ anchor tells the regex engine that the match must start at the beginning of the string. [a-z] means any lower case letter between a and z, and {2} means exactly 2 repetitions of the preceding element. The same explanation holds for the [A-Z]{2} part after the literal dash. Finally, the $ anchor tells the regex engine that the match must end at the end of the string.
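As a minimal sketch of how the pattern above could be used from C#, the snippet below wraps it in a small helper. The class name `LanguageCodeCheck` and the method `LooksLikeLanguageCode` are made up for illustration; only `Regex.IsMatch` comes from the framework. It checks the shape of the text only, not whether the code really exists.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LanguageCodeCheck
{
    // Two lowercase letters, a hyphen, two uppercase letters, and nothing else.
    private static readonly Regex Pattern = new Regex("^[a-z]{2}-[A-Z]{2}$");

    // Hypothetical helper: true when the text has the xx-XX shape.
    public static bool LooksLikeLanguageCode(string text)
    {
        return text != null && Pattern.IsMatch(text);
    }

    public static void Main()
    {
        foreach (string candidate in new[] { "en-GB", "el-GR", "EN-gb", "english", "en" })
        {
            Console.WriteLine("{0}: {1}", candidate, LooksLikeLanguageCode(candidate));
        }
    }
}
```

With these inputs, only "en-GB" and "el-GR" should pass; the other three fail because of letter case, length, or the missing second part.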

edited Aug 21 '17 at 18:21

Anirvan

answered Mar 14 '13 at 07:10

npinti

Better to use the /gi flags for a case-insensitive, global search if it's a URL you are working on (not sure whether g is needed when replacing with the string's replace in JavaScript) – George Birbilis Aug 19 '14 at 15:00
According to https://www.andiamo.co.uk/resources/iso-language-codes/, not all codes have the second part. – mythofechelon Oct 07 '21 at 21:57
This does not account for codes that don't have the second part, it would only work for those like `en-US` – wtfzambo Dec 02 '21 at 11:13

Mario Vázquez · Answer 2 · 2020-02-26T08:57:13.073

The accepted solution by @npinti may not be accurate enough if we take a closer look at the list of ISO 639x codes here. Alternatively, you can get a culture list yourself by invoking the static method below (C# code):

System.Globalization.CultureInfo.GetCultures(CultureTypes.AllCultures);

Among the returned values, you will find examples that don't match, such as "Cy-az-AZ" (3 codes!), "zh-CHS" (3 letters!) or "en-029" (numbers!). Curiously enough, the one with numbers does not appear in the MS link above, even though the CultureInfo method returns it.

This article from here discusses the one with numbers.

So it doesn't seem to be an easy problem. We could try a slightly more complex regex like the one shown below, but this doesn't guarantee that we'll be able to distinguish an ISO culture code from arbitrary other text. IMO, if we really need to be 100% reliable, probably the only choice is to look the code up in the list of codes and require an exact match.

Regex option:

^[^-]{2,3}-[^-]{2,3}(-[^-]{2,3})?$
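To illustrate the trade-off described above, here is a small sketch that runs the looser pattern against a few strings. The class name `LooseCultureShape`, the method `Matches`, and the sample inputs are chosen for illustration only. It shows that the pattern accepts the unusual culture names mentioned earlier, but also text that is clearly not a culture code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LooseCultureShape
{
    // Two or three hyphen-separated parts, each 2-3 characters of anything except '-'.
    private static readonly Regex Pattern =
        new Regex("^[^-]{2,3}-[^-]{2,3}(-[^-]{2,3})?$");

    // Hypothetical helper wrapping the pattern.
    public static bool Matches(string text)
    {
        return text != null && Pattern.IsMatch(text);
    }

    public static void Main()
    {
        // The first three are culture names from the answer; "??-!!" is deliberate junk.
        string[] samples = { "Cy-az-AZ", "zh-CHS", "en-029", "??-!!", "en", "english-US" };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0}: {1}", sample, Matches(sample));
        }
    }
}
```

The first four samples match, including the junk string "??-!!", while "en" and "english-US" do not. That false positive is why the lookup approach is the safer choice when accuracy matters.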

Find option:

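// Requires: using System; using System.Globalization;
public static bool IsCultureCode(string code)
{
    // Use CultureTypes.AllCultures instead to also include neutral cultures such as "en".
    CultureInfo[] cultures = CultureInfo.GetCultures(CultureTypes.SpecificCultures);
    foreach (CultureInfo culture in cultures)
    {
        // Case-insensitive exact match against the culture name, e.g. "en-US".
        if (culture.Name.Equals(code, StringComparison.InvariantCultureIgnoreCase))
            return true;
    }
    return false;
}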

score 6 · Answer 3 · answered Dec 01 '20 at 21:55

^[a-z]{2}(-[A-Z]{2})?$

The first two characters are required and must be lowercase.
The last two characters, if present, must be uppercase and separated from the first two by a hyphen.

matches:

en
en-US
tr
tr-TR
ru
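A short sketch of this pattern in C#, with an optional case-insensitive variant for input such as "EN-us". The class name `OptionalRegionCheck` and its two methods are hypothetical; `RegexOptions.IgnoreCase` and `RegexOptions.CultureInvariant` are standard .NET options. Whether case should matter is an assumption that depends on where the codes come from.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalRegionCheck
{
    private const string Shape = "^[a-z]{2}(-[A-Z]{2})?$";

    // Strict: lowercase language, optional uppercase region.
    private static readonly Regex Strict = new Regex(Shape);

    // Relaxed: same shape, but letter case is ignored.
    private static readonly Regex Relaxed =
        new Regex(Shape, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static bool IsStrictMatch(string text)
    {
        return text != null && Strict.IsMatch(text);
    }

    public static bool IsRelaxedMatch(string text)
    {
        return text != null && Relaxed.IsMatch(text);
    }

    public static void Main()
    {
        foreach (string candidate in new[] { "en", "en-US", "EN-us", "en-", "eng" })
        {
            Console.WriteLine("{0}: strict={1}, relaxed={2}",
                candidate, IsStrictMatch(candidate), IsRelaxedMatch(candidate));
        }
    }
}
```

Here "en" and "en-US" pass both checks, "EN-us" passes only the relaxed one, and "en-" and "eng" fail both.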

answered Dec 01 '20 at 21:55

Mehmet N. Yarar

score 0 · Answer 4 · answered Dec 27 '17 at 06:42

A regex for parsing the LCID out of a file path:

using System;
using System.Text.RegularExpressions;

public class Example {
    public static void Main()
    {
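        // Match a path segment that is exactly "xx" or "xx-XX"; the hyphen is only allowed together with a region.
        string pattern = @"\\(?<lcid>(?<locale>[a-z]{2})(?:-(?<region>[A-Z]{2}))?)\\";
        string input = @"C:\MainFolder\Folder\en\translations.json C:\MainFolder\Folder\en-AU\translations.json";

        // Without greedy (.*) wrappers around the pattern, each path yields its own match.
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Group lcid = m.Groups["lcid"];
            Console.WriteLine("'{0}' found at index {1}.", lcid.Value, lcid.Index);
        }
    }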
}
