6

I'm trying to detect whether a combo box contains an ISO language code (e.g. en-GB, el-GR, ru-RU etc.), which consists of 2 alphabetical characters, a dash, and 2 more alphabetical characters (in upper case, or might that not matter?).

I was wondering, is there a way I can achieve this using regular expressions?

I'm assuming the expression would look something like this (but I don't have much experience in the subject):

string pattern = @"^\a{2,2}-\a{2,2}";
neo571
Themos
  • 1
    I can't take credit for this, but see http://www.pelagodesign.com/blog/2009/05/20/iso-8601-date-validation-that-doesnt-suck/. Also try Google before StackOverflow :-) – pixelbadger Mar 14 '13 at 07:06
  • Note that what you do is detect something that looks like a language code; it does not validate whether it is a real language code. – nhahtdh Mar 14 '13 at 08:20

4 Answers

17

Something like so should work: ^[a-z]{2}-[A-Z]{2}$.

The ^ anchor requires the match to start at the beginning of the string. [a-z] means any lower case letter between a and z, and {2} means exactly 2 repetitions of the preceding element. The same explanation holds for -[A-Z]{2}. Finally, the $ anchor requires the match to end at the end of the string, so nothing may follow the second pair of letters.
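A minimal C# sketch of how this pattern could be applied to the combo box text. The class name `StrictTagCheck`, the method `LooksLikeLanguageTag` and the `ignoreCase` parameter are made up for illustration; the optional case-insensitive mode is only there because the question asks whether case matters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class StrictTagCheck
{
    // The pattern from this answer: 2 lowercase letters, a hyphen, 2 uppercase letters.
    private const string Pattern = @"^[a-z]{2}-[A-Z]{2}$";

    // Returns true when 'text' has the shape xx-XX.
    // With ignoreCase = true, inputs such as "en-gb" are accepted as well.
    public static bool LooksLikeLanguageTag(string text, bool ignoreCase = false)
    {
        if (text == null) return false;

        RegexOptions options = RegexOptions.CultureInvariant;
        if (ignoreCase) options |= RegexOptions.IgnoreCase;

        return Regex.IsMatch(text, Pattern, options);
    }
}
```

For a combo box this might be called as `StrictTagCheck.LooksLikeLanguageTag(comboBox.Text)`, where `comboBox` stands for whatever control holds the value. As the comments on the question point out, this only checks the shape of the text, not whether the code actually exists.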

Anirvan
npinti
  • better use the /gi options to do a case-insensitive search if it's a URL you work on, and to do a global find (not sure if g is needed when replacing with the string's replace in JavaScript) – George Birbilis Aug 19 '14 at 15:00
  • 2
    According to https://www.andiamo.co.uk/resources/iso-language-codes/, not all codes have the second part. – mythofechelon Oct 07 '21 at 21:57
  • 1
    This does not account for codes that don't have the second part, it would only work for those like `en-US` – wtfzambo Dec 02 '21 at 11:13
6

The accepted solution by @npinti may not be accurate enough if we take a closer look at the list of ISO 639x codes here. Alternatively, you can get the list of cultures yourself by calling the static method below (C# code):

System.Globalization.CultureInfo.GetCultures(CultureTypes.AllCultures);

Among the returned values you will find samples that don't match, such as "Cy-az-AZ" (3 parts!), "zh-CHS" (3 letters!) or "en-029" (digits!). Curiously enough, the one with digits does not appear in the MS link above, even though it is returned by the CultureInfo method.

This article from here discusses the one with numbers.

So it isn't an easy problem. We could try a slightly more permissive regex like the one shown below, but this doesn't guarantee that we can distinguish an ISO culture code from anything else. IMO, if we really need to be 100% reliable, probably the only choice is to look the code up in the list of codes and find an exact match.

Regex option:

^[^-]{2,3}-[^-]{2,3}(-[^-]{2,3})?$
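To illustrate the trade-off, here is a small C# sketch that wraps the pattern above (the names `LooseCultureCheck` and `LooksLikeCultureCode` are invented for this example). It accepts the three awkward samples mentioned earlier, but because `[^-]` allows any character except a hyphen, it also accepts strings that are clearly not culture codes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LooseCultureCheck
{
    // 2 or 3 non-hyphen characters, a hyphen, 2 or 3 more,
    // and optionally a third hyphen-separated part.
    private const string Pattern = @"^[^-]{2,3}-[^-]{2,3}(-[^-]{2,3})?$";

    public static bool LooksLikeCultureCode(string text)
    {
        return text != null && Regex.IsMatch(text, Pattern);
    }
}
```

With this sketch, "Cy-az-AZ", "zh-CHS" and "en-029" all pass, but so does something like "!!-??", while a bare "en" is rejected because the hyphen is mandatory.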

Find option:

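using System;
using System.Globalization;
using System.Linq;

public static class CultureCodes
{
    // Returns true if 'code' is the name of a culture known to .NET (case-insensitive).
    public static bool IsCultureCode(string code)
    {
        // Use CultureTypes.AllCultures to also accept neutral cultures such as "en".
        return CultureInfo.GetCultures(CultureTypes.SpecificCultures)
            .Any(c => c.Name.Equals(code, StringComparison.InvariantCultureIgnoreCase));
    }
}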
Mario Vázquez
6

^[a-z]{2}(-[A-Z]{2})?$

  • first two chars must exist and be lowercase
  • last two chars if exist, must be uppercase and separated from the first 2 with a hyphen

matches:

  • en
  • en-US
  • tr
  • tr-TR
  • ru
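A possible C# sketch of the same pattern with named groups, so the two parts can be read back after matching. The class `OptionalRegionCheck`, the method `TryParse` and the group names `language` and `region` are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class OptionalRegionCheck
{
    // Same shape as the pattern above, with named groups added.
    private static readonly Regex Tag =
        new Regex(@"^(?<language>[a-z]{2})(-(?<region>[A-Z]{2}))?$");

    // Returns false when 'text' does not have the expected shape.
    // On success, 'region' is an empty string if the input had no region part.
    public static bool TryParse(string text, out string language, out string region)
    {
        language = string.Empty;
        region = string.Empty;
        if (text == null) return false;

        Match m = Tag.Match(text);
        if (!m.Success) return false;

        language = m.Groups["language"].Value;
        region = m.Groups["region"].Value; // empty when the optional group did not take part
        return true;
    }
}
```

Like the pattern itself, this only checks the shape: a string such as "zz-ZZ" would still be accepted.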
Mehmet N. Yarar
0

Regex for parsing a locale code (the group named lcid below) out of a file path:

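using System;
using System.Text.RegularExpressions;

public class Example {
    public static void Main()
    {
        // A path segment made of a 2-letter language, optionally followed by "-" and a 2-letter region.
        // There is no leading/trailing (.*): being greedy, it would swallow the whole input into one match.
        string pattern = @"\\(?<lcid>(?<locale>[a-z]{2})(?:-(?<region>[A-Z]{2}))?)\\";
        string input = @"C:\MainFolder\Folder\en\translations.json C:\MainFolder\Folder\en-AU\translations.json";

        foreach (Match m in Regex.Matches(input, pattern))
        {
            // Report the captured locale code rather than the whole match (which includes the backslashes).
            Group lcid = m.Groups["lcid"];
            Console.WriteLine("'{0}' found at index {1}.", lcid.Value, lcid.Index);
        }
    }
}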
flibustier