What data type should I use for IETF language codes?

Question

I'm designing a schema for a message on a microblogging platform, which will need to have a defined language. These messages will be distributed across networks between many nodes, so I need to make the schema compact but still completely multilingual.

I'm going to use the IETF language codes (en, en-AU, etc.), but I need to know whether there is a specific way to represent them efficiently. There are multiple standards for language tags, and the current specification, RFC 5646, is convoluted because it maintains backward compatibility with the previous standards. I don't fully understand the space requirements, as there are multiple subtags.

What is the most space-efficient way to represent an IETF language code?

score 24 · Accepted Answer · edited Oct 07 '21 at 07:59

I think the IETF spec for handling locale codes is indeed the industry "Best Common Practice", though not without compromises made to maintain backward compatibility and the like. I still recommend adapting it to your needs, since the most important internationalization libraries and standards (Unicode, ICU) use it.

BCP 47 (RFC 5646), section 4.4.1, recommends allowing for a tag length of 35 characters:

   language      =  8 ; longest allowed registered value
                      ;   longer than primary+extlang
                      ;   which requires 7 characters
   script        =  5 ; if not suppressed: see Section 4.1
   region        =  4 ; UN M.49 numeric region code
                      ;   ISO 3166-1 codes require 3
   variant1      =  9 ; needs 'language' as a prefix
   variant2      =  9 ; very rare, as it needs
                      ;   'language-variant1' as a prefix

   total         = 35 characters

              Figure 7: Derivation of the Limit on Tag Length

But if you only care about language and script (and not the region, which drives some locale-sensitive data such as date and time formats), you can make do with 13 characters at most: 8 for the language plus 5 for the script.
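The arithmetic behind these limits can be sketched in a few lines of Python. The budgets are copied from the figure above; the names BUDGET and max_length are made up for this illustration. Note that every budget except the one for the language includes the leading hyphen separator.

```python
# Per-subtag character budgets from RFC 5646, Figure 7.
# Each value except "language" includes the leading "-" separator.
BUDGET = {
    "language": 8,
    "script": 5,
    "region": 4,
    "variant1": 9,
    "variant2": 9,
}


def max_length(*subtags):
    """Sum the character budgets of the chosen subtags (hypothetical helper)."""
    return sum(BUDGET[name] for name in subtags)


full = max_length(*BUDGET)                      # all five subtags
lang_script = max_length("language", "script")  # language and script only

print(full, lang_script)
```

This reproduces the 35-character total from the RFC and the 13-character limit for language plus script.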

In practice, most tags will be just a two-character language code. The only common examples I deal with regularly that require script subtags are sr-Latn and sr-Cyrl (Serbian written in Latin or Cyrillic, respectively), zh-Hant (Traditional Chinese), and zh-Hans (Simplified Chinese). You will also most likely not need variants, so most real-world locale codes should fit within a 17-character limit (8 for the language, 5 for the script, 4 for the region).
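If you go with the 17-character limit, you may want to check incoming tags before storing them. Below is a minimal sketch, assuming you only accept the language[-script][-region] shape. The pattern is a deliberate simplification and not the full RFC 5646 grammar (no extlang, variants, extensions, private-use or grandfathered tags), and the names SIMPLE_TAG, MAX_LEN and parse_simple_tag are made up for this example.

```python
import re

# Simplified shape: language[-script][-region].
# Assumption for illustration only; this is NOT the full RFC 5646 grammar.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,8})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

MAX_LEN = 17  # 8 (language) + 5 (script) + 4 (region)


def parse_simple_tag(tag):
    """Return (language, script, region) in conventional casing, or None."""
    if len(tag) > MAX_LEN:
        return None
    match = SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    language = match.group("language").lower()
    script = match.group("script")
    region = match.group("region")
    # Conventional casing: "sr", "Latn", "RS".
    return (
        language,
        script.title() if script else None,
        region.upper() if region else None,
    )


print(parse_simple_tag("sr-latn-rs"))
print(parse_simple_tag("en-AU"))
print(parse_simple_tag("de-1996"))  # variant subtag, rejected by this sketch
```

Tags are case-insensitive, so normalizing the casing before storage keeps equal tags byte-identical. Anything that carries a variant or extension is rejected here; whether that is acceptable depends on which tags your platform needs to support.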
