Convert language code three characters (ISO 639-2) to two-character code (ISO 639-1)

Question

I'm developing an Android app using the Text-to-Speech (TTS) engine. The TTS component returns the available languages as a list of Locale objects.

But both methods Locale::getLanguage and Locale::getISO3Language of each Locale object return the same 3-character code (ISO 639-2). Usually getLanguage() returns the language code in 2-character format (ISO 639-1), but on one particular device the code is three characters. The same happens for the country code. However, I need both codes in two-character format (ISO 639-1 for the language, ISO 3166-1 alpha-2 for the country).

Does anyone know a way to do the conversion? Please note, I need a corresponding Locale object with both language and country codes in two-letter format.

I see that [`Locale::getLanguage`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.html#getLanguage()) oddly does not specify if it returns the 2 character code or the 3 character code. It mentions ISO 639, and uses a 2-letter code example, but does not actually say 2 or 3. Yet the [`Locale.getISOLanguages`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.html#getISOLanguages()) does specify that it does return 2-letter codes. — Basil Bourque, Jul 19 '20 at 15:46
Perhaps you should edit your Question to specify the problematic device and its version of Android. — Basil Bourque, Jul 19 '20 at 15:50
Indeed, the documentation says Locale::getLanguage can return both types of code, so it is not a bug. All the devices I had seen so far return the two-letter code, so I did not notice the "problem", but lately I found a Samsung device with Android 10 returning a three-letter code. — Suppaman, Jul 19 '20 at 16:17
No, [`Locale::getLanguage`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.html#getLanguage()) does not say it can return either 2 or 3 letters. It does not say anything about length nor about ISO 639-1 versus 639-2. Can you cite your reference? — Basil Bourque, Jul 19 '20 at 19:16
In the Android documentation, [Locale#getLanguage()](https://developer.android.com/reference/java/util/Locale#getLanguage()) seems to talk about a generic ISO 639 code being returned. All the Android devices I had tested returned the two-letter code, but some days ago I found a Samsung S10e that returns the three-letter code. — Suppaman, Jul 20 '20 at 19:48
[That Android doc](https://developer.android.com/reference/java/util/Locale#getLanguage()) reads the same as the [Java doc page](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.html#getLanguage()), a mention of "ISO 639" and an example of 2-letter code. From my overall reading, I suspect the authors of the spec intended the 2-letter code to be the driving force, with 3-letter as a variation. But unfortunately they failed to document that method properly. — Basil Bourque, Jul 20 '20 at 21:28
What do you get on the problem device when calling: `Locale.CANADA_FRENCH.toString()` ? Two or three letter codes? Edit your Question to show code and result. And please document the device and Android version causing the problem. — Basil Bourque, Jul 20 '20 at 21:33

Basil Bourque · Answer 1 · 2020-07-20T21:32:35.873

tl;dr

As a workaround, make your own Map< Locale , String> mapping each known Locale to its 2-letter language code per ISO 639-1.

new LocaleLookup().lookupTwoLetterLanguageCode( Locale.CANADA_FRENCH )

fr

Or maybe just parse the text of Locale::toString.

Locale
.CANADA_FRENCH      
.toString()         // fr_CA
.split( "_" )       // Array: { "fr" , "CA" }
[ 0 ]               // Grab first element in array, "fr".

fr

For two-letter country code, use the second part of that split string. Use index of 1 instead of 0.

Locale
.CANADA_FRENCH      
.toString()         // fr_CA
.split( "_" )       // Array: { "fr" , "CA" }
[ 1 ]               // Grab second element in array, "CA".

CA

Bug?

It seems to be a bug that Locale::getLanguage would return a 3-letter code. The Javadoc uses a 2-letter code in its code example. But unfortunately the Javadoc fails to specify explicitly 2 or 3 letters. I suggest you file a request with the OpenJDK project to clarify this Javadoc.

Workaround

As a workaround, perhaps you could call Locale.getISOLanguages to get an array of all known languages as 2-letter codes. Then loop over those. For each, use the code seen in the Javadoc, passing the 2-letter code to construct a Locale object for comparison:

if (locale.getLanguage().equals(new Locale("he").getLanguage()))

From this build your own Map between locale and 2-letter code.
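
A related variation, in case the device really does hand back three-letter codes: instead of comparing `getLanguage` outputs, key the lookup on the three-letter form that `getISO3Language` reports for each known two-letter code. The following is only a minimal sketch and not code I have run on the problem device. The class name `Iso3LanguageLookup` and the method `toTwoLetter` are made up for illustration, and it assumes the device reports the same three-letter codes that `Locale::getISO3Language` produces.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper for illustration; not part of java.util.
@SuppressWarnings("deprecation") // new Locale(String) is deprecated on newer JDKs but still works
class Iso3LanguageLookup {
    // Key: three-letter code as reported by getISO3Language. Value: two-letter code.
    private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

    static {
        for (String twoLetter : Locale.getISOLanguages()) {
            try {
                String threeLetter = new Locale(twoLetter).getISO3Language();
                // If several two-letter codes share a three-letter code, the first one seen is kept.
                ISO3_TO_ISO2.putIfAbsent(threeLetter, twoLetter);
            } catch (MissingResourceException e) {
                // No three-letter code available for this language; skip it.
            }
        }
    }

    // Returns the two-letter code for a three-letter input,
    // or the input (lower-cased) when it is not three letters or has no known mapping.
    static String toTwoLetter(String languageCode) {
        String code = languageCode.toLowerCase(Locale.ROOT);
        if (code.length() != 3) {
            return code;
        }
        return ISO3_TO_ISO2.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println(toTwoLetter("eng")); // en
        System.out.println(toTwoLetter("fra")); // fr
        System.out.println(toTwoLetter("en"));  // en (already two letters, returned unchanged)
    }
}
```

Codes without a two-letter equivalent in the table come back unchanged, so the caller can still decide what to do with them.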

Example class

Here is my first stab at such a workaround map.

In the constructor, we get a list of all known locales, and all known 2-letter ISO 639-1 language codes.

Next we do a nested loop. For each locale, we loop all the 2-letter language codes until we find a match. Notice that we do not do a string match. The Javadoc warns us that the ISO 639 standard is not stable; the codes are changing. Quoting:

Note: ISO 639 is not a stable standard— some languages' codes have changed. Locale's constructor recognizes both the new and the old codes for the languages whose codes have changed, but this function always returns the old code. If you want to check for a specific language whose code has changed, don't do

if (locale.getLanguage().equals("he")) // BAD!

Instead, do

if (locale.getLanguage().equals(new Locale("he").getLanguage())) // GOOD.

So our inner loop looks at each known 2-letter language code, and gets a Locale object for that language. Then our if statement compares the output of getLanguage for (a) our outer loop’s Locale, and (b) our inner loop’s generated language-only Locale (generated by our 2-letter code). In your case, you claim some device is outputting 3-letter code value for our call to getLanguage. But whether 2 or 3 letters, does not matter. We are just looking for a match.

Once instantiated, we can ask our LocaleLookup instance for the two-letter code matching a particular Locale by calling the lookupTwoLetterLanguageCode method.

LocaleLookup localeLookup = new LocaleLookup();
Locale locale = Locale.CANADA_FRENCH;
String code = localeLookup.lookupTwoLetterLanguageCode( locale );

System.out.println( "Locale: " + locale.toString() + " " + locale.getDisplayName( Locale.getDefault() ) + " | ISO 639-1 code: " + code );

Locale: fr_CA French (Canada) | ISO 639-1 code: fr

I'm just guessing at all this. I have not thought it through, nor have I tested any of this. So buyer-beware, this solution is worth every penny you paid for it. Good luck.

Here is the entire class, with a public static void main to use as demonstration.

package work.basil.example;

import java.util.*;

public class LocaleLookup
{
    private Map < Locale, String > mapLocaleToTwoLetterLangCode;

    public LocaleLookup ( )
    {
        this.mapLocaleToTwoLetterLangCode = new HashMap <>( Locale.getAvailableLocales().length );
        this.makeMaps();
        System.out.println( "mapLocaleToTwoLetterLangCode = " + mapLocaleToTwoLetterLangCode );
    }

    private void makeMaps ( )
    {
        // Get all locales.
        Set < Locale > locales = Set.of( Locale.getAvailableLocales() );


        // Get all languages, per 2-letter code.
        Set < String > twoLetterLanguageCodes = Set.of( Locale.getISOLanguages() ); // Returns: An array of ISO 639 two-letter language codes.

        for ( Locale locale : locales )
        {
            for ( String twoLetterLanguageCode : twoLetterLanguageCodes )
            {
                if ( locale.getLanguage().equals( new Locale( twoLetterLanguageCode ).getLanguage() ) )
                {
                    this.mapLocaleToTwoLetterLangCode.put( locale , twoLetterLanguageCode );
                    break;
                }
            }
        }
//        System.out.println( "locales = " + locales );
//        System.out.println( "twoLetterLanguageCodes = " + twoLetterLanguageCodes );
    }

    public String lookupTwoLetterLanguageCode ( final Locale locale )
    {
        String code = this.mapLocaleToTwoLetterLangCode.get( locale );
        Objects.requireNonNull( code );
        return code;
    }


    public static void main ( String[] args )
    {
        LocaleLookup localeLookup = new LocaleLookup();
        Locale locale = Locale.CANADA_FRENCH;
        String code = localeLookup.lookupTwoLetterLanguageCode( locale );

        System.out.println( "Locale: " + locale.toString() + " " + locale.getDisplayName( Locale.getDefault() ) + " | ISO 639-1 code: " + code );
    }
}

And here is the map I produce in a pre-release version of Java 15. Note this may be incorrect, as I have seen some goofiness with locales in the pre-release version.

mapLocaleToTwoLetterLangCode = {nn=nn, ar_JO=ar, bg=bg, zu=zu, am_ET=am, fr_DZ=fr, ti_ET=ti, bo_CN=bo, qu_EC=qu, ta_SG=ta, lv=lv, en_NU=en, en_MS=en, zh_SG_#Hans=zh, ff_LR_#Adlm=ff, en_GG=en, en_JM=en, vo=vo, sd__#Arab=sd, sv_SE=sv, sr_ME=sr, dz_BT=dz, es_BO=es, en_ZM=en, fr_ML=fr, br=br, ha_NG=ha, fa_AF=fa, ar_SA=ar, sk=sk, os_GE=os, ml=ml, en_MT=en, en_LR=en, ar_TD=ar, en_GH=en, en_IL=en, sv=sv, cs=cs, el=el, af=af, ff_MR_#Latn=ff, sw_UG=sw, tk_TM=tk, sr_ME_#Cyrl=sr, ar_EG=ar, sd__#Deva=sd, ji_001=yi, yo_NG=yo, se_NO=se, ku=ku, sw_CD=sw, vo_001=vo, en_PW=en, pl_PL=pl, ff_MR_#Adlm=ff, it_VA=it, sr_CS=sr, ne_IN=ne, es_PH=es, es_ES=es, es_CO=es, bg_BG=bg, ji=yi, ar_EH=ar, bs_BA_#Latn=bs, en_VC=en, nb_SJ=nb, es_US=es, en_US_POSIX=en, en_150=en, ar_SD=ar, en_KN=en, ha_NE=ha, pt_MO=pt, ro_RO=ro, zh__#Hans=zh, lb_LU=lb, sr_ME_#Latn=sr, es_GT=es, so_KE=so, ff_LR_#Latn=ff, ff_GH_#Latn=ff, fr_PM=fr, ar_KM=ar, no_NO_NY=no, fr_MG=fr, es_CL=es, mn=mn, tr_TR=tr, eu=eu, fa_IR=fa, en_MO=en, wo=wo, en_BZ=en, sq_AL=sq, ar_MR=ar, es_DO=es, ru=ru, az=az, su__#Latn=su, fa=fa, kl_GL=kl, en_NR=en, nd=nd, kk=kk, en_MP=en, az__#Cyrl=az, en_GD=en, tk=tk, hy=hy, en_BW=en, en_AU=en, en_CY=en, ta_MY=ta, ti_ER=ti, en_RW=en, sv_FI=sv, nd_ZW=nd, lb=lb, ne=ne, su=su, zh_SG=zh, en_IE=en, ln_CD=ln, en_KI=en, om_ET=om, no=no, ja_JP=ja, my=my, ka=ka, ar_IL=ar, ff_GH_#Adlm=ff, or_IN=or, fr_MF=fr, ms_ID=ms, kl=kl, en_SZ=en, zh=zh, es_PE=es, ta=ta, az__#Latn=az, en_GB=en, zh_HK_#Hant=zh, ar_SY=ar, bo=bo, kk_KZ=kk, tt_RU=tt, es_PA=es, om_KE=om, ar_PS=ar, fr_VU=fr, en_AS=en, zh_TW=zh, sd_IN=sd, fr_MC=fr, kw=kw, fr_NE=fr, pt_MZ=pt, ur_IN=ur, ln=ln, en_JE=en, ln_CF=ln, en_CX=en, pt=pt, en_AT=en, gl=gl, sr__#Cyrl=sr, es_GQ=es, kn_IN=kn, ff__#Adlm=ff, ar_YE=ar, en_SX=en, to=to, ga=ga, qu=qu, ru_KZ=ru, en_TZ=en, et=et, en_PR=en, jv=jv, ko_KP=ko, in=in, sn=sn, ps=ps, nl_SR=nl, en_BS=en, km=km, fr_NC=fr, be=be, gv=gv, es=es, gd_GB=gd, nl_BQ=nl, ff_GN_#Adlm=ff, fr_CM=fr, uz_UZ_#Cyrl=uz, pa_IN_#Guru=pa, en_KE=en, 
ja=ja, fr_SN=fr, or=or, fr_MA=fr, pt_LU=pt, ff_GM_#Adlm=ff, fr_BL=fr, en_NL=en, ln_CG=ln, te=te, sl=sl, ha=ha, mr_IN=mr, ko_KR=ko, el_CY=el, ku_TR=ku, es_MX=es, es_HN=es, hu_HU=hu, ff_SN=ff, sq_MK=sq, sr_BA_#Cyrl=sr, fi=fi, bs__#Cyrl=bs, uz=uz, et_EE=et, sr__#Latn=sr, en_SS=en, bo_IN=bo, sw=sw, fy_NL=fy, ar_OM=ar, tr_CY=tr, rm=rm, fr_BI=fr, en_MG=en, uz_UZ_#Latn=uz, bn=bn, de_IT=de, kn=kn, fr_TN=fr, sr_RS=sr, bn_BD=bn, de_CH=de, fr_PF=fr, gu=gu, pt_GQ=pt, en_ZA=en, en_TV=en, lo=lo, fr_FR=fr, en_PN=en, fr_BJ=fr, en_MH=en, zh__#Hant=zh, zh_HK_#Hans=zh, cu_RU=cu, nl_NL=nl, en_GY=en, ps_AF=ps, bs__#Latn=bs, ky=ky, os=os, bs_BA_#Cyrl=bs, nl_CW=nl, ar_DZ=ar, sk_SK=sk, pt_CH=pt, fr_GQ=fr, xh=xh, ki_KE=ki, am=am, fr_CI=fr, en_NG=en, ia_001=ia, en_PK=en, zh_CN=zh, en_LC=en, rw=rw, ff_BF_#Adlm=ff, wo_SN=wo, gv_IM=gv, iw=iw, en_TT=en, mk_MK=mk, sl_SI=sl, fr_HT=fr, te_IN=te, nl_SX=nl, ce=ce, fr_CG=fr, xh_ZA=xh, fr_BE=fr, ff_NE_#Adlm=ff, es_VE=es, mt_MT=mt, mr=mr, mg=mg, ko=ko, en_BM=en, nb_NO=nb, ak=ak, dz=dz, vi_VN=vi, en_VU=en, ia=ia, en_US=en, ff_SL_#Latn=ff, to_TO=to, ff_SN_#Adlm=ff, fr_BF=fr, pa__#Guru=pa, it_SM=it, su_ID=su, fr_YT=fr, gu_IN=gu, ii_CN=ii, ff_CM_#Latn=ff, pa_PK_#Arab=pa, fr_RE=fr, fi_FI=fi, ca_FR=ca, sr_BA_#Latn=sr, bn_IN=bn, fr_GP=fr, pa=pa, tg=tg, fr_DJ=fr, rn=rn, uk_UA=uk, ks__#Arab=ks, hu=hu, fr_CH=fr, en_NF=en, ff_GW_#Adlm=ff, ha_GH=ha, sr_XK_#Cyrl=sr, bm=bm, ar_SS=ar, en_GU=en, nl_AW=nl, de_BE=de, en_AI=en, en_CM=en, cs_CZ=cs, ca_ES=ca, tr=tr, ff_GW_#Latn=ff, rm_CH=rm, ru_MD=ru, ms_MY=ms, ta_LK=ta, en_TO=en, ff_SN_#Latn=ff, ff_SL_#Adlm=ff, cy=cy, en_PG=en, fr_CF=fr, pt_TL=pt, sq=sq, tg_TJ=tg, fr=fr, en_ER=en, qu_PE=qu, sr_BA=sr, es_PY=es, de=de, es_EC=es, ff_CM_#Adlm=ff, lg_UG=lg, ff_NE_#Latn=ff, zu_ZA=zu, fr_TG=fr, su_ID_#Latn=su, sr_XK_#Latn=sr, en_PH=en, ig_NG=ig, fr_GN=fr, zh_MO_#Hans=zh, lg=lg, ru_RU=ru, se_FI=se, ff=ff, en_DM=en, en_CK=en, sd=sd, ar_MA=ar, ga_IE=ga, en_BI=en, en_AG=en, fr_TD=fr, fr_LU=fr, en_WS=en, fr_CD=fr, so=so, rn_BI=rn, 
en_NA=en, mi_NZ=mi, ar_ER=ar, ms=ms, sn_ZW=sn, iw_IL=iw, ug=ug, es_EA=es, ga_GB=ga, th_TH_TH_#u-nu-thai=th, hi=hi, fr_SC=fr, ca_IT=ca, ff_NG_#Latn=ff, en_SL=en, no_NO=no, ca_AD=ca, ff_NG_#Adlm=ff, zh_MO_#Hant=zh, en_SH=en, qu_BO=qu, vi=vi, sd_PK_#Arab=sd, fr_CA=fr, de_LU=de, sq_XK=sq, en_KY=en, mi=mi, mt=mt, it_CH=it, de_DE=de, si_LK=si, en_AE=en, en_DK=en, so_DJ=so, eo=eo, lt_LT=lt, it_IT=it, en_ZW=en, ar_SO=ar, ro=ro, en_UM=en, ps_PK=ps, eo_001=eo, ee=ee, fr_MU=fr, nn_NO=nn, se_SE=se, pl=pl, en_TK=en, en_SI=en, ur=ur, uz__#Arab=uz, pt_GW=pt, se=se, lo_LA=lo, af_ZA=af, ar_LB=ar, ms_SG=ms, ee_TG=ee, ln_AO=ln, be_BY=be, ff_GN=ff, in_ID=in, es_BZ=es, ar_AE=ar, hr_HR=hr, as=as, it=it, pt_CV=pt, ks_IN=ks, uk=uk, my_MM=my, mn_MN=mn, ur_PK=ur, en_FM=en, da_DK=da, es_PR=es, en_BE=en, ii=ii, fr_WF=fr, tt=tt, ru_BY=ru, fo_DK=fo, ee_GH=ee, en_SG=en, ar_BH=ar, ff_GM_#Latn=ff, om=om, en_CH=en, hi_IN=hi, fo_FO=fo, yo_BJ=yo, fr_KM=fr, fr_MQ=fr, ff_GN_#Latn=ff, en_SD=en, es_AR=es, ff__#Latn=ff, en_MY=en, ja_JP_JP_#u-ca-japanese=ja, es_SV=es, pt_BR=pt, ml_IN=ml, en_FK=en, uz__#Cyrl=uz, is_IS=is, hy_AM=hy, en_GM=en, en_DG=en, fo=fo, ne_NP=ne, pt_ST=pt, hr=hr, ak_GH=ak, lt=lt, uz_AF_#Arab=uz, ta_IN=ta, fr_GF=fr, en_SE=en, zh_CN_#Hans=zh, es_419=es, is=is, pt_AO=pt, si=si, en_001=en, jv_ID=jv, en=en, es_IC=es, fr_MR=fr, ca=ca, ru_KG=ru, ar_TN=ar, ks=ks, zh_TW_#Hant=zh, ff_BF_#Latn=ff, bm_ML=bm, kw_GB=kw, ug_CN=ug, as_IN=as, es_BR=es, zh_HK=zh, sw_KE=sw, en_SB=en, th_TH=th, rw_RW=rw, ar_IQ=ar, en_MW=en, mk=mk, en_IO=en, pa__#Arab=pa, en_DE=en, ar_QA=ar, en_CC=en, ro_MD=ro, en_FI=en, bs=bs, pt_PT=pt, fy=fy, az_AZ_#Cyrl=az, th=th, es_CU=es, ar=ar, en_SC=en, en_VI=en, eu_ES=eu, en_UG=en, en_NZ=en, es_UY=es, sg_CF=sg, ru_UA=ru, sg=sg, uz__#Latn=uz, el_GR=el, da_GL=da, en_FJ=en, de_LI=de, en_BB=en, km_KH=km, hr_BA=hr, de_AT=de, nl=nl, lu_CD=lu, ca_ES_VALENCIA=ca, ar_001=ar, so_SO=so, lv_LV=lv, sd_IN_#Deva=sd, es_CR=es, ar_KW=ar, fr_GA=fr, ar_LY=ar, sr=sr, sr_RS_#Cyrl=sr, en_MU=en, da=da, 
gl_ES=gl, az_AZ_#Latn=az, en_IM=en, en_LS=en, ig=ig, en_HK=en, en_GI=en, ce_RU=ce, gd=gd, en_CA=en, ka_GE=ka, fr_SY=fr, sw_TZ=sw, so_ET=so, fr_RW=fr, nl_BE=nl, ar_DJ=ar, mg_MG=mg, en_VG=en, cy_GB=cy, cu=cu, sr_RS_#Latn=sr, os_RU=os, en_TC=en, sv_AX=sv, ky_KG=ky, af_NA=af, lu=lu, en_IN=en, yo=yo, ki=ki, es_NI=es, nb=nb, sd_PK=sd, ti=ti, ms_BN=ms, br_FR=br}

Substring of `Locale.toString`?

Now, after having done all that work, I notice that the toString representation of the locale name starts with the two-letter language code!

If this is always the case for all Locale objects, we can simply parse that string.

String twoLetterLanguageCode = Locale.CANADA_FRENCH.toString().split( "_" )[ 0 ];

twoLetterLanguageCode = fr

For country code, do the same, but pull the second part. Use an index value of 1 versus 0.

String twoLetterCountryCode = Locale.CANADA_FRENCH.toString().split( "_" )[ 1 ];
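
One caveat with indexing straight into the split array: a language-only locale such as `Locale.FRENCH` prints as just `fr`, so asking for index 1 would throw an `ArrayIndexOutOfBoundsException`. A slightly more defensive version might look like the sketch below. The class name `LocaleStringParts` and the method `part` are hypothetical names of my own, and returning an empty string for a missing part is just one possible choice.

```java
import java.util.Locale;

// Hypothetical helper for illustration.
class LocaleStringParts {
    // Returns the requested underscore-separated part of Locale.toString(),
    // or an empty string when the locale has no such part.
    static String part(Locale locale, int index) {
        String[] parts = locale.toString().split("_");
        return index < parts.length ? parts[index] : "";
    }

    public static void main(String[] args) {
        System.out.println(part(Locale.CANADA_FRENCH, 0));     // fr
        System.out.println(part(Locale.CANADA_FRENCH, 1));     // CA
        System.out.println(part(Locale.FRENCH, 1).isEmpty());  // true, no country part
    }
}
```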

In this quick check on my pre-release Java 15, it does seem to be the case that every Locale object’s toString text starts with the 2-letter language code. But I do not know if you can count on that always being the case in the past and in the future.

System.out.println( Locale.getAvailableLocales().length );
ArrayList < Locale > problemLocales = new ArrayList <>( Locale.getAvailableLocales().length );
for ( Locale locale : Locale.getAvailableLocales() )
{
    String parsed = locale.toString().split( "_" )[ 0 ];
    if ( ! parsed.equalsIgnoreCase( locale.getLanguage() ) )
    {
        problemLocales.add( locale );
    }
}

System.out.println( "problemLocales = " + problemLocales );

problemLocales = []

Or, vice-versa:

System.out.println( "Locale.getAvailableLocales().length: " + Locale.getAvailableLocales().length );
ArrayList < Locale > matchingLocales = new ArrayList <>( Locale.getAvailableLocales().length );
for ( Locale locale : Locale.getAvailableLocales() )
{
    String parsed = locale.toString().split( "_" )[ 0 ];
    if ( parsed.equalsIgnoreCase( locale.getLanguage() ) )
    {
        matchingLocales.add( locale );
    }
}

System.out.println( "matchingLocales.size: " + matchingLocales.size() );
System.out.println( "matchingLocales = " + matchingLocales );

Locale.getAvailableLocales().length: 810

matchingLocales.size: 810

I read the documentation too and noticed this example, but I specified that I need a corresponding Locale object with both language and country codes in two-letter format. The loop you propose generates a Locale object from the language code only, whereas I need a Locale object with language and country. This is because, for example, the English language can be from the USA or the UK, which are two different Locale objects.... — Suppaman, Jul 19 '20 at 16:21
@Suppaman Your Comment seems to contradict your Question. In your Question, you said, "both methods Locale::getLanguage and Locale::getISO3Language of each Locale object return the same 3-character code (ISO 639-2)" while you wanted to "return the language code in 2-character format (ISO 639-1)" for a particular locale. I added code to my Answer to do just that. But now in your Comment, you are talking about country codes, so I am confused. — Basil Bourque, Jul 19 '20 at 19:08
First of all, thank you for taking so much time to write such a long reply. I probably didn't write my first message clearly. In it, after explaining the problem with the language code, I also wrote "Same for country code", meaning that the country code is also returned in three-letter format. However, your example code gave me an idea for a possible "conversion" code that I'll try to develop. Thank you again for now. — Suppaman, Jul 20 '20 at 19:55
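
The "conversion" idea from the comment above, producing a Locale with both language and country in two-letter form, might be sketched roughly as follows. This is an untested-on-device illustration, not anyone's actual solution from this thread. The names `TwoLetterLocales` and `normalize` are hypothetical, and the sketch assumes the device reports ISO 639-2 language codes and ISO 3166-1 alpha-3 country codes matching what `getISO3Language` and `getISO3Country` return. It also drops any variant, script, or extension on the input locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper for illustration; not part of java.util.
@SuppressWarnings("deprecation") // the Locale constructors are deprecated on newer JDKs but still work
class TwoLetterLocales {
    private static final Map<String, String> LANGUAGES = new HashMap<>(); // e.g. "eng" -> "en"
    private static final Map<String, String> COUNTRIES = new HashMap<>(); // e.g. "USA" -> "US"

    static {
        for (String language : Locale.getISOLanguages()) {
            try {
                LANGUAGES.putIfAbsent(new Locale(language).getISO3Language(), language);
            } catch (MissingResourceException e) {
                // No three-letter language code available; skip.
            }
        }
        for (String country : Locale.getISOCountries()) {
            try {
                COUNTRIES.putIfAbsent(new Locale("", country).getISO3Country(), country);
            } catch (MissingResourceException e) {
                // No three-letter country code available; skip.
            }
        }
    }

    // Builds a language+country Locale, replacing three-letter codes with their
    // two-letter equivalents when one is known. Unknown codes are kept as they are.
    static Locale normalize(Locale locale) {
        String language = locale.getLanguage();
        String country = locale.getCountry();
        if (language.length() == 3) {
            language = LANGUAGES.getOrDefault(language, language);
        }
        if (country.length() == 3) {
            country = COUNTRIES.getOrDefault(country, country);
        }
        return new Locale(language, country);
    }

    public static void main(String[] args) {
        System.out.println(normalize(new Locale("fra", "CAN"))); // fr_CA
        System.out.println(normalize(Locale.UK));                // en_GB (already two-letter)
    }
}
```

Numeric area codes such as `419` are three characters long but have no entry in the country table, so they pass through unchanged.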
