
How do I get the code page for each locale (not only for my locale)?

I'm looking for a simple function in Python, C#, or C (preferably Python or C#) that, for any locale I choose, finds its ANSI code page and its OEM code page.

Thomas Dickey

1 Answer


In C, starting with Windows Vista, you can query information for a locale name via GetLocaleInfoEx. The locale information constants to query the ANSI and OEM codepages are, respectively, LOCALE_IDEFAULTANSICODEPAGE (0x1004) and LOCALE_IDEFAULTCODEPAGE (0x000B). You can enumerate all system locale names via EnumSystemLocalesEx and query the ANSI and OEM codepages for each locale in the callback.

In a Python script, you can call these functions via ctypes. For example:

import ctypes
from ctypes import c_int
from ctypes.wintypes import BOOL, DWORD, LPVOID, LPWSTR, LPARAM, WCHAR

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

CP_ACP = 0
CP_OEMCP = 1
LOCALE_NAME_USER_DEFAULT = None
LOCALE_NAME_SYSTEM_DEFAULT = "!x-sys-default-locale"
LOCALE_RETURN_NUMBER = 0x20000000
LOCALE_IDEFAULTCODEPAGE = 0x0000000B
LOCALE_IDEFAULTANSICODEPAGE = 0x00001004
LOCALE_SENGLISHLANGUAGENAME = 0x00001001
LOCALE_SENGLISHCOUNTRYNAME = 0x00001002

LOCALE_ENUMPROCEX = ctypes.WINFUNCTYPE(BOOL, 
    LPWSTR, # lpLocaleString
    DWORD,  # dwFlags
    LPARAM) # lParam

def _check_zero(result, func, args):
    if not result:
        raise ctypes.WinError(ctypes.get_last_error())
    return args

kernel32.EnumSystemLocalesEx.errcheck = _check_zero
kernel32.EnumSystemLocalesEx.argtypes = (
    LOCALE_ENUMPROCEX, # lpLocaleEnumProcEx
    DWORD,             # dwFlags
    LPARAM,            # lParam
    LPVOID)            # lpReserved

LCTYPE = DWORD
kernel32.GetLocaleInfoEx.errcheck = _check_zero
kernel32.GetLocaleInfoEx.argtypes = (
    LPWSTR, # lpLocaleName,
    LCTYPE, # LCType,
    LPVOID, # lpLCData,
    c_int)  # cchData

def get_language(locale=LOCALE_NAME_SYSTEM_DEFAULT):
    length = kernel32.GetLocaleInfoEx(locale, LOCALE_SENGLISHLANGUAGENAME, 
        None, 0)
    language = (WCHAR * length)()
    kernel32.GetLocaleInfoEx(locale, LOCALE_SENGLISHLANGUAGENAME, 
        language, length)
    return language.value

def get_country(locale=LOCALE_NAME_SYSTEM_DEFAULT):
    length = kernel32.GetLocaleInfoEx(locale, LOCALE_SENGLISHCOUNTRYNAME, 
        None, 0)
    country = (WCHAR * length)()
    kernel32.GetLocaleInfoEx(locale, LOCALE_SENGLISHCOUNTRYNAME, 
        country, length)
    return country.value

def get_acp(locale=LOCALE_NAME_SYSTEM_DEFAULT):
    cp_ansi = DWORD()
    kernel32.GetLocaleInfoEx(locale, LOCALE_IDEFAULTANSICODEPAGE | 
        LOCALE_RETURN_NUMBER, ctypes.byref(cp_ansi), 
        ctypes.sizeof(cp_ansi) // ctypes.sizeof(WCHAR))
    return cp_ansi.value

def get_oemcp(locale=LOCALE_NAME_SYSTEM_DEFAULT):
    cp_oem = DWORD()
    kernel32.GetLocaleInfoEx(locale, LOCALE_IDEFAULTCODEPAGE | 
        LOCALE_RETURN_NUMBER, ctypes.byref(cp_oem), 
        ctypes.sizeof(cp_oem) // ctypes.sizeof(WCHAR))
    return cp_oem.value

def list_system_locales():
    system_locales = []
    @LOCALE_ENUMPROCEX
    def enum_cb(locale, flags, param):
        system_locales.append((locale, 
            get_language(locale), get_country(locale), 
            get_acp(locale), get_oemcp(locale)))
        return True
    kernel32.EnumSystemLocalesEx(enum_cb, 0, 0, None)
    return sorted(system_locales)
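
As a rough sketch of how the tuples returned by `list_system_locales` might be displayed with the full language and country names, something like the following could work. The helper name `format_locale` and the hand-written sample rows are illustrative assumptions, not part of the code above; on Windows you would pass the real list instead.

```python
def format_locale(row):
    """Render one (name, language, country, acp, oemcp) tuple as text."""
    name, language, country, acp, oemcp = row
    return f"{language} ({country}) [{name}]: ANSI={acp}, OEM={oemcp}"

# Hand-written sample rows in the same shape that list_system_locales()
# returns (an assumption for illustration; not captured from a real run).
sample_rows = [
    ("en-US", "English", "United States", 1252, 437),
    ("hi-IN", "Hindi", "India", 0, 1),
]

for row in sample_rows:
    print(format_locale(row))
```

On a Windows system, `for row in list_system_locales(): print(format_locale(row))` would print one such line per enumerated locale.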

Note that Unicode-only locales do not have an ANSI or OEM codepage. In this case the values returned are the special constants CP_ACP (0) and CP_OEMCP (1), which stand for the current system ANSI and OEM codepages. For example, the Hindi (hi) language in India (IN) is a Unicode-only locale:

>>> (get_acp('hi-IN'), get_oemcp('hi-IN')) == (CP_ACP, CP_OEMCP)
True
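
If you need a concrete codepage number instead of the placeholders, one possible approach is to substitute the system values yourself. The sketch below is only an illustration: `resolve_codepage` and `get_system_codepages` are hypothetical helper names, and the latter assumes the `GetACP` and `GetOEMCP` functions exported by kernel32, so it only works on Windows.

```python
import sys

CP_ACP = 0    # placeholder meaning "the system ANSI codepage"
CP_OEMCP = 1  # placeholder meaning "the system OEM codepage"

def resolve_codepage(cp, system_acp, system_oemcp):
    """Replace the CP_ACP / CP_OEMCP placeholders with real codepage numbers."""
    if cp == CP_ACP:
        return system_acp
    if cp == CP_OEMCP:
        return system_oemcp
    return cp  # already a concrete codepage such as 1252

def get_system_codepages():
    """Return the (ANSI, OEM) codepages of the system locale; Windows only."""
    if sys.platform != "win32":
        raise OSError("GetACP/GetOEMCP are only available on Windows")
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32")
    return kernel32.GetACP(), kernel32.GetOEMCP()
```

For example, on a system whose system locale uses codepages 1252 and 437, `resolve_codepage(0, 1252, 437)` gives 1252 and `resolve_codepage(1, 1252, 437)` gives 437, while a concrete value such as 950 is passed through unchanged.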
Eryk Sun
  • Thank you so much! Is there any way to make the list that `list_system_locales` returns show the full country and language names, as in the Windows locale window, rather than codes like `ff-NG` or `en-MG`? And is that list the same for all Windows versions, or does it depend on something? – g319909.nwytg.coM Oct 05 '18 at 12:49
  • `EnumSystemLocalesEx` enumerates the locales available on the current system; it won't be the same on all systems. Also, the enumeration can be narrowed down. I had it enumerate all locales, including variants with an alternate sort. When I have time, I'll add a function to get the full language and country name. – Eryk Sun Oct 05 '18 at 14:54
  • I'd appreciate seeing the full language and country. What I mean is: if I see that `ff-NG` uses ANSI codepage X and OEM codepage Y, is that true in all Windows versions, or might another version use a different codepage? – g319909.nwytg.coM Oct 06 '18 at 16:38
  • I don't know the deep history of codepage assignment way back to Windows NT 3.1 in 1993. Codepages are legacy locale data, so I think they've been stable since Windows Vista at least. My updated answer requires Windows 7+. Locale names were introduced in Vista, in place of locale IDs (LCIDs), but the information constants I'm using for the English names of languages and countries were introduced in Windows 7. – Eryk Sun Oct 06 '18 at 16:50
  • (1) When I run this code on Windows 7 I get a list with 360 items, and on Windows 10 I get a list with 850 items, so are there locales that exist in Windows 10 but not in Windows 7? (2) For 'Embu', 'Kenya' I see ANSI codepage 0. Is that wrong? The list at en.wikipedia.org/wiki/Code_page shows all the codepages that Windows has, and I don't see codepage 0 there. Thanks a lot! – g319909.nwytg.coM Oct 15 '18 at 20:15
  • (1) The difference is probably a combination of completely new locales plus alternate-sort variants. (2) I already addressed Unicode-only locales in the answer, with an example of "hi-IN". Codepage 0 is `CP_ACP`, a special value that means the ANSI codepage of the current system locale, which cannot be set to a Unicode-only locale. – Eryk Sun Oct 15 '18 at 20:17
  • What do you mean by `the ANSI codepage of the current system locale, which cannot be set to a Unicode-only locale`? If I try to encode with ANSI on a Windows PC whose locale is "hi-IN", what will happen? Which ANSI codepage will Windows use? – g319909.nwytg.coM Oct 17 '18 at 15:59
  • Older versions of Windows didn't let you set the system locale to hi-IN (Hindi, India) because the Devanagari codepage (57002 ISCII) [isn't compatible](http://archives.miloush.net/michkap/archive/2005/10/28/486232.html) with the WinAPI ANSI system. In Windows 10, it's now possible to set the system locale to Hindi, since the system can use codepage 65001 (UTF-8) as the ANSI and OEM codepages, a new feature that's still being beta tested. – Eryk Sun Oct 17 '18 at 16:28
  • When I run your Python script on Windows 10, the ANSI codepage I get for hi-IN is 0, so maybe I need to upgrade my Windows and then I will see 57002? That's strange, because at https://en.wikipedia.org/wiki/Code_page#Windows_code_pages I see that Windows has only 10 codepages: 874, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258. – g319909.nwytg.coM Oct 17 '18 at 16:40
  • If you get `CP_ACP` (0) for the ANSI codepage or `CP_OEMCP` (1) for the OEM codepage, it's clearly documented (e.g. see [`WideCharToMultiByte`](https://learn.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-widechartomultibyte)) that these values will use the ANSI and OEM codepages of the system locale (set in Region->Administrative). On systems prior to Windows 10, which do not support using UTF-8 for the ANSI and OEM codepages, the system locale must be something else (e.g. "en-IN" with ANSI and OEM codepages 1252 and 437). – Eryk Sun Oct 17 '18 at 16:44
  • It's not due to your system that "hi-IN" has no specific ANSI or OEM codepage defined. Again, the implementation of codepage 57002 (ISCII Devanagari) is not compatible with the legacy ANSI/OEM system used by the Windows API, which makes "hi-IN" a Unicode-only locale. In Windows 10, the applet that sets the system locale sees that "hi-IN" is a Unicode-only locale, so it enables the (still in beta) support for setting the ANSI and OEM codepages to 65001 (UTF-8, a Unicode encoding). – Eryk Sun Oct 17 '18 at 16:48
  • I understand that "hi-IN" has no ANSI codepage and is a Unicode-only locale, but if on Windows 7 or Windows 10 I set the locale to "hi-IN" and use a function that encodes with ANSI, what will happen? Will I get an exception? Or will all the bytes from 0x00 to 0xff stay as they were and not be replaced, because they are not mapped in the current codepage? – g319909.nwytg.coM Oct 17 '18 at 16:54
  • In Windows 7, we cannot set the system locale to "hi-IN". As I said, it must be set to something else such as "en-IN". When you query the ANSI codepage for "hi-IN", on all systems it will return `CP_ACP` (0). When you use this special codepage value with `WideCharToMultiByte`, it uses the system locale codepage, whatever that is. You can query it via `GetACP`. – Eryk Sun Oct 17 '18 at 16:58
  • Thank you for all of your replies! Just to understand: at https://en.wikipedia.org/wiki/Code_page#Windows_code_pages I see that Windows has only 10 codepages (874, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258), so why do I see that for `zh-TW_pronun` the ANSI codepage is 950, or that for `zh-SG` the ANSI codepage is 936? – g319909.nwytg.coM Dec 05 '18 at 12:07
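
The codepage numbers that come up in the comments above can be tried out from Python, which ships codecs named after many Windows codepages. This is only a sketch: the helper `encode_with_codepage` is a hypothetical name, and it assumes Python's `cpNNN` codecs are close enough to the corresponding Windows codepages for illustration. It shows that the ANSI (1252) and OEM (437) codepages encode the same character differently, and that 936 and 950 are double-byte East Asian codepages outside the 125x range.

```python
def encode_with_codepage(text, codepage):
    """Encode text with Python's codec for a numeric Windows codepage."""
    return text.encode(f"cp{codepage}")

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
zhong = "\u4e2d"    # CJK ideograph meaning "middle"

# The same character maps to different bytes in the ANSI and OEM codepages.
print(encode_with_codepage(e_acute, 1252))
print(encode_with_codepage(e_acute, 437))

# Codepages 950 and 936 need two bytes for a CJK character.
print(len(encode_with_codepage(zhong, 950)))
print(len(encode_with_codepage(zhong, 936)))
```

Note that this uses Python's own codec tables rather than `WideCharToMultiByte`, so it does not depend on the system locale and cannot interpret the `CP_ACP` (0) or `CP_OEMCP` (1) placeholders.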