How to find if a character belongs to a particular codepage using C++ or calling WinAPI

Question

How can we find out whether a character belongs to a particular codepage? Or, how can we determine whether a character fits the currently active IME for an application?

You need to define 'character'. Do you mean you have a UTF-16 or UTF-8 multibyte character, and you want to know if that translates to a point in a given Windows code page? — richb, Mar 10 '10 at 12:18
Yes, that is right. The character could be a UTF-8 character, and I need to find out whether it translates to a codepoint in a given Windows codepage. — Prakash, Mar 10 '10 at 12:31

score 3 · Answer 1 · answered Mar 10 '10 at 14:02

First, convert your UTF-8 string to UTF-16 using MultiByteToWideChar.
Then reverse the process using WideCharToMultiByte, passing the desired codepage as the first parameter.

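Pass the WC_NO_BEST_FIT_CHARS flag so that WideCharToMultiByte does not silently substitute a similar-looking character. To find out whether any character could not be represented in the target codepage, use the lpDefaultChar and lpUsedDefaultChar parameters: the call still succeeds, but lpUsedDefaultChar is set to TRUE. Note that WC_ERR_INVALID_CHARS does not help here. It is only valid when the target is CP_UTF8 (or 54936), and it reports invalid UTF-16 input rather than characters missing from the codepage.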

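LPCWSTR pszUtf16; // must point to the null-terminated UTF-16 string converted from the UTF-8 source
UINT nTargetCP = CP_ACP;
BOOL fBadCharacter = FALSE;
// cchWideChar = -1: the input is null-terminated; NULL/0 output buffer: only compute the required size
if (WideCharToMultiByte(nTargetCP, WC_NO_BEST_FIT_CHARS, pszUtf16, -1, NULL, 0, NULL, &fBadCharacter))
{
  if (fBadCharacter)
  {
    // at least one character in the string was not represented in nTargetCP
  }
}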

score 2 · Answer 2 · answered Mar 11 '10 at 05:34

The two previous answers have correctly suggested using MultiByteToWideChar and then WideCharToMultiByte to translate your UTF-8 character to UTF-16, and then to the current Windows codepage (CP_ACP). Check the lpUsedDefaultChar output of WideCharToMultiByte, not only its return value, to see whether the conversion was lossless.

What wasn't clear from the original question is that you are having a particular issue with Hindi. For this language, your question is meaningless because there is no Windows ANSI codepage for Hindi, as Chris Becke pointed out. Therefore, you can never convert a Hindi character to CP_ACP: WideCharToMultiByte will always substitute the default character and report that through lpUsedDefaultChar.

To use Hindi on Windows, as far as I understand it, you must be a Unicode app that calls Unicode APIs.

score 0 · Answer 3 · answered Mar 10 '10 at 13:36

Using the Windows functions WideCharToMultiByte and MultiByteToWideChar, you can convert between UTF-8 and UTF-16 (16-bit Unicode) characters. The functions take arguments that specify the code page and the behavior when an invalid character is encountered.

answered by Patrick

Thanks. Yes, you are right: I was using the LPBOOL lpUsedDefaultChar parameter of WideCharToMultiByte() to determine the same. However, for the Hindi IME, which has code page 0, lpUsedDefaultChar always comes back true. [Not sure how my previous comment got removed :( but I had mentioned it in detail there] – Prakash Mar 10 '10 at 13:44

score 0 · Answer 4 · edited Jul 07 '10 at 14:04

Thanks, Chris. I am running the following code:

#include <windows.h>
#include <stdio.h>

#define CP_HINDI 0       /* note: 0 is the same value as CP_ACP */
#define CP_JAPANESE 932
#define CP_ENGLISH 1252

/* Null-terminated wide strings, so that cchWideChar = -1 is valid below. */
const wchar_t wcsStringJapanese[] = L"\u3042"; /* あ */
const wchar_t wcsStringHindi[]    = L"\u0930"; /* र */
const wchar_t wcsStringEnglish[]  = L"A";

int main()
{
    BOOL usedDefaultCharacter = FALSE;

    /* Test for ENGLISH */
    WideCharToMultiByte( CP_ENGLISH,
                        0,
                        wcsStringEnglish,
                        -1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for English? %d \n", usedDefaultCharacter);

    /* Test for JAPANESE */
    usedDefaultCharacter = FALSE;

    WideCharToMultiByte( CP_JAPANESE,
                        0,
                        wcsStringJapanese,
                        -1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for Japanese? %d \n", usedDefaultCharacter);

    /* Test for HINDI */
    usedDefaultCharacter = FALSE;

    WideCharToMultiByte( CP_HINDI,
                        0,
                        wcsStringHindi,
                        -1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for Hindi? %d \n", usedDefaultCharacter);

    return 0;
}

The above code returns:

usedDefaultCharacters for English? 0

usedDefaultCharacters for Japanese? 0

usedDefaultCharacters for Hindi? 1

The third line looks incorrect to me: the codepage for Hindi is 0 and the string passed consists of a Hindi character, yet usedDefaultChar is still set to 1, which should not be the case.

The codepage for Hindi is NOT zero. Hindi is one of the new 'Unicode only' localizations. There is no actual Windows ANSI codepage for representing Hindi characters. Refer to this page: http://msdn.microsoft.com/en-us/goglobal/bb688174.aspx — Chris Becke, Mar 10 '10 at 15:18
So is there any value that I can give for the "codepage" parameter of WideCharToMultiByte to find out whether the current encoding supports the Hindi character? Or is there a way (in C++) to find out whether the current encoding of the page is Unicode? Thanks. — Prakash, Mar 10 '10 at 17:03
