
I'm a newbie with C++ and I've taken over a COM project to fix some issues. The current issue I'm working on is handling UTF8 strings. I have this piece of code:

// CString strValue;
CStringW strValue; 
CComVariant* val = &(*result)[i].minValue;
switch (val->vt)
{
case VT_BSTR:   
    //strValue = OLE2CA(val->bstrVal);
    strValue = OLE2W(val->bstrVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = \"" + strValue + "\""; // fails
    break;
case VT_R8:     
    //strValue.Format("%g", val->dblVal);
    strValue.Format(L"%g", val->dblVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = " + strValue; // fails
    break;
case VT_I4:     
    //strValue.Format("%i", val->lVal);
    strValue.Format(L"%i", val->lVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = " + strValue; // fails
    break;
}

struct CategoriesData
{
    public:
    CComVariant minValue;
    CComVariant maxValue;
    //CString expression;
    CStringW expression;
    //CString name;
    CStringW name;
    tkCategoryValue valueType;
    int classificationField;
    bool skip;
};

The problem is with this line: strValue = OLE2CA(val->bstrVal); When val->bstrVal is a Unicode string, such as the Russian text Воздух, strValue is converted into ?????.

I tried several approaches and searched the internet, but I can't get strValue to be Воздух. Can a CString contain this kind of text, or should I change to another type? If so, which one?

minValue can be a VT_BSTR, a VT_R8 or a VT_I4.

These are the options I tried so far:

strValue = val->bstrVal;
strValue = Utility::ConvertFromUtf8(val->bstrVal);
strValue = Utility::ConvertToUtf8(val->bstrVal);
temp = Utility::ConvertBSTRToLPSTR(val->bstrVal);
strValue = W2BSTR(Utility::ConvertFromUtf8(temp));
strValue = W2BSTR(val->bstrVal);                
strValue = CW2A(val->bstrVal);
strValue = (CString)val->bstrVal;
strValue = Utility::ConvertToUtf8(OLE2W(val->bstrVal));

Edit The code for the helper functions:

CStringA ConvertToUtf8(CStringW unicode) {
    USES_CONVERSION;
    CStringA utf8 = CW2A(unicode, CP_UTF8);
    return utf8;
}

CStringW ConvertFromUtf8(CStringA utf8) {
    USES_CONVERSION;
    CStringW unicode = CA2W(utf8, CP_UTF8);
    return unicode;
}

char* ConvertBSTRToLPSTR (BSTR bstrIn)
{
  LPSTR pszOut = NULL;
  if (bstrIn != NULL)
  {
    int nInputStrLen = SysStringLen (bstrIn);

    // Double NULL Termination
    int nOutputStrLen = WideCharToMultiByte(CP_ACP, 0, bstrIn, nInputStrLen, NULL, 0, 0, 0) + 2; 
    pszOut = new char [nOutputStrLen];

    if (pszOut)
    {
      memset (pszOut, 0x00, sizeof (char)*nOutputStrLen);
      WideCharToMultiByte (CP_ACP, 0, bstrIn, nInputStrLen, pszOut, nOutputStrLen, 0, 0);
    }
  }
  return pszOut;
}

Edit2 I added my complete switch statement. When I change strValue from CString to CStringW, I get errors for the other cases, like strValue.Format("%g", val->dblVal); How can I solve this?

Edit3 I already fixed a similar issue, but that was converting to a VARIANT, not from one:

    val->vt = VT_BSTR;
    const char* v = DBFReadStringAttribute(_dbfHandle, _rows[RowIndex].oldIndex, _fields[i]->oldIndex);
    // Old code, not unicode ready:
    //WCHAR *buffer = Utility::StringToWideChar(v);
    //val->bstrVal = W2BSTR(buffer);
    //delete[] buffer;              
    // New code, unicode friendly:
    val->bstrVal = W2BSTR(Utility::ConvertFromUtf8(v)); 

Edit4 Thanks to all the help so far I managed to make some changes. I've updated my initial code in this post and added all code of the function. I'm now stuck with this line:

 (*result)[i].expression = "[" + fieldName + "] = \"" + strValue + "\"";    

I can't concatenate CStringW values.

Some more background info: The function is part of MapWinGIS, an open-source GIS application in which you can show maps (shapefiles). These maps have attribute data, which is stored in dBase IV format and can hold Unicode/UTF-8 text. I already made a fix (see Edit3) to show this text properly in a grid view. The function I'm struggling with now categorizes (groups) the data so that, for example, similar values get the same color. Each category has a name and an expression, and the expression is parsed later to do the actual grouping. For example, I have a map with states and I want to give each state a different color. As mentioned before, I'm new to C++ and am really outside my comfort zone. I really appreciate all the help you have given me, and I hope you will help me once more.

Paul Meems
  • What does bstrVal contain initially? A UTF8 string? How did you put it in there in the first place? Windows doesn't have a native UTF8 string type, so you have to convert somehow, and it can work, so how is Utility::ConvertToUtf8 coded? – Simon Mourier Aug 03 '17 at 12:53
  • I added the code for the helper functions – Paul Meems Aug 03 '17 at 13:58
  • BSTR is stored as UTF-16 , you shouldn't be trying any UTF8 stuff. (Well I guess it's possible someone copied UTF-8 bytes into a BSTR but that would be exceptionally awful) – M.M Aug 03 '17 at 22:02
  • The answer depends on whether you are in a Unicode project or not. If you are, then `CString` means `CStringW` and you would not use any UTF-8 conversion; otherwise it means `CStringA` and you would. IMO it is preferable to use a Unicode project, but if you're working on legacy code that was initially built as non-Unicode then you might be stuck with that. Please clarify which case it is for you, since the answer is different in each. (You could support both via an overloaded conversion function) – M.M Aug 03 '17 at 22:15
  • If it is a unicode project then it's unclear to me what you're trying to do with UTF-8 in a CStringW – M.M Aug 03 '17 at 22:15

3 Answers


BSTRs "naturally" store Unicode UTF-16 length-prefixed strings, although you could "stretch" a BSTR and store with it a more generic length-prefixed sequence of bytes (but I don't like this usage).

(For more details on BSTRs, you will find this blog post by Eric Lippert very interesting.)

So, I'm considering the normal usage of BSTR, which is storing length-prefixed UTF-16 strings.

If you want to convert a UTF-16 string stored in a BSTR to a UTF-8 string, you can use the WideCharToMultiByte Win32 API with the CP_UTF8 code page identifier (see e.g. this MSDN Magazine article for details, and this reusable code on GitHub).

You can store the destination UTF-8 string in instances of the std::string class.
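To make the expected result concrete, here is a minimal portable sketch of the UTF-16 to UTF-8 encoding itself. It is not the Win32 API: it is a hand-written stand-in that should yield the same bytes as WideCharToMultiByte with CP_UTF8 for well-formed input, which makes it handy for checking what a correct conversion looks like. The name Utf8FromUtf16Portable is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ATL or Win32): encodes UTF-16 as UTF-8.
std::string Utf8FromUtf16Portable(const std::u16string& utf16) {
    std::string utf8;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        std::uint32_t cp = utf16[i];

        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < utf16.size()) {
            const std::uint32_t low = utf16[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        // Assumption of this sketch: an unpaired surrogate becomes U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            // One byte: plain ASCII.
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // Two bytes: covers Cyrillic, among others.
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Three bytes: rest of the Basic Multilingual Plane.
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Four bytes: code points encoded as surrogate pairs in UTF-16.
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
}
```

Each Cyrillic letter takes two bytes in UTF-8, so the six letters of Воздух become twelve bytes. Any tool that treats those bytes as a single-byte ANSI code page will therefore show twelve unrelated characters instead of six letters.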

P.S. If you want to use CStringW for UTF-16 and CStringA for UTF-8 strings, and the ATL CW2A helper for UTF-16/8 conversions, note that you don't need the USES_CONVERSION macro in your code; and you could just take input strings by const& (const reference) as good code hygiene:

CStringA Utf8FromUtf16(const CStringW &utf16) {
    CStringA utf8 = CW2A(utf16, CP_UTF8);
    return utf8;
}
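When checking the result of such a conversion, it may help to look at the raw bytes rather than at how a debugger renders them. A watch window that interprets UTF-8 bytes through an ANSI code page shows text like "Вода" for Вода even though the bytes are correct. The sketch below is a small byte dumper; HexDump is a hypothetical name, not an ATL or Win32 function.

```cpp
#include <string>

// Hypothetical helper: renders each byte of a narrow string as two
// uppercase hex digits, separated by single spaces.
std::string HexDump(const std::string& bytes) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : bytes) {
        if (!out.empty()) {
            out += ' ';
        }
        out += kDigits[byte >> 4];    // high nibble
        out += kDigits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

With a CStringA named utf8, the call could look like HexDump(std::string(utf8.GetString(), utf8.GetLength())). For a correctly converted Вода the dump should read D0 92 D0 BE D0 B4 D0 B0; a run of 3F bytes would instead mean the text had already been replaced by question marks before the conversion.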

RE Edit 2

Try strValue.Format(L"%g",... with CStringW. The L prefix generates a Unicode UTF-16 string literal for CStringW::Format.
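The same rule applies outside ATL: a wide formatting function needs a wide format string. As a rough portable sketch, assuming std::wstring in place of CStringW and std::swprintf in place of CStringW::Format, the two numeric cases could be written as below. FormatDouble and FormatLong are hypothetical names. Note that %ld is used for the long value; the original %i only matches because LONG is 32 bits wide on Windows.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: formats a double like the VT_R8 case, using a
// wide format string (note the L prefix on the literal).
std::wstring FormatDouble(double value) {
    wchar_t buffer[64];
    const int written = std::swprintf(buffer, 64, L"%g", value);
    // swprintf returns a negative value on failure or truncation.
    return written > 0
        ? std::wstring(buffer, static_cast<std::size_t>(written))
        : std::wstring();
}

// Hypothetical helper: formats a long like the VT_I4 case.
std::wstring FormatLong(long value) {
    wchar_t buffer[32];
    const int written = std::swprintf(buffer, 32, L"%ld", value);
    return written > 0
        ? std::wstring(buffer, static_cast<std::size_t>(written))
        : std::wstring();
}
```

Passing a narrow "%g" to either call would not compile, which is the same kind of error the CStringW version reports.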

RE Edit 4

I replied to that in the comments, but for the sake of completeness, to concatenate string literals with CStringW instances, consider decorating these literals with L"...": this defines a Unicode UTF-16 string literal, which is wchar_t-based, and works fine with CStringW objects.

(*result)[i].expression = L"[" + fieldName + L"] = \"" + strValue + L"\"";    
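For comparison, here is the same concatenation pattern as a portable sketch, assuming std::wstring stands in for CStringW. Both function names are hypothetical and only mirror the string and numeric cases of the switch statement in the question. Every literal carries the L prefix so that all operands of + are wchar_t-based.

```cpp
#include <string>

// Hypothetical helper mirroring the VT_BSTR case: the value is wrapped
// in double quotes inside the expression.
std::wstring BuildQuotedExpression(const std::wstring& fieldName,
                                   const std::wstring& value) {
    return L"[" + fieldName + L"] = \"" + value + L"\"";
}

// Hypothetical helper mirroring the VT_R8 and VT_I4 cases: the already
// formatted number is appended without quotes.
std::wstring BuildNumericExpression(const std::wstring& fieldName,
                                    const std::wstring& formattedNumber) {
    return L"[" + fieldName + L"] = " + formattedNumber;
}
```

Keeping the expression building in one small function also means the literals only have to be converted to L"..." in one place.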
Mr.C64
  • I tried your `Utf8FromUtf16` and it returns `"Вода" ATL::CSimpleStringT: "Вода"` – Paul Meems Aug 03 '17 at 15:31
  • How did you visualize the content of the returned CStringA string? Note that a sequence of (chars) bytes can be displayed in different ways based on the _encoding_. Have you checked that the actual bytes in the CStringA represent the expected UTF-8 encoding sequence of the source UTF-16 string? – Mr.C64 Aug 03 '17 at 15:45
  • In addition to my previous comment, consider using the `s8` flag when e.g. printing the UTF-8 string in the VS command window, for example: `? str.GetString(), s8`. (You may read [this blog post](https://blogs.msmvps.com/gdicanio/2016/11/22/whats-wrong-with-my-utf-8-strings-in-visual-studio/) for more details.) – Mr.C64 Aug 03 '17 at 15:49
  • I use VS2013 to debug the code and set a breakpoint and then with a quick watch I can see the value. The method is called using an executable written in C# which only calls this specific method. In the end the data is viewed in a larger C# application. BTW. all code is open-source it is the MapWindow GIS project: https://github.com/MapWindow – Paul Meems Aug 03 '17 at 19:59
  • I've updated my post again (see Edit4) and added more code. I'm now struggling with concatenating the CStringW values together. I had hoped this would be a small fix, but it turns out to be a huge endeavor. Especially when you don't know what you are doing ;) – Paul Meems Aug 11 '17 at 07:11
  • Try decorating your literals with L"...", as these should be Unicode UTF-16 strings (e.g. L"[" instead of "["). – Mr.C64 Aug 11 '17 at 09:56
  • @PaulMeems To be more clear: `(*result)[i].expression = L"[" + fieldName + L"] = \"" + strValue + L"\"";`. Note how all the string literals are expressed using the `L"..."` format, which denotes Unicode UTF-16 strings. – Mr.C64 Aug 11 '17 at 18:10
  • I've marked your reply as answer, thanks for all your help. I could not properly test it all because after this change in several places of the code I get errors due to the CString --> CStringW modification. I tried solving some but in more and more places errors appear. So I need to take some time and try to solve this step by step. For now thanks for all the help. And when you have some free time, please have a look at the code at GitHub. Perhaps you can spot some more improvements. – Paul Meems Aug 14 '17 at 19:41
  • Glad to be of service. You can ask separate question(s) on the other errors you got in your code. – Mr.C64 Aug 16 '17 at 06:57

You won't get a version that always works without converting your project into a Unicode-aware application.

In other words, to support all characters that may appear inside a BSTR, you need a Unicode CString (CStringW).

You may stay with an MBCS build, but in that case you still have to handle Unicode. Using CStringW may be an option here.

Converting to UTF-8 is done with WideCharToMultiByte.

xMRi
  • I tried using `CStringW temp = val->bstrVal` and it seems to work. Can I safely change my variable from CString to CStringW? – Paul Meems Aug 03 '17 at 15:36
  • Yes. Why not? What do you expect? Converting CStringW to UTF-8 is done with WideCharToMultiByte – xMRi Aug 03 '17 at 16:10
  • Sorry for more newbie questions. I updated my post by adding my whole switch statement. When I change to CStringW `strValue.Format("%g", val->dblVal);` doesn't compile anymore. – Paul Meems Aug 03 '17 at 19:54
  • @PaulMeems Try `strValue.Format(L"%g",...` with CStringW. The `L` prefix generates a Unicode UTF-16 string literal. – Mr.C64 Aug 03 '17 at 21:55

How to: Convert Between Various String Types
https://learn.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types

This topic demonstrates how to convert various Visual C++ string types into other strings. The string types that are covered include char *, wchar_t*, _bstr_t, CComBSTR, CString, basic_string, and System.String. In all cases, a copy of the string is made when converted to the new type. Any changes made to the new string will not affect the original string, and vice versa.

caoanan