I am trying to convert CP1251 text to UTF-8. I build a buffer from the hex values of the CP1251 characters and then convert that buffer to UTF-8. The problem is that the converted string sometimes has garbage characters at the end.

The output of converting the same string many times (203 ñòåï ÒÖÍÐ.466219.007 Èíòåðàêòèâíûé êîìïëåêñ NextPanel 43/NAUO1):

theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1омплекс NextPanelВ 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO14V
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1sгцf¤№
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1п}тУї0bЊ.Z¶ї¬ЁЌ€/ГїаА Om›Ґї
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1sгцf¤№
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1™™™™Щ?
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
char *ConvertWindows1251ToUtf8(string stringToConvert)
{
  // string tmpString = stringToConvert.ToCString();

  const char *tmpCharArray = stringToConvert.c_str();

  vector<char> charBuffer;

  char *buffer = new char[char_traits<char>::length(tmpCharArray)];

  int latter = 0;

  for (int i = 0; i < std::char_traits<char>::length(tmpCharArray); i++)
  {
    string tmpHexLatter;
    int hexLatter = 0xFF & tmpCharArray[i];

    stringstream ss;
    ss << hex << hexLatter;

    tmpHexLatter = ss.str();

    if (hexLatter != 0xc3)
    {
      if (hexLatter != 0x30 && hexLatter != 0x31 && hexLatter != 0x32 && hexLatter != 0x33 && hexLatter != 0x34 && hexLatter != 0x35 && hexLatter != 0x36 && hexLatter != 0x37 && hexLatter != 0x38 && hexLatter != 0x39 && hexLatter != 0x61 && hexLatter != 0x62 && hexLatter != 0x63 && hexLatter != 0x64 && hexLatter != 0x65 && hexLatter != 0x66 && hexLatter != 0x67 && hexLatter != 0x68 && hexLatter != 0x69 && hexLatter != 0x6A && hexLatter != 0x6B && hexLatter != 0x6C && hexLatter != 0x6D && hexLatter != 0x6E && hexLatter != 0x6F && hexLatter != 0x70 && hexLatter != 0x71 && hexLatter != 0x72 && hexLatter != 0x73 && hexLatter != 0x74 && hexLatter != 0x75 && hexLatter != 0x76 && hexLatter != 0x77 && hexLatter != 0x78 && hexLatter != 0x79 && hexLatter != 0x7A && hexLatter != 0x41 && hexLatter != 0x42 && hexLatter != 0x43 && hexLatter != 0x44 && hexLatter != 0x45 && hexLatter != 0x46 && hexLatter != 0x47 && hexLatter != 0x48 && hexLatter != 0x49 && hexLatter != 0x4A && hexLatter != 0x4B && hexLatter != 0x4C && hexLatter != 0x4D && hexLatter != 0x4E && hexLatter != 0x4F && hexLatter != 0x50 && hexLatter != 0x51 && hexLatter != 0x52 && hexLatter != 0x53 && hexLatter != 0x54 && hexLatter != 0x55 && hexLatter != 0x56 && hexLatter != 0x57 && hexLatter != 0x58 && hexLatter != 0x59 && hexLatter != 0x5A && hexLatter != 0x2E)
        hexLatter += 64;

      if (hexLatter == 0x60)
        hexLatter = 0xA0;

      if (hexLatter == 0x6F)
        hexLatter = 0x2F;

      stringstream ss;
      ss << hex << hexLatter;
      string tmpHex = ss.str();

      tmpHexLatter = "0x" + tmpHex;

      latter = stoi(tmpHexLatter, {}, 16);

      charBuffer.push_back((char)latter);
    }
  }

  for (int i = 0; i < charBuffer.size(); i++)
  {
    buffer[i] = charBuffer[i];
  }

  return g_convert(buffer, -1, "utf-8", "Windows-1251", NULL, NULL, NULL);

  /*string tmpStr = stringToConvert.ToCString();
  std::unique_ptr<gchar, void (*)(gpointer)> p(g_convert(tmpStr.c_str(), -1, "utf-8", "Windows-1251", NULL, NULL, NULL), g_free);
  return TCollection_AsciiString(p.get());*/
}
XoDefender
  • Why not pass `stringToConvert.c_str()` as the first argument of `g_convert`? – 273K Jul 29 '22 at 06:49
  • Because that does not convert the text correctly. That is why I create the hex buffer and pass it instead. – XoDefender Jul 29 '22 at 06:52
  • Thank you a lot! I have added charBuffer.push_back((char)0x00); at the end and it works. – XoDefender Jul 29 '22 at 06:59
  • Why not just use `iconv` to convert it? – Shawn Jul 29 '22 at 06:59
  • Because it does not convert the string correctly. When I passed the string I read from the file to the iconv function to convert it from CP1251 to UTF-8, the text somehow came out as a mess. – XoDefender Jul 29 '22 at 07:18

2 Answers

You don't need buffer at all; you can pass charBuffer.data() or even stringToConvert.c_str() to g_convert().

But, more importantly, both buffer and charBuffer are not null-terminated, and you are not otherwise passing the final length to g_convert() (you pass -1, which tells it to look for a null terminator). So g_convert() reads past the end of your data, either going out of bounds or converting uninitialized memory. Either way that is undefined behavior, which is why you see garbage at the end of the result.


On a side note, you don't need the "0x" prefix when calling std::stoi() with base=16.

Also, why are you returning a char* instead of a std::string? Who is responsible for allocating and freeing the memory? You really should let std::string handle that.

Remy Lebeau

If you're using Linux, you have the iconv(3) API available to convert between character encodings. Unfortunately, it's a C interface, so it can be a bit ugly to use from C++, but it's still better than whatever you're trying to do with converting codepoints to hex strings.

I dug out an old wrapper class I wrote once to make iconv easier to use from C++ (C++17 or newer):

#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>
#include <string_view>
#include <iconv.h>

// Wrap iconv API in a RAII-styled class for ease of use
class iconv_wrapper {
private:
  iconv_t cd;

public:
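  // Take std::string, not std::string_view: iconv_open() needs
  // null-terminated names, which string_view::data() does not guarantee
  iconv_wrapper(const std::string &from, const std::string &to = "UTF-8")
      : cd{iconv_open(to.c_str(), from.c_str())} {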
    if (cd == reinterpret_cast<iconv_t>(-1)) {
      // Need better error handling
      std::cerr << "Unable to open converter from " << from << " to " << to
                << ": " << std::strerror(errno) << '\n';
      std::exit(EXIT_FAILURE);
    }
  }
  ~iconv_wrapper() noexcept { iconv_close(cd); }
  std::string convert(std::string_view);
};

std::string iconv_wrapper::convert(std::string_view input) {
  // Work out the maximum output size (Assuming converting from a
  // single-byte encoding to UTF-8) and allocate a buffer and the
  // other args needed for iconv
  std::size_t insize = input.size();
  std::size_t outsize = insize * 4;
  std::size_t orig_outsize = outsize;
  auto outbuf = std::make_unique<char[]>(outsize);
  char *indata = const_cast<char *>(&input[0]);
  char *outdata = &outbuf[0];

  // Convert the input argument
  std::size_t ret = iconv(cd, &indata, &insize, &outdata, &outsize);
  if (ret == static_cast<std::size_t>(-1)) {
    // Need better error handling
    std::cerr << "Couldn't convert input data: " << std::strerror(errno)
              << '\n';
    std::exit(EXIT_FAILURE);
  }
  // And return it
  return std::string(outbuf.get(), orig_outsize - outsize);
}

int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " cp1251-encoded-file\n";
    return EXIT_FAILURE;
  }

  std::ifstream in{argv[1], std::ios_base::in | std::ios_base::binary};
  if (!in) {
    std::cerr << "Unable to open " << argv[1] << " for reading!\n";
    return EXIT_FAILURE;
  }

  iconv_wrapper conv{"CP1251"};
  std::string input{std::istreambuf_iterator<char>{in},
                    std::istreambuf_iterator<char>{}};
  std::cout << conv.convert(input);

  return 0;
}
Shawn
  • `std::ifstream in{argv[1], std::ios_base::in | std::ios_base::binary}; iconv_wrapper conv{"CP1251"}; std::string input{std::istreambuf_iterator{in}, std::istreambuf_iterator{}}; std::cout << conv.convert("Èíòåðàêòèâíûé") << endl; return 0;` The output: **Èíòåðà êòèâíûé** – XoDefender Aug 02 '22 at 08:56
  • The desired output is: **Интерактивный** – XoDefender Aug 02 '22 at 09:00
  • @XoDefender That looks like mojibake; you have UTF-8 encoded text that you're treating as if it were in a different, single-byte encoding and then turning those bytes into UTF-8. Which isn't too surprising if you're converting a literal string in your source instead of the contents of a file. – Shawn Aug 02 '22 at 09:39
  • Indeed, È is encoded in UTF-8 as the bytes `C3 88`, and those bytes are the characters Г and € in CP-1251. Don't lie to iconv or anything else about what encoding your text is in and it'll work a lot better. Garbage In, Garbage Out as the old saying goes. – Shawn Aug 02 '22 at 09:46
  • `Èíòåðàêòèâíûé` none of those characters are even in CP-1251 so I have no idea how you expected that to work or where your desired output came from. – Shawn Aug 02 '22 at 09:54
  • Yes, you are right. If I read the text directly from a file, then it converts correctly. If I understand you correctly, all the text I hardcode in a string literal is treated as UTF-8, right? So when I hardcoded the string **Èíòåðàêòèâíûé**, each symbol was encoded as UTF-8? – XoDefender Aug 02 '22 at 10:08