-2

I am trying to read a `wchar_t` string from stdin and then convert it from Unicode to ASCII through a function.

The function is somehow preventing me from using `std::string` later in the program.

#include <iostream>
#include <string>
#include <locale>
#include <cstring>
#include <cwchar>
using namespace std;
bool UnicodeToAscii(wchar_t* szUnicode, char* szAscii);
int main()
{
    wchar_t w[100];
    wcin>>w;
    char* c;
    bool x=UnicodeToAscii(w,c);
    cout<<c<<"\n";
    string s="hi";
    return 0;
}
bool UnicodeToAscii(wchar_t* szUnicode, char* szAscii)
{
    int len, i;
    if((szUnicode == NULL) || (szAscii == NULL))
        return false;
    len = wcslen(szUnicode);
    for(i=0;i<len+1;i++)
        *szAscii++ = static_cast<char>(*szUnicode++);
    return true;
}
AstroCB
  • You are **not** converting any Unicode encoding to ASCII here. Unicode is far more complex. – deviantfan Apr 05 '15 at 18:24
  • Other than that, your test input would help. It could be the reason. – deviantfan Apr 05 '15 at 18:25
  • Ok, that might be wrong, but I tried to typecast a **wchar_t** to a **char** in the function, and now I am unable to use std::string in my program. This is quite strange... – Raghav Somani Apr 05 '15 at 18:27
  • No test input is working... As soon as I enter a word and press enter, I get a Segmentation fault error – Raghav Somani Apr 05 '15 at 18:27
  • ...where do you get memory for `char* c;`? I don't see anything related to that. (And the check if it is NULL in the function is good, but not enough. It doesn't need to be NULL if you've done nothing at all.) And to know how much memory you'll need, you need to understand the encoding better first... – deviantfan Apr 05 '15 at 18:29
  • You mean adding these 2 lines? `int len=wcslen(w); char* c=new char[len];` – Raghav Somani Apr 05 '15 at 18:32
  • Depends on the actual encoding and the implementation of wcslen. And you need a `delete` too if you're using `new`. ... Please stop trying to make this function for now and read something (many things) about Unicode. – deviantfan Apr 05 '15 at 18:33
  • Ok, the above 2 lines work with `delete[] c`. Thank you :) – Raghav Somani Apr 05 '15 at 18:34
  • Except you can't be sure, and you have a memory leak. Great. – deviantfan Apr 05 '15 at 18:35
  • ...and there is still the issue that you may have the correct length, but a wrong conversion. *Your code is still wrong, even if you won't believe me. And depending on the input, it will still crash or do something strange.* – deviantfan Apr 05 '15 at 18:39
  • May I please know more about it? I need to implement this conversion in a bigger program, and I would not like such strange behaviour in it – Raghav Somani Apr 05 '15 at 18:41
  • I can't write a (thick) book just now. Unicode is *really* far more complex. UTF/UCS? 8, 16, 32? Big/little endian? Recognizing the 7-bit ASCII part in the multiple bytes of one char? Filtering out additional stuff like extra codepoints for accents? Unifying different kinds of whitespace and numbers? And many more... – deviantfan Apr 05 '15 at 18:51
  • If you need to do this for school, tell your teacher he's an idiot (sorry). If you need this for yourself, use LibICU (but even knowing enough to understand its documentation will take some time and reading) – deviantfan Apr 05 '15 at 18:51
  • Even the zipped binary-only versions of ICU are more than 10 MB; nobody can write this in one evening – deviantfan Apr 05 '15 at 19:00
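
For reference, a minimal sketch of the LibICU route deviantfan mentions; this assumes ICU is installed and the program is linked against its common library (e.g. -licuuc):

#include <unicode/unistr.h> // icu::UnicodeString
#include <iostream>
#include <string>

int main()
{
    // ICU stores text as UTF-16 internally, handles surrogate
    // pairs for you, and exposes APIs for normalization etc.
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8("h\xC3\xA9llo");

    std::string utf8;
    ustr.toUTF8String(utf8); // convert back to UTF-8

    std::cout << utf8 << "\n";
    return 0;
}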

2 Answers

2

You are not allocating any memory for `c`, so you are writing character data to random memory and corrupting your program.

You should stop using character arrays and raw pointers, and start using `std::string` and `std::wstring` instead. Let them manage memory for you.

Try this:

#include <iostream>
#include <string>

bool UnicodeToAscii(const std::wstring &szUnicode, std::string &szAscii);

int main()
{
    std::wstring w;
    std::wcin >> w; // or std::getline(std::wcin, w);

    std::string c;
    bool x = UnicodeToAscii(w, c);
    std::cout << c << "\n";

    std::string s = "hi";
    return 0;
}

bool UnicodeToAscii(const std::wstring &szUnicode, std::string &szAscii)
{
    szAscii.clear();

    const std::wstring::size_type len = szUnicode.length();
    char c;

    szAscii.reserve(len);

    for (std::wstring::size_type i = 0; i < len; ++i)
    {
        wchar_t w = szUnicode[i];

        if ((w >= 0) && (w <= 0x7F))
        {
            // ASCII character
            c = static_cast<char>(w);
        }
        else
        {
            // non-ASCII character
            c = '?';

            // wchar_t is 2 bytes (UTF-16) on some systems,
            // but is 4 bytes (UTF-32) on other systems.
            // sizeof cannot be used in a preprocessor #if,
            // so test it at runtime instead...
            if ((sizeof(wchar_t) == 2) && (w >= 0xD800) && (w <= 0xDFFF))
            {
                // skip the first unit of a surrogate pair,
                // the loop will skip the second unit...
                ++i;
            }
        }

        szAscii.push_back(c);
    }

    return true;
}

Of course, this is very rudimentary, and it only handles true ASCII characters (0x00 - 0x7F). Handling Unicode correctly is much more complex than this. But this answers your immediate question about why you cannot use `std::string` after calling your function: you are trashing memory.
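
As a point of comparison, the standard library can also do a locale-aware wide-to-narrow conversion. A minimal sketch using std::wcstombs (the null-destination size query is POSIX behavior, and the result depends on the current locale):

#include <clocale>  // std::setlocale
#include <cstdlib>  // std::wcstombs
#include <iostream>
#include <string>

int main()
{
    // use the user's locale instead of the default "C" locale
    std::setlocale(LC_ALL, "");

    const wchar_t *w = L"hello";

    // a first call with a null buffer computes the required length
    std::size_t needed = std::wcstombs(nullptr, w, 0);
    if (needed == static_cast<std::size_t>(-1))
        return 1; // input not representable in the current locale

    std::string narrow(needed, '\0');
    std::wcstombs(&narrow[0], w, needed);

    std::cout << narrow << "\n";
    return 0;
}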

Remy Lebeau
-1
  • You never allocate memory for `c` before writing to the invalid pointer.
  • It's unsafe to `cin >>` into a fixed-size array. You might consider `std::wstring`.
  • If you want to convert 16-bit characters to 8-bit characters, use UTF-8 encoding in the 8-bit string, not ASCII. If you must use ASCII, you will have to error out if any character is out of range, or else replace it with a placeholder character. However, this leaves you without international support. You should be able to find information on converting UTF-16 to UTF-8 in C++ easily; a rough sketch follows below.
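
For illustration, a minimal sketch of that UTF-16 to UTF-8 conversion (the function name Utf16ToUtf8 is mine; it assumes well-formed input, so real code would also need to reject unpaired surrogates):

#include <cstddef>
#include <cstdint>
#include <string>

std::string Utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        std::uint32_t cp = in[i];

        // combine a surrogate pair into one code point
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
        {
            std::uint32_t lo = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }

        // encode the code point as 1-4 UTF-8 bytes
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}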
Adam Leggett
  • Three things: While it's true that converting UTF-16 to UTF-8 is far easier than OP's original plan, who even said that his source data is UTF-16? And who said he can use UTF-8 for whatever he needs the converted data for? And... `converting Unicode to UTF-8`, please not. UTF-8 *is* Unicode, just not all of it. – deviantfan Apr 05 '15 at 18:57
  • I suggested three options for storing the converted data in an 8-bit format. I'm not aware of any additional options. – Adam Leggett Apr 05 '15 at 19:01
  • While I don't see *three* different conversion methods in your answer, it doesn't matter, because it doesn't answer the question. (And, a fourth thing: OP doesn't use cin) – deviantfan Apr 05 '15 at 19:03
  • @deviantfan The first bullet point does answer the question, as the problem is that OP did not allocate memory for `c`. – M.M Apr 09 '15 at 00:45