Required to convert a String to UTF8 string

Question

Problem statement: I need to convert a generated string to a UTF-8 string. The generated string contains extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).

A POC is still in progress, so I can only provide small code samples; the complete solution can be posted once it is ready.

Why I need UTF-8: I have extended ASCII characters to be stored in a string, which has to be UTF-8.

How I am proceeding:

Convert generated string to wchar_t string.

Please look at the below sample code

int main(){
  char  CharString[] = "Prova";
  iconv_t cd;
  wchar_t  WcharString[255];

  size_t size= mbstowcs(WcharString, CharString, strlen(CharString));

  wprintf(L"%ls\n", WcharString);

  wprintf(L"%s\n", WcharString);

  printf("\n%zu\n",size);
}

One question here:

Output is

Prova?????

s

Why is the size not printed here?
Why does the second wprintf print only one character?
If I print size before both strings, then only 5 is printed and both strings are missing from the console.

Moving on to the second part:

Now that I have a wchar_t string, I want to convert it to a UTF-8 string.

While searching for a way to do this, I found that iconv can help here.

Question: these are the functions I found in the manual:

iconv_t iconv_open(const char *, const char *);
size_t  iconv(iconv_t, char **, size_t *, char **, size_t *);
int     iconv_close(iconv_t);

Do I need to convert the wchar_t array back to a char array before feeding it to iconv?

Please provide suggestions on the above issues.

The extended ASCII characters I am talking about: please see the letters "i" marked in the snapshot below. [image]

isn't `wprintf(L"s\n", WcharString);` should be `wprintf(L"%s\n", CharString);` or something? — Sourav Ghosh, Jun 18 '15 at 14:46
Partially related: on Linux virtually nobody uses `wchar_t`, but all strings are normally narrow-strings (`char *`) encoded in UTF-8; are you explicitly choosing to use `wchar_t` (if so, why?) or it's mandated by some library you are using? — Matteo Italia, Jun 18 '15 at 14:47
What do you mean by the *extended ASCII* you have as input? Is it an array of bytes where each byte correspond to a character, even for values 128 and above? Why would you want to use `wchar_t` then? A simple table with the corresponding UTF-8 byte sequence for the 256 entries and you're done. — Didier Trosset, Jun 18 '15 at 14:48
@SouravGhosh Yes missed it should I be putting screen shots here , — Sanyam Goel, Jun 18 '15 at 14:57
I don't know why people are more interested in putting a negative here. If you can help, try to. If you can't, mind your own business instead of playing with the negative and positive arrows — Sanyam Goel, Jun 18 '15 at 14:57
@SanyamGoel no, text code is better. and don't take the votes personally, you're not new here, so, you should be knowing.... — Sourav Ghosh, Jun 18 '15 at 14:58
@n.m.: actually, the real problem is that there are *too many* "extended ASCII"s. :-) — Matteo Italia, Jun 18 '15 at 14:59
@Didier Trosset Please look at this string , This is the kind of string I am required to convert to a UTF8 . Please check in attached snapshot the two letters i , these are the ones — Sanyam Goel, Jun 18 '15 at 15:08
@n.m. https://en.wikipedia.org/?title=Extended_ASCII Please help wiki remove this page — Sanyam Goel, Jun 18 '15 at 15:11
Are you sure those are not 8 bit extended ASCII? Check codes 204-207 — imreal, Jun 18 '15 at 15:23
The extended ascii are from The extended ASCII codes (character code 128-255) http://www.ascii-code.com/ — Sanyam Goel, Jun 18 '15 at 15:25
@imreal I didnt get 202 and all are not extended ascii . Please correct me — Sanyam Goel, Jun 18 '15 at 15:30
The most basic form of extended ASCII is the codes from 128-255, i.e. use of the 8th bit. The characters in your snapshot are in 204-207. It is transparent to the language (on 8 bit char platforms) to represent those values in a simple `char`. Hence you don't need UTF-8; did I misunderstand your question? — imreal, Jun 18 '15 at 15:33
@imreal, OK. One more thing here: we are working on a Linux system. This snapshot was taken on a Windows machine. When viewing the log in less command I see and in place of i's . What does this mean? Why is the character representation not available here? — Sanyam Goel, Jun 18 '15 at 15:39
Actually those codes might be wrong, they might be 214-216, check http://www.theasciicode.com.ar/extended-ascii-code/capital-letter-a-acute-accent-ascii-code-181.html — imreal, Jun 18 '15 at 15:42
@imreal, I am pretty sure I am describing the problem correctly. And those who are doubting and have put a negative did not even try to answer. There are so many questions no one bothered to answer. I need help; I am not here to debate, and this is not a college assignment, please mind. After so much searching I posted here, and the result is that people are debating. Did anyone read this entirely? No idea. — Sanyam Goel, Jun 18 '15 at 15:47
And this was not for you, imreal. I am addressing those who would rather debate than answer. I am putting this problem here because I am getting no clues on it; otherwise I have a hell of a lot of work to do for the entire day instead of posting stupid questions on Stack for reputation :) — Sanyam Goel, Jun 18 '15 at 15:47
@Sourav Ghosh Now that the problem was not that missing % can you please give your expert advice on the behavior? — Sanyam Goel, Jun 18 '15 at 15:54
Your link, second sentence: *The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue*. See? I'm criticizing the term, and Wikipedia reflects that. — n. m. could be an AI, Jun 19 '15 at 18:25
@n.m.No debates here. Taken your points . if you feel like updating anything do suggest. If you look at the below answer, It is good enough to help me and all others who were bragging out for a debate then to even answer all the questions :). How rici has explained below is I feel only a experienced person could have done. This is what I was expecting, not a debate. Any ways thanks for your help too — Sanyam Goel, Jun 20 '15 at 08:46

rici · Accepted Answer · 2015-06-18T16:47:39.110

For your first question (which I am interpreting as "why is all the output not what I expect"):

Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. That means that the second byte of the wchar_t (counting from the end, since you have a little-endian architecture) is a 0, so if you treat the string as a string of characters, it will be terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.

Now, let's move on to your second question: What do I do about it?

The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.

Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)

If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.

The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.

You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.

Thank you so much . This is what I was expecting a person who could tell me in depth and elaborate, Your last suggestion seems to be very nice . I will do that and post here whatever happens. and yes you caught it correct I am connecting through putty :) Thank you once again . — Sanyam Goel, Jun 18 '15 at 17:12

score 1 · Answer 2 · answered Jun 18 '15 at 14:49

It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself. That is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8. You don't even need wchar_t; just use uint32_t for your characters. You will learn a lot by implementing it yourself, and your program will gain speed from not using the mb or iconv functions.

ikrabbe

To output the string, OP still is going to need `wchar_t` (where – hey! again, it's important but not mentioned in the post! – I'm guessing he's on Windows). But I agree that UTF8 is extremely simple to implement. – Jongware Jun 18 '15 at 14:54
Most people think UTF8 is some kind of magic, but it's just a simple encoding that leaves 7-bit ASCII alone and encodes the rest of the Unicode values, which need at most 21 bits. Whether you really need wchar_t for output depends on the encoding you want to encode to. But if you ask me, I would simply throw away any system that still uses legacy character tables. With gcc, I think wchar_t is just typedef'd as `int` on x86_64 – ikrabbe Jun 18 '15 at 17:11
