Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

UTF-8 (sometimes written simply as "UTF") is an ASCII superset. Its byte scheme can also be stretched to carry data that is invalid in the other encodings, such as unpaired surrogates, although strict UTF-8 rejects such sequences. Unless there is a compelling compatibility constraint, this encoding is preferred.
Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
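
As a rough illustration of the list above, the following Python sketch encodes the same short string with several of these encodings using the codecs that ship with CPython. The sample strings and variable names are only illustrative assumptions, not part of the tag description.

```python
# Compare how a few of the encodings listed above represent the same text.
ascii_text = "plain ASCII"
mixed_text = "m\u00fcnchen"  # "münchen", one non-ASCII character

# UTF-8 is an ASCII superset: pure ASCII text encodes to identical bytes.
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

utf8_bytes = mixed_text.encode("utf-8")      # the ü becomes two bytes
utf7_bytes = mixed_text.encode("utf-7")      # output stays within 7-bit ASCII
puny_bytes = mixed_text.encode("punycode")   # ASCII-only form used inside IDNs
gb_bytes = mixed_text.encode("gb18030")      # GB18030 can round-trip the text too

for name, data in [("utf-8", utf8_bytes), ("utf-7", utf7_bytes),
                   ("punycode", puny_bytes), ("gb18030", gb_bytes)]:
    print(f"{name:>8}: {data!r} ({len(data)} bytes)")
```

Each result decodes back to the original string with the matching codec, which is the sense in which all of these encodings can express the same code points.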

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

UTF-16 (utf-16le): Early adopters who embraced UCS-2, back when 64K code points were thought to be enough, moved to this encoding. Apart from unpaired surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (strictly speaking, not even UTF-32 is).
UTF-32 (identical to UCS-4, the modern UCS) uses one code unit per code point. Because combining code points negate this one questionable benefit, and because of its large storage demand, it is seldom used even for internal representation.
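
The size and byte-order points above can be checked with a small Python sketch. The helper name `encoded_sizes` and the sample strings are hypothetical and chosen only for illustration.

```python
import codecs
import unicodedata

def encoded_sizes(text):
    """Hypothetical helper: encoded length in bytes, without a BOM."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

ascii_sizes = encoded_sizes("hello")        # ASCII: UTF-8 is the smallest
cjk_sizes = encoded_sizes("\u4f60\u597d")   # BMP CJK: UTF-16 is the smallest
emoji_sizes = encoded_sizes("\U0001F600")   # outside the BMP: all need 4 bytes

# The three variants: explicit big-endian, explicit little-endian,
# and the generic name, which writes a BOM in the platform's byte order.
be = "A".encode("utf-16-be")
le = "A".encode("utf-16-le")
with_bom = "A".encode("utf-16")
assert with_bom.startswith(codecs.BOM_UTF16)

# Even UTF-32 is not one unit per user-perceived character:
# "e" followed by a combining acute accent is two code points.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(ascii_sizes, cjk_sizes, emoji_sizes)
print(len(decomposed), len(composed))
```

The emoji case shows that UTF-16 needs a surrogate pair outside the Basic Multilingual Plane, and the last lines show why counting UTF-32 code units still does not count displayed characters.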

Resources

Wikipedia on Unicode

857 questions

1 answer

Is utf-8 null the same as utf-16/utf-32 null?

Does a single zero byte mean null in UTF-16 and UTF-32, as it does in UTF-8, or do we need 2 and 4 zero bytes respectively to create a null in UTF-16 and UTF-32?

unicode null utf

asked Apr 21 '10 at 18:55

Mackan

3 answers

Python read from file and remove non-ascii characters

I have the following program that reads a file word by word and writes the word again to another file but without the non-ascii characters from the first file. import unicodedata import codecs infile =…

python encoding character-encoding utf

asked Oct 14 '14 at 19:44

user1894963

4 answers

How to convert (not necessarily programmatically) between Windows' wchar_t and GCC/Linux one?

Suppose I have this Windows wchar_t string: L"\x4f60\x597d" and L"\x00e4\x00a0\x597d" and would like to convert it (not necessarily programmatically; it will be a one-time thing) to GCC/Linux wchar_t format, which is UTF-32 AFAIK. How do I do it?…

c++ linux utf wchar-t

asked Oct 25 '08 at 08:55

Paweł Hajdan

4 answers

How to print degree symbol on the window using qt5(QtQuick 2.1) and above

Up to Qt 4.8 (Qt Quick 1.1), I was able to print the degree symbol in my GUI with \260, but after upgrading to Qt 5 and above this stopped working. I searched the net and found many relevant links such as…

c++ c qt qml utf

asked Sep 12 '13 at 07:37

zeal

1 answer

Emacs Not Displaying Unicode on Reload

When I insert an — (em dash) into a text file, Emacs initially displays it fine. When I reload Emacs, all instances of — are displayed as \342\200\224. How can I get Emacs to display the characters as it did initially? I'm using Windows 7 and Emacs…

windows emacs unicode utf

asked Jul 08 '13 at 19:47

Simon Morgan

2 answers

How to determine the length of a CLOB (in bytes) using the AL32UTF character set in Oracle?

Is there a way to find out the number of bytes used by a particular field value (which may or may not be longer than 4000 characters) in an Oracle SQL query? dbms_lob.getLength() returns the number of characters, not bytes, and I can't just do a…

sql oracle byte clob utf

asked Nov 27 '12 at 15:53

Steve Chambers

2 answers

Handling UTF filenames in Python

I've read quite a bit on the topic already, including what seems to be the definitive guide on this topic here: http://docs.python.org/howto/unicode.html Perhaps for a more experienced developer, that guide may be enough. However, in my case, I'm…

python windows filenames utf

asked Jul 18 '12 at 15:42

user1535316

1 answer

Working with UTF8

It seems like a rather complicated issue to work with std::string and UTF8 and I cannot find a good explanation of do's and dont's. How can I properly work with UTF8 in C++? It is rather confusing. I've found boost::locale and I set the global…

c++ string boost locale utf

asked Jun 10 '12 at 10:14

ronag

1 answer

T-SQL code for converting nvarchar string to UTF-8 (for URL percent-encoding)

I need to generate a URL string for an SSRS report (in order to link it with our CRM software). The report name is in Hebrew. When I send the URL string (with the Hebrew text) to Internet Explorer, it doesn't recognize the address because it isn't encoded with…

sql tsql reporting-services uri utf

asked Apr 08 '12 at 14:02

Ido Gal

3 answers

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can…

html unicode character-encoding utf

asked Jan 19 '12 at 07:02

user1148809

2 answers

Illegal Character on Jenkins Server

I have had several problems with my Jenkins build server and I don't know where they come from... I'm getting this error message: illegal character: \65279, which seems to be the UTF-16 BOM. When I open the corresponding file with a hex editor, I can't…

java jenkins utf byte-order-mark

asked Jan 17 '12 at 10:29

reox

2 answers

How to save a Chinese character in MySQL

I am unable to save the character in MySQL 5.5. I have tried the utf8mb4 and utf32 collations. I have to store both Chinese and English characters in the same table.

mysql utf-8 utf utf8mb4

asked Nov 22 '11 at 17:22

geoaxis

2 answers

SQL Server (SQLCMD), Python and encoding issue when using non ascii chars

I'm facing an encoding issue with my Python code when querying data stored in SQL Server 2005. (Because I was unable to compile PyMSSQL-2.0.0b1) I'm using this piece of code, and I am able to do some selects, but now I'm stuck with the issue that I do…

python sql-server-2005 encoding sqlcmd utf

asked Nov 03 '11 at 10:57

lctv31

2 answers

How to validate a utf sequence in PHP?

After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF data to ensure it's valid and coherent. There seem to be various regexps and PHP APIs to detect whether a string is UTF, but the ones I've seen…

php utf-8 utf

asked Oct 23 '11 at 21:39

carpii

4 answers

How to Store Emojis in char8_t and Print Them Out in C++20?

I just now heard about the existence of char8_t, char16_t and char32_t and I am testing it out. When I try to compile the code below, g++ throws the following error: error: use of deleted function ‘std::basic_ostream&…

c++ utf-8 c++20 emoji utf

asked Feb 26 '23 at 21:52

Sheldon
