Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storage and exchange of those code points. All of them can express all valid Unicode code points, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences in the other encodings, and is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
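As a rough illustration of the list above, the following Python sketch encodes the same short string with the codecs that the standard library ships (`utf-8`, `utf-7`, `gb18030`, `punycode`). The sample strings and the helper name `encode_with` are assumptions made for this example, and UTF-EBCDIC is left out because Python has no built-in codec for it.

```python
# Sample strings chosen for illustration only (assumptions, not from the wiki).
ASCII_TEXT = "hello"
MIXED_TEXT = "h\u00e9llo\u4f60\u597d"  # "héllo" followed by two Chinese characters

def encode_with(text, codecs_to_try=("utf-8", "utf-7", "gb18030", "punycode")):
    """Return {codec name: encoded bytes} for each codec in codecs_to_try."""
    return {name: text.encode(name) for name in codecs_to_try}

# UTF-8 is an ASCII superset: pure ASCII text encodes to identical bytes.
assert ASCII_TEXT.encode("utf-8") == ASCII_TEXT.encode("ascii")

encoded = encode_with(MIXED_TEXT)
for name, data in encoded.items():
    # Each codec here can represent the text, so decoding round-trips.
    assert data.decode(name) == MIXED_TEXT
    print(f"{name:9} {len(data):2} bytes  {data!r}")

# UTF-7 and Punycode output stays within 7-bit ASCII, which is what makes
# them usable on channels that are not 8-bit clean and in domain labels.
assert all(byte < 0x80 for byte in encoded["utf-7"])
assert all(byte < 0x80 for byte in encoded["punycode"])
```

The byte counts printed per codec give a feel for the size trade-offs the wiki mentions; they depend entirely on the sample text.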

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: early adopters who embraced Unicode when people thought 64k code points were enough moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): this is the encoding with one code unit per code point. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
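The width and byte-order points above can be checked with a small Python sketch. The helper `code_units` and the sample characters are assumptions for illustration; the codec names (`utf-16`, `utf-16-le`, `utf-16-be`, `utf-32-le`) are the ones in Python's standard library.

```python
import codecs

def code_units(text, encoding, unit_bytes):
    """Number of code units `text` occupies in a BOM-less encoding."""
    return len(text.encode(encoding)) // unit_bytes

bmp_char = "\u597d"          # inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # outside it, needs a surrogate pair in UTF-16
combined = "e\u0301"         # 'e' + COMBINING ACUTE ACCENT, one visible character

# UTF-16 is variable width: one or two 16-bit code units per code point.
assert code_units(bmp_char, "utf-16-le", 2) == 1
assert code_units(astral_char, "utf-16-le", 2) == 2

# UTF-32 uses exactly one 32-bit code unit per code point...
assert code_units(astral_char, "utf-32-le", 4) == 1
# ...but a user-perceived character can still span several code points.
assert code_units(combined, "utf-32-le", 4) == 2

# The three variants: explicit big-endian, explicit little-endian, and
# "any-endian", where a leading BOM records the byte order actually used.
assert "A".encode("utf-16-be") == b"\x00A"
assert "A".encode("utf-16-le") == b"A\x00"
with_bom = "A".encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert with_bom.decode("utf-16") == "A"
```

Which byte order the plain `utf-16` codec writes depends on the platform, which is why the sketch only checks that some BOM is present.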

857 questions
4 votes, 1 answer

Is utf-8 null the same as utf-16/utf-32 null?

Does a single zero byte mean null in UTF-16 and UTF-32, as it does in UTF-8, or do we need 2 and 4 zero bytes, respectively, to create a null in UTF-16 and UTF-32?
Mackan
4 votes, 3 answers

Python read from file and remove non-ascii characters

I have the following program that reads a file word by word and writes the word again to another file but without the non-ascii characters from the first file. import unicodedata import codecs infile =…
user1894963
4 votes, 4 answers

How to convert (not necessarily programmatically) between Windows' wchar_t and GCC/Linux one?

Suppose I have this Windows wchar_t string: L"\x4f60\x597d" and L"\x00e4\x00a0\x597d" and would like to convert it (not necessarily programmatically; it will be a one-time thing) to GCC/Linux wchar_t format, which is UTF-32 AFAIK. How do I do it?…
Paweł Hajdan
4 votes, 4 answers

How to print degree symbol on the window using qt5(QtQuick 2.1) and above

When I was using Qt up to 4.8 (Qt Quick 1.1) for the GUI, I was able to print the degree symbol with \260, but after upgrading to Qt 5 and above this stopped working. I searched the net and found many relevant links such as…
zeal
4 votes, 1 answer

Emacs Not Displaying Unicode on Reload

When I insert an — (em dash) into a text file, Emacs initially displays it fine. When I reload Emacs, all instances of — are displayed as \342\200\224. How can I get Emacs to display the characters as it did initially? I'm using Windows 7 and Emacs…
Simon Morgan
4 votes, 2 answers

How to determine the length of a CLOB (in bytes) using the AL32UTF character set in Oracle?

Is there a way to find out the number of bytes used by a particular field value (which may or may not be longer than 4000 characters) in an Oracle SQL query? dbms_lob.getLength() returns the number of characters, not bytes, and I can't just do a…
Steve Chambers
4 votes, 2 answers

Handling UTF filenames in Python

I've read quite a bit on the topic already, including what seems to be the definitive guide on this topic here: http://docs.python.org/howto/unicode.html Perhaps for a more experienced developer, that guide may be enough. However, in my case, I'm…
4 votes, 1 answer

Working with UTF8

It seems like a rather complicated issue to work with std::string and UTF8 and I cannot find a good explanation of do's and dont's. How can I properly work with UTF8 in C++? It is rather confusing. I've found boost::locale and I set the global…
ronag
4 votes, 1 answer

T-SQL code for converting nvarchar string to UTF-8 (for URL percent-encoding)

I need to generate a URL string for an SSRS report (in order to link it with our CRM software). The report name is in Hebrew. When I send the URL string (with Hebrew) to Internet Explorer, it doesn't recognize the address because it isn't encoded with…
Ido Gal
3 votes, 3 answers

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can…
user1148809
3 votes, 2 answers

Illegal Character on Jenkins Server

I have now had several problems with my Jenkins build server and I don't know where they come from. I'm getting this error message: illegal character: \65279, which seems to be a UTF-16 BOM. When I open the corresponding file with a hex editor, I can't…
reox
3 votes, 2 answers

How to save a Chinese character in MySQL

I am unable to save the character in MySQL 5.5. I have tried the utf8mb4 and utf32 collations. I have to store both Chinese and English characters in the same table.
geoaxis
3 votes, 2 answers

SQL Server (SQLCMD), Python and encoding issue when using non ascii chars

I'm facing an encoding issue with my Python code when querying data that is in SQL Server 2005. (Because I was unable to compile PyMSSQL-2.0.0b1) I'm using this piece of code and I am able to do some selects, but now I'm stuck with the issue that I do…
lctv31
3 votes, 2 answers

How to validate a utf sequence in PHP?

After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF data to ensure it's valid and coherent. There seem to be various regexps and PHP APIs to detect whether a string is UTF, but the ones I've seen…
carpii
3 votes, 4 answers

How to Store Emojis in char8_t and Print Them Out in C++20?

I just now heard about the existence of char8_t, char16_t and char32_t and I am testing it out. When I try to compile the code below, g++ throws the following error: error: use of deleted function ‘std::basic_ostream&…
Sheldon