Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and ability to represent invalid data.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings and is an ASCII superset. Absent a compelling compatibility constraint, this is the preferred encoding.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
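
A few of the properties listed above can be checked with Python's built-in codecs. This is only a minimal sketch: the sample strings are arbitrary choices, and Python ships no UTF-EBCDIC codec, so that encoding is left out.

```python
# Sketch using only Python's built-in codecs; the sample strings are arbitrary.

ascii_text = "plain ASCII"
# UTF-8 is an ASCII superset: pure ASCII text encodes to the same bytes.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

euro = "€"  # U+20AC EURO SIGN
utf8_euro = euro.encode("utf-8")    # three bytes: e2 82 ac
gb_euro = euro.encode("gb18030")    # GB18030 can also represent the euro sign
utf7_euro = euro.encode("utf-7")    # 7-bit-safe form: ASCII bytes only

# The "idna" codec applies Punycode to each non-ASCII label of a domain name
# and adds the "xn--" prefix seen in internationalized domain names.
idn_label = "münchen".encode("idna")

print(utf8_euro, gb_euro, utf7_euro, idn_label)
```

The byte strings printed for GB18030 and UTF-7 are best inspected rather than memorized; the point is only that each codec round-trips the same code point through a different byte representation.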

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: Early adopters who embraced UCS-2 when people thought 64k code points would be enough moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed width (strictly speaking, not even UTF-32 is).
  • UTF-32 (identical to UCS-4): the encoding with one code unit per code point. Because combining code points negate this only, questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
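
The width and byte-order claims above can be illustrated with a short Python sketch. The helper `code_units` is a hypothetical name introduced here for illustration, and the sample characters are arbitrary.

```python
# Sketch with built-in codecs; `code_units` is a hypothetical helper.

def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of code units `text` occupies in a BOM-less encoding."""
    return len(text.encode(encoding)) // unit_size

bmp = "€"          # U+20AC, inside the Basic Multilingual Plane
astral = "😀"      # U+1F600, outside it

# UTF-16 is not fixed width: astral code points need a surrogate pair.
utf16_units = (code_units(bmp, "utf-16-le", 2), code_units(astral, "utf-16-le", 2))
# UTF-32 always uses one 4-byte code unit per code point.
utf32_units = (code_units(bmp, "utf-32-le", 4), code_units(astral, "utf-32-le", 4))

# The three variants: explicit big-endian, explicit little-endian, BOM-prefixed.
be = bmp.encode("utf-16-be")       # bytes 20 ac
le = bmp.encode("utf-16-le")       # bytes ac 20
with_bom = bmp.encode("utf-16")    # BOM, then the text in native byte order

# One user-perceived character can still span several code points,
# so even UTF-32 is not "one unit per character".
combined = "e\u0301"               # "e" + COMBINING ACUTE ACCENT
combined_units = code_units(combined, "utf-32-le", 4)

print(utf16_units, utf32_units, be, le, with_bom, combined_units)
```

The last value is the reason the fixed width of UTF-32 buys little in practice: text processing that cares about user-perceived characters has to handle multi-code-point sequences in every encoding.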


857 questions
31
votes
2 answers

What is the difference between UTF-32 and UCS-4?

What is the difference between UTF-32 and UCS-4 ? Isn't UTF-32 supposed to be a fixed-width encoding ?
Virus721
  • 8,061
  • 12
  • 67
  • 123
28
votes
4 answers

Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Suppose you have a string like "€foo\xA0", encoded UTF-8, Is there a way to remove invalid byte sequences from this string? ( so you get "€foo" ) In ruby-1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0") but that is now deprecated.…
StefanH
  • 413
  • 1
  • 5
  • 7
23
votes
2 answers

What characters do not directly map from Cp1252 to UTF-8?

I've read in several stackoverflow answers that some characters do not directly map (or are even "unmappable") when converting from Cp1252 (aka Windows-1252; they're the same, aren't they?) to UTF-8, e.g. here:…
Christian
  • 6,070
  • 11
  • 53
  • 103
23
votes
12 answers

How to convert php array to utf8?

I have an array: require_once ('config.php'); require_once ('php/Db.class.php'); require_once ('php/Top.class.php'); echo "db"; $db = new Db(DB_CUSTOM); $db->connect(); $res = $db->getResult("select first 1 * from reklamacje"); print_r($res); I…
user2369594
  • 231
  • 1
  • 2
  • 5
20
votes
4 answers

How To Display UTF8 In Netbeans 7?

In my java project, I need to use Arabic text and strings, but the text becomes like "???????" , so what wrong ? and how to resolve this problem? thanks
Radi
  • 6,548
  • 18
  • 63
  • 91
19
votes
4 answers

OSX Emacs: unbind just the right alt?

I'm using emacsformacosx.com and would like to stop the Meta_R (right meta, or right option key) on my Apple keyboard from being an Emacs meta key. The reason is that I want to be able to continue using the right option key as a character modifier…
markhellewell
  • 24,390
  • 1
  • 21
  • 21
18
votes
2 answers

Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road. I see references for UTF-8, UTF-16 and UTF-32. My question is two parts: What languages does…
James Oravec
  • 19,579
  • 27
  • 94
  • 160
17
votes
2 answers

What is QString::toUtf8 doing?

This may sounds like a obvious question, but I'm missing something about either how UTF-8 is encoded or how the toUtf8 function works. Let's look at a very simple program QString str("Müller"); qDebug() << str << str.toUtf8().toHex(); Then I get…
Johan
  • 20,067
  • 28
  • 92
  • 110
15
votes
4 answers

Strange unicode characters when reading in file in node.js app

I am attempting to write a node app that reads in a set of files, splits them into lines, and puts the lines into an array. Pretty simple. It works on quite a few files except some SQL files that I am working with. For some reason I seem to be…
d512
  • 32,267
  • 28
  • 81
  • 107
14
votes
1 answer

difference between NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET for Oracle

I would like to know the difference between NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET settings in Oracle? From my understanding, NLS_NCHAR_CHARACTERSET is for NVARCHAR data types and for NLS_CHARACTERSET would be for VARCHAR2 data types. I tried…
13
votes
4 answers

is PHP str_word_count() multibyte safe?

I want to use str_word_count() on a UTF-8 string. Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()). But on php.net there are a lot of people muddying the water by presenting their…
carpii
  • 1,917
  • 4
  • 20
  • 24
13
votes
1 answer

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version: If I wanted to write program that can efficiently perform operations with Unicode characters, being able to input and output files in UTF-8 or UTF-16 encodings. What is the appropriate way to do this with C++? Long version: C++…
Poeta Kodu
  • 1,120
  • 8
  • 16
13
votes
1 answer

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and…
Ian Gilham
  • 1,916
  • 3
  • 20
  • 31
12
votes
2 answers

Python psycopg2 not in utf-8

I use Python to connect to my postgresql data base like this: conn=psycopg2.connect(database="fedour", user="fedpur", password="***", host="127.0.0.1", port="5432") No problem for that. But when I make a query and I want to print the cursor I have…
Fedour
  • 367
  • 1
  • 3
  • 18
11
votes
2 answers

Difference between readAsBinaryString and readAsText using FileReader

So as an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back to me when I read it using FileReader.readAsText(blob) which is expected. But when I use FileReader.readAsBinaryString(blob), I…
gengkev
  • 1,890
  • 2
  • 20
  • 31