Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for the storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, efficiency, and how much invalid data they can express.

  • UTF-8 (people sometimes write just "UTF" for this encoding) is an ASCII superset, and its byte scheme can also represent sequences that are invalid in the other encodings, such as lone surrogates, although strict UTF-8 decoders reject them. If there is no compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, like old email, but it never gained much popularity even there.
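As a rough illustration of the properties listed above, the following Python sketch uses the standard library codecs named `utf-8`, `punycode`, `idna`, `utf-7` and `gb18030`. The sample strings and the helper `utf8_width` are made up for this example and are not part of the tag description.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for one code point."""
    return len(ch.encode("utf-8"))

# ASCII text is byte-for-byte identical in UTF-8 (the ASCII superset property).
assert "hello".encode("utf-8") == "hello".encode("ascii")

# UTF-8 is variable width: 1 to 4 bytes per code point.
# Samples: U+0041 (A), U+00FC (u with diaeresis), U+4F60 (CJK), U+1F600 (emoji).
widths = {ch: utf8_width(ch) for ch in ["A", "\u00fc", "\u4f60", "\U0001F600"]}

# A strict UTF-8 encoder rejects a lone surrogate, but the byte scheme itself
# can carry it when the check is relaxed with errors="surrogatepass".
lone_surrogate = "\ud83d"
try:
    lone_surrogate.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
relaxed_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")

# Domain labels use Punycode; the "idna" codec adds the "xn--" prefix.
punycode_label = "b\u00fccher".encode("punycode")
idna_label = "b\u00fccher".encode("idna")

# UTF-7 output stays within 7-bit bytes.
utf7_bytes = "\u00fc".encode("utf-7")

# GB18030 can round-trip code points outside the BMP as well.
gb_roundtrip = "\u4f60\U0001F600".encode("gb18030").decode("gb18030")

print(widths, relaxed_bytes, punycode_label, idna_label, utf7_bytes)
```

The lone-surrogate case is only meant to show what the byte layout permits; data produced that way is not valid UTF-8 and most decoders will refuse it.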

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16 (the successor to UCS-2): early adopters who embraced UCS-2 when people thought 64k code points would be enough moved to this encoding. Apart from orphaned surrogates, it cannot represent bad UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): this is the one-code-unit-per-code-point encoding. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
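The byte-order variants and the code unit counts can be checked with a short Python sketch. It assumes Python's codec names `utf-16-be`, `utf-16-le`, `utf-16` (BOM plus native order) and `utf-32-be`; the sample text and the helper `code_units` are invented for illustration.

```python
import codecs

# U+0061 followed by U+1F600, which lies outside the Basic Multilingual Plane.
text = "a\U0001F600"

# Explicit byte order: no BOM is written.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# "Any-endian" variant: Python writes a BOM, then uses the platform's byte order.
utf16_bom = text.encode("utf-16")

def code_units(s: str, encoding: str, unit_size: int) -> int:
    """Count code units by dividing the encoded length by the unit size in bytes."""
    return len(s.encode(encoding)) // unit_size

# UTF-16 needs a surrogate pair for U+1F600, so two code points take three units.
utf16_units = code_units(text, "utf-16-be", 2)
# UTF-32 always uses one unit per code point.
utf32_units = code_units(text, "utf-32-be", 4)

# "e" plus U+0301 (combining acute accent) displays as one character but is
# still two code points, so even UTF-32 is not one unit per visible character.
combining_units = code_units("e\u0301", "utf-32-be", 4)

print(utf16_be.hex(), utf16_le.hex(), utf16_units, utf32_units, combining_units)
```

The last count is the reason fixed-width code points buy less than they seem to: text processing that cares about what the user sees still has to handle multi-code-point sequences.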

857 questions
11
votes
6 answers

What is the Best UTF

I'm really confused about UTF in Unicode. There are UTF-8, UTF-16 and UTF-32. My question is: which UTFs support all Unicode blocks? What is the best UTF (performance, size, etc.), and why? What is the difference between these three UTFs? What is…
Ahmad
  • 4,224
  • 8
  • 29
  • 40
11
votes
4 answers

Is there any reason not to use UTF-8, 16, etc. for everything?

I know the web is mostly standardizing towards UTF-8 lately and I was just wondering if there was any place where using UTF-8 would be a bad thing. I've heard the argument that UTF-8, 16, etc may use more space but in the end it has been…
Joe Phillips
  • 49,743
  • 32
  • 103
  • 159
11
votes
2 answers

PDFBox U+00A0 is not available in this font's encoding

I am facing a problem when invoking the setValue method of a PDField and trying to set a value which contains special characters. field.setValue("TEST-BY  (TEST)") In detail, if my value contains characters as U+00A0 i am getting the following…
assuna
  • 135
  • 1
  • 1
  • 8
11
votes
4 answers

SQL doesn't differentiate u and ü although collation is utf8mb4_unicode_ci

In a table x, there is a column with the values u and ü. SELECT * FROM x WHERE column='u'. This returns u AND ü, although I am only looking for the u. The table's collation is utf8mb4_unicode_ci . Wherever I read about similar problems, everyone…
Jakob
  • 111
  • 1
  • 5
11
votes
2 answers

Content is not allowed in prolog

I'm trying to convert XML to HTML using XSLT. I am using java.xml.transform to do this in Java. It was working fine until I bumped into some XML that produced the following error: [Fatal Error] :1:1: Content is not allowed in prolog. …
Senthil Kumar
  • 9,695
  • 8
  • 36
  • 45
11
votes
2 answers

How can I put a , or any other emoji inside an XML string?

How can I do this? I'm pretty new to Java and Android and I have the problem described above. When I paste the emoji inside the xml file it shows a white square and another weird character which "copies" the next character. Any idea on how to work…
Donfo
  • 113
  • 1
  • 1
  • 5
11
votes
2 answers

MongoDB special characters

I have inserted an init file into MongoDB: db.User.insert({ "_id" : ObjectId("5589929b887dc1fdb501cdba"), "_class" : "com.smartinnotec.aposoft.dao.domain.User", "title" : "DI.", ... "address" : { "_id" : null, ... "country" : "Österreich" }}) And…
quma
  • 5,233
  • 26
  • 80
  • 146
11
votes
2 answers

What is the most correct way to set the encoding in C++?

What is the best way to set the encoding in C++? I got used to working with Unicode (and wchar_t, wstring, wcin, wcout and L" ... "). I also save my source in UTF-8. At the moment I use MinGW (Windows 7) and run my program in the Windows console…
shau-kote
  • 1,110
  • 3
  • 12
  • 24
10
votes
3 answers

UTF conversion functions in C++11

I'm looking for a collection of functions for performing UTF character conversion in C++11. It should include conversion to and from any of utf8, utf16, and utf32. A function for recognizing byte order marks would be helpful, too.
Brent
  • 4,153
  • 4
  • 30
  • 63
10
votes
4 answers

What is a surrogate pair?

I came across this code in a javascript open source project. validator.isLength = function (str, min, max) // match surrogate pairs in string or declare an empty array if none found in string var surrogatePairs =…
Noman Ur Rehman
  • 6,707
  • 3
  • 24
  • 39
9
votes
2 answers

UTF Encoding for Chinese Characters (Java)

I am receiving a String via an object from an axis webservice. Because I'm not getting the string I expected, I did a check by converting the string into bytes and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hexa, when I'm expecting E4BDA0…
Maurice
  • 6,413
  • 13
  • 51
  • 76
9
votes
1 answer

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It will compare each subdomain with sets of characters defined…
Gherman
  • 6,768
  • 10
  • 48
  • 75
9
votes
3 answers

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I am getting a parameter value from the Jersey web service that contains Japanese characters. Here, 'japaneseString' is the web service parameter containing the Japanese characters. String name = new…
Janak
  • 4,986
  • 4
  • 27
  • 45
9
votes
2 answers

Difference between UTF encodings?

I have a simple question: what is the difference between UTF-8, UTF-16 and UTF-32? I know that encoded strings have different sizes, but what are UTF-16 and UTF-32 for? Shouldn't UTF-8 be able to handle all languages correctly? And how does UTF-7…
Petr Mensik
  • 26,874
  • 17
  • 90
  • 115
8
votes
4 answers

jsp utf encoding

I'm having a hard time figuring out how to handle this problem: I'm developing a web tool for an Italian university, and I have to display words with accents (such as è, ù, ...); sometimes I get these words from a PostgreSql table (UTF8-encoded),…
nicolamontecchio
  • 103
  • 1
  • 1
  • 6