Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and ability to represent invalid data.

utf-8 (sometimes written simply as UTF) is an ASCII superset. Its byte scheme can also carry data that is invalid in the other encodings, such as lone surrogates, although strictly conforming UTF-8 forbids this. Unless there is a compelling compatibility constraint, this encoding is preferred.
punycode is used only for internationalized domain names (historical contenders were utf-5 and utf-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
utf-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
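
As a rough illustration of the list above, the following Python sketch round-trips one string through several of these encodings using codecs bundled with CPython (`utf-8`, `utf-7`, `gb18030`, `punycode`). The sample string and the helper name `round_trips` are chosen here purely for illustration and are not part of the tag description.

```python
# Sketch: round-trip a sample string through codecs bundled with CPython.
SAMPLE = "Gr\u00fc\u00dfe, \u4e16\u754c"  # illustrative text chosen for this example

def round_trips(text, codec):
    """Return True if encoding then decoding with `codec` restores `text`."""
    return text.encode(codec).decode(codec) == text

for codec in ("utf-8", "utf-7", "gb18030", "punycode"):
    encoded = SAMPLE.encode(codec)
    print(f"{codec:9} {len(encoded):2} bytes  round-trips: {round_trips(SAMPLE, codec)}")

# ASCII superset: pure ASCII text has identical bytes in ASCII and UTF-8.
ascii_identical = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# UTF-7 output stays within 7-bit bytes, which is what it was designed for.
utf7_is_7bit = all(byte < 0x80 for byte in SAMPLE.encode("utf-7"))
```

The byte counts printed depend on the sample text, so none are quoted here; the point is only that each codec restores the original string.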

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

utf-16 (utf-16le) Early adopters who embraced ucs2, back when 64K code points were thought to be enough, moved to this encoding. Apart from orphaned surrogates, invalid utf-8 or utf-32 sequences cannot be represented in utf-16. It is also rarely more space-efficient than utf-8, and it is not fixed-width (once combining code points are considered, not even utf-32 really is).
utf-32 (identical to ucs4, aka modern ucs) uses one code unit per code point. Combining code points negate this already questionable benefit, and together with its large storage demand this means it is seldom used, even for internal representation.
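
To make the size and endianness remarks concrete, here is a small Python sketch, using only built-in codecs, that counts code units per code point, inspects the BOM written by the endian-agnostic variant, and shows a strict UTF-8 encoder rejecting an orphaned surrogate. The helper name `code_units` and the sample characters are assumptions made for this example.

```python
import codecs

def code_units(text, codec, unit_size):
    """Number of code units `text` occupies in a BOM-less `codec`."""
    return len(text.encode(codec)) // unit_size

# One code point each: ASCII, Latin-1 range, BMP, and beyond the BMP.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", code_units(ch, "utf-8", 1),
        "utf-16:", code_units(ch, "utf-16-le", 2),
        "utf-32:", code_units(ch, "utf-32-le", 4),
    )

# "e" plus a combining acute accent reads as one character but is two code
# points, so even utf-32 needs two code units for it.
combined_units = code_units("e\u0301", "utf-32-le", 4)

# The endian-agnostic codec name prepends a BOM in the platform's byte order.
bom = "a".encode("utf-16")[:2]

# A lone (orphaned) surrogate is not a valid scalar value, so strict
# UTF-8 encoding refuses it ...
lone = "\ud83d"
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... while Python's non-standard "surrogatepass" handler writes it anyway.
lenient = lone.encode("utf-8", errors="surrogatepass")
```

The `surrogatepass` handler is specific to Python; other languages expose similar escape hatches under different names, if at all.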

Resources

Wikipedia on Unicode

857 questions

1 answer

Can an XSD schema validate the encoding, e.g. UTF-8?

Using a schema, is there a simple way to validate the encoding of an XML message, assuming the first line of the XML is not trustworthy, e.g. ignoring <?xml version="1.0" encoding="UTF-8" ?>?

xml utf-8 xsd schema utf

asked Dec 10 '10 at 19:04

lee

3 answers

Why is it necessary to mark continuation bytes in UTF-8?

I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 specifies the first two bits of every continuation byte to be 10. Range | Encoding -----------------+----------------- 0 - 7f …

unicode utf-8 character-encoding utf

asked Aug 12 '16 at 22:11

crb233

4 answers

Delphi: Encoding strings as Python does

I want to encode strings as Python does. The Python code is this: def EncodeToUTF(inputstr): uns = inputstr.decode('iso-8859-2') utfs = uns.encode('utf-8') return utfs This is very simple. But in Delphi I don't understand how to encode, to force…

delphi character-encoding encode utf

asked Sep 07 '10 at 11:27

durumdara

2 answers

How to use AsynchronousFileChannel to read to a StringBuffer efficiently

So you know you can use AsynchronousFileChannel to read an entire file to a String: AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ); long len = fileChannel.size(); …

java nio utf

asked Oct 18 '15 at 12:47

gotch4

1 answer

fatal error: high- and low-surrogate code points are not valid Unicode scalar values

Sometimes, initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values What is this error, why does it occur and how can I prevent it in…

string swift unicode utf-16 utf

asked Aug 22 '15 at 16:34

Vatsal Manot

3 answers

how to determine text encoding

I know that UTF files have a BOM for determining the encoding, but what about other encodings that give no clue for guessing? I am a new Java programmer. I have written code for guessing the UTF encoding using the UTF BOM, but I have a problem with other…

java utf

asked Jul 09 '10 at 10:20

paraguma

2 answers

JavaScript - match non-ascii symbols using regex

I want to match all mentioned users in a comment. Example: var comment = '@Agneš, @Petar, please take a look at this'; var mentionedUsers = comment.match(/@\w+/g); console.log(mentionedUsers) I'm expecting ["@Agneš", "@Petar"] but getting…

javascript regex utf non-ascii-characters

asked Apr 20 '15 at 14:58

Limon Monte

1 answer

How do I generate keyboard events that don't have key code in Java?

I'm using the Robot class and KeyEvent key codes to generate all the other key events and they work fine, but I also need the Hangul key (toggle Korean keyboard). Apparently KeyEvent does not have a key code for this key, so I'm stuck :( Is there a way to…

java keyevent utf

asked Dec 10 '14 at 18:07

Jade

1 answer

java.io.UnsupportedEncodingException: unicode-1-1-utf-7?

Looks like OpenJDK can't handle unicode-1-1-utf-7? How can we remedy that? Caused by: java.io.UnsupportedEncodingException: unicode-1-1-utf-7 at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) at…

encoding character-encoding java utf

asked Nov 08 '13 at 14:55

Saqib Ali

0 answers

Dealing with the null character in a 'text' field with Rails and Postgresql

I recently went through a round of testing with various inputs on my Rails application, and I've discovered a problem with how null characters in incoming requests are handled. My application is backed by a PostgreSQL 8.4 database. It turns out that…

ruby-on-rails utf

asked Jul 17 '12 at 21:52

Belly

2 answers

Perl + Unicode: "Wide Strings" error

I am running ActivePerl 5.14 on Windows 7. I am trying to write a program that will read in a conversion table, then work on a file and replace certain patterns with other patterns, all of the above in Unicode (UTF-8). Here is the beginning of the…

perl unicode utf

asked Feb 15 '12 at 19:29

Helen Craigman

3 answers

How to set the file.encoding property at exec-maven-plugin?

I am trying to run my standalone application via exec-maven-plugin, but it starts with the Windows encoding, not UTF-8. I read about the Java command-line option -Dfile.encoding=UTF-8. How do I set this property for my application? Thanks. Maven pom: …

maven utf exec-maven-plugin

asked Jan 25 '12 at 15:30

Grigorichev Denis

1 answer

What is filtering invalid UTF-8 from my PHP website?

My website is fully converted to use utf-8 (MySQL, HTTP headers, PHP mb_string etc). I'm doing some penetration testing and trying to POST invalid UTF to one of the scripts (using BurpSuite). But when I post the invalid UTF and just hex-dump the…

php utf-8 utf mbstring

asked Oct 24 '11 at 00:39

carpii

2 answers

Boost libraries for UTF-16 strings?

Are there any boost libraries to help with UTF-16 (or higher) strings?

c++ boost utf-16 utf

asked Jun 05 '11 at 10:40

Paul Manta

1 answer

Transforming unicode characters to a string containing their u+[hexa] representation ("\u2030")

I am working with Java 8 and I18N. From my understanding, the .properties files (and subsequent I18N code) assume that the files are in the "ISO-8859-1" file format. Thus I'm having trouble with characters that cannot be represented in that file…

java java-8 internationalization utf

asked Mar 11 '19 at 08:25

Kalec
