Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and how much invalid data they can represent.

  • UTF-8 (people sometimes write just "UTF" for this encoding) is an ASCII superset and can also encode all valid and invalid sequences of the other encodings. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
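
As a rough illustration of the points in the list above, the following Python sketch uses only the standard library codecs (`utf-8`, `punycode`, `idna`, `utf-7`). The sample strings are arbitrary and chosen for this example, not taken from the tag description.

```python
# Illustrative sample strings; any text would do.
ascii_text = "plain"
name = "b\u00fccher"  # "bücher"

# UTF-8 is an ASCII superset: pure ASCII text yields identical bytes.
ascii_identical = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII code points become multi-byte sequences in UTF-8.
utf8_bytes = name.encode("utf-8")

# Punycode, and the IDNA form used in domain names (adds the "xn--" prefix).
puny = name.encode("punycode")   # b'bcher-kva'
idna = name.encode("idna")       # b'xn--bcher-kva'

# UTF-7 keeps every output byte in the 7-bit range.
utf7 = name.encode("utf-7")
seven_bit_only = all(b < 0x80 for b in utf7)

print(ascii_identical, utf8_bytes, puny, idna, utf7, seven_bit_only)
```

The Punycode and IDNA results differ only by the `xn--` prefix, which is what marks an encoded label inside a domain name.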

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: Early adopters who embraced UCS-2 when people thought 64k code points would be enough moved to this encoding. Apart from orphaned surrogates, bad UTF-8 or UTF-32 sequences cannot be encoded as UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed width (strictly speaking, not even UTF-32 is, once combining sequences are considered).
  • UTF-32 (identical to UCS-4): This is the encoding with one code unit per code point. Because combining code points negate that already questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
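
The size, byte-order and fixed-width remarks above can be checked with a small Python sketch. It assumes only the standard library; the sample text and variable names are made up for illustration.

```python
import codecs

# U+0061, U+20AC, U+1F600 (the last one lies outside the Basic Multilingual Plane).
text = "a\u20ac\U0001F600"

# Bytes needed per encoding (explicit endianness, so no BOM is written).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Same code point (U+20AC), two byte orders.
euro_be = "\u20ac".encode("utf-16-be")
euro_le = "\u20ac".encode("utf-16-le")

# The endianness-agnostic variant prepends a BOM so readers can detect byte order.
euro_bom = "\u20ac".encode("utf-16")
has_bom = euro_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# UTF-16 is variable width: U+1F600 needs a surrogate pair (two 16-bit code units).
pair = "\U0001F600".encode("utf-16-be")

# UTF-32 is one code unit per code point, but "e" + combining acute accent
# is still two code points for what a reader sees as one character.
combined = "e\u0301"
utf32_len = len(combined.encode("utf-32-le"))

# A lone surrogate is not a valid scalar value; Python's UTF-8 codec rejects it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

print(sizes, euro_be.hex(), euro_le.hex(), has_bom, pair.hex(), utf32_len, lone_surrogate_ok)
```

For this sample, UTF-8 and UTF-16 need the same number of bytes and UTF-32 needs the most, which matches the claim that UTF-16 is rarely the more compact choice.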

857 questions
5
votes
1 answer

can xsd schema validate encoding, e.g. UTF-8, possible?

Using a schema, is there any simple way to validate the encoding of an XML message, assuming the first line of the XML is not trustworthy? E.g., ignoring <?xml version="1.0" encoding="UTF-8" ?>
lee
  • 71
  • 1
  • 7
5
votes
3 answers

Why is it necessary to mark continuation bytes in UTF-8?

I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 specifies the first two bits of every continuation byte to be 10. Range | Encoding -----------------+----------------- 0 - 7f …
crb233
  • 222
  • 1
  • 8
5
votes
4 answers

Delphi: Encoding Strings as Python do

I want to encode strings the way Python does. The Python code is: def EncodeToUTF(inputstr): uns = inputstr.decode('iso-8859-2') utfs = uns.encode('utf-8') return utfs This is very simple. But in Delphi I don't understand how to encode, to force…
durumdara
  • 3,411
  • 4
  • 43
  • 71
5
votes
2 answers

How to use AsynchronousFileChannel to read to a StringBuffer efficiently

So you know you can use AsynchronousFileChannel to read an entire file to a String: AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ); long len = fileChannel.size(); …
gotch4
  • 13,093
  • 29
  • 107
  • 170
5
votes
1 answer

fatal error: high- and low-surrogate code points are not valid Unicode scalar values

Sometimes initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values What is this error, why does it occur and how can I prevent it in…
Vatsal Manot
  • 17,695
  • 9
  • 44
  • 80
5
votes
3 answers

how to determine text encoding

I know a UTF file has a BOM for determining its encoding, but what about other encodings that give no clue? How can I guess those? I am a new Java programmer. I have written code for guessing UTF encodings using the BOM, but I have a problem with other…
paraguma
  • 187
  • 1
  • 1
  • 7
5
votes
2 answers

JavaScript - match non-ascii symbols using regex

I want to match all mentioned users in comment. Example: var comment = '@Agneš, @Petar, please take a look at this'; var mentionedUsers = comment.match(/@\w+/g); console.log(mentionedUsers) I'm expecting ["@Agneš", "@Petar"] but getting…
Limon Monte
  • 52,539
  • 45
  • 182
  • 213
5
votes
1 answer

How do I generate keyboard events that don't have key code in Java?

I'm using the Robot class and KeyEvent key codes to generate all the other key events and they work fine, but I also need the Hangul key (toggle Korean keyboard). Apparently KeyEvent does not have a key code for this key, so I'm stuck :( Is there a way to…
Jade
  • 51
  • 2
5
votes
1 answer

java.io.UnsupportedEncodingException: unicode-1-1-utf-7?

Looks like OpenJDK can't handle unicode-1-1-utf-7? How can we remedy that? Caused by: java.io.UnsupportedEncodingException: unicode-1-1-utf-7 at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) at…
Saqib Ali
  • 3,953
  • 10
  • 55
  • 100
5
votes
0 answers

Dealing with the null character in a 'text' field with Rails and Postgresql

I recently went through a round of testing with various inputs on my rails application, and I've discovered a problem with how null characters in incoming requests are handled. My application is backed by a Postgresql 8.4 database. It turns out that…
Belly
  • 115
  • 6
4
votes
2 answers

Perl + Unicode: "Wide Strings" error

I am running Active Perl 5.14 on Windows 7. I am trying to write a program that will read-in a conversion table, then work on a file and replace certain patterns by other patterns - all of the above in Unicode (UTF-8). Here is the beginning of the…
Helen Craigman
  • 1,443
  • 3
  • 16
  • 25
4
votes
3 answers

How to set the file.encoding property at exec-maven-plugin?

I'm trying to run my standalone application via exec-maven-plugin, but it starts with the Windows encoding, not UTF-8. I read about the Java command-line option -Dfile.encoding=UTF-8. How do I set this property for my application? Thanks. Maven pom:
4
votes
1 answer

what is filtering invalid utf8 from my PHP website?

My website is fully converted to use UTF-8 (MySQL, HTTP headers, PHP mb_string, etc.). I'm doing some penetration testing and trying to POST invalid UTF to one of the scripts (using BurpSuite). But when I post the invalid UTF, and just hex-dump the…
carpii
  • 1,917
  • 4
  • 20
  • 24
4
votes
2 answers

Boost libraries for UTF-16 strings?

Are there any boost libraries to help with UTF-16 (or higher) strings?
Paul Manta
  • 30,618
  • 31
  • 128
  • 208
4
votes
1 answer

Transforming unicode characters to a string containing their u+[hexa] representation ("\u2030")

I am working with Java 8 and I18N. From my understanding, the .properties files (and subsequent I18N code) assume that the files are in the "ISO-8859-1" file format. Thus I'm having trouble with characters that cannot be represented in that file…
Kalec
  • 2,681
  • 9
  • 30
  • 49