Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for the storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, efficiency, and their ability to represent invalid data.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences in the other encodings, and is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems, but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
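
To make the size differences concrete, here is a minimal sketch that encodes one string with several of the encodings listed above and compares the byte counts. The sample string and the helper name `encoded_sizes` are assumptions chosen for illustration, and only codecs that ship with Python's standard library are used (UTF-EBCDIC is not among them).

```python
# Encode one sample string with several encodings and compare byte counts.
sample = "Gr\u00fc\u00dfe, \u4e16\u754c"  # hypothetical sample: "Grüße, 世界"


def encoded_sizes(text):
    """Return {codec name: number of bytes} for a few standard-library codecs."""
    codec_names = ["utf-8", "utf-7", "gb18030", "punycode"]
    return {name: len(text.encode(name)) for name in codec_names}


sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name:10} {size} bytes")

# UTF-8 is an ASCII superset: pure ASCII text encodes to identical bytes.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
```

The exact numbers depend on the text: mostly-ASCII input favors UTF-8, while the other codecs trade size for compatibility with a specific environment.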

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: early adopters who embraced UCS-2, back when people thought 64K code points would be enough, moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): this is the encoding with one code unit per code point. Because combining code points negate this only, and questionable, benefit, and because of its huge storage demand, it is seldom used even for internal representation.
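
The endianness variants and the "fixed width" caveat can be seen in a short sketch. The helper `code_units` and the sample strings are assumptions for illustration; the codec names are the ones Python's standard library provides.

```python
import codecs

# U+0061, U+20AC, and U+1F600 (the last one lies outside the BMP).
text = "a\u20ac\U0001F600"


def code_units(s, codec, unit_size):
    """Number of code units s occupies in a BOM-less codec."""
    return len(s.encode(codec)) // unit_size


# Explicit-endian variants carry no BOM and differ only in byte order.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# UTF-16 is variable width: U+1F600 needs a surrogate pair (two code units).
utf16_units = code_units(text, "utf-16-le", 2)
# UTF-32 uses one code unit per code point.
utf32_units = code_units(text, "utf-32-le", 4)
print(len(text), utf16_units, utf32_units)

# One code unit per code point is still not one per perceived character:
combined = "e\u0301"  # "e" + COMBINING ACUTE ACCENT, displayed as one letter
combined_units = code_units(combined, "utf-32-le", 4)
print(combined_units)
```

Here three code points take four UTF-16 code units but three UTF-32 code units, and the single visible letter in `combined` still takes two UTF-32 code units.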

857 questions
3
votes
1 answer

How to check if byte array is valid UTF-8 String

I'm decoding messages with javax.crypto.Cipher and as an output I get byte[]. What is the fastest way to check if my key is correct and byte[] is valid string?
marcelby
  • 83
  • 1
  • 2
  • 5
3
votes
3 answers

Easy way of converting php serialized strings to utf8?

I'm trying to convert a greek database to utf8. At this point, I've figured out how to do it (via MySQL, not through the iconv() function) but I have a problem: The application stores lots of data in the database in php serialized format (via…
Lea Verou
  • 23,618
  • 9
  • 46
  • 48
3
votes
1 answer

Using DataOutputStream in Java gives some Chinese output

I was using DataOutputStream in Java today, but it gave me a Chinese output, that was absolutely NOT what I had expected... Can someone please spot the error in the code? private void generateButtonActionPerformed(java.awt.event.ActionEvent evt) { …
Abhigyan
  • 334
  • 3
  • 13
3
votes
1 answer

Convert csv text from utf-16 to ascii or read in correctly

I have problems while reading text from a csv file. An example line from the csv file looks like this:" 1477-7819-4-45-2 Angiolymphatic Invasion (H & E 400 Ã)." I guess that the problem is the coding of the text, so I decided to change it to…
Jürgen K.
  • 3,427
  • 9
  • 30
  • 66
3
votes
2 answers

Powershell and UTF-8

I have an html file test.html created with atom which contains: Testé encoding utf-8 When I read it with Powershell console (I'm using French Windows) Get-Content -Raw test.html I get back this: Testé encoding utf-8 Why is the accent character…
user310291
  • 36,946
  • 82
  • 271
  • 487
3
votes
1 answer

Perl | Print ASCII, but backslashed other

I want to print the 95 printable ASCII symbols unchanged, but print the codes of all other characters. How can I do this in pure Perl? The 'unpack' function? Any module? print BackSlashed('test folder'); # expected test\040folder print BackSlashed('test тестовая folder'); # expected…
Anton Shevtsov
  • 1,279
  • 4
  • 16
  • 34
3
votes
1 answer

Getting exception on inserting Cyrillic text: Incorrect string value: '\xD1\x82\xD0\xB5\xD1\x81...' for column ' ' at row 1 Error Code: 1366

I have a MySQL table with a column that has these properties: name: message datatype: varchar(300) default/expression: utf8 I am trying to insert new data into the table that contains text in Cyrillic. Here's my SQL query: INSERT INTO db.log…
Lena
  • 99
  • 4
  • 14
3
votes
1 answer

Using boost locale generator correctly

I want to store utf8 characters in my std::strings. For that I used boost::locale conversion routines. In my first test everything works as expected: #include std::string utf8_string = boost::locale::conv::to_utf("Grüssen",…
Reine Elemente
  • 131
  • 1
  • 11
3
votes
0 answers

mysql won't import table as unicode even tho all variables are set to unicode

I have just updated my cnf properties to add the following: init_connect = 'SET collation_connection = utf8_unicode_ci; SET NAMES utf8;' character-set-client = utf8 character-set-server =…
user3299633
  • 2,971
  • 3
  • 24
  • 38
3
votes
1 answer

Strange collation with postgresql

I noticed a strange collation issue with postgresql-9.5 as it was giving different output to a Python script. As I understand it, normally characters are compared one at a time from left to right when sorting: select 'ab' < 'ac'; t select 'abX' <…
EoghanM
  • 25,161
  • 23
  • 90
  • 123
3
votes
1 answer

using special characters in jquery html not showing correctly

I'm using some special characters and they aren't displayed correctly, I don't know why! I defined it like this : var grade_value = 'Wešto izšamešđu'; and later I do this: $('.list').append( " "+ grade_value + " " ); This…
user6145033
3
votes
1 answer

UTF-8 vs UTF8 in XML files

What is the correct UTF8 encoding declaration in XML files? I have seen both. ... or ...
sdc
  • 2,603
  • 1
  • 27
  • 40
3
votes
4 answers

UTF-8 conversion to real letter

I need help with one of my projects. I'm cleaning a large set of data to bulk insert into Microsoft SQL. The data is about 10 million lines, but I created a script to extract just the first 1000 for cleaning, assuming the rest are the same. I noticed…
DavidA
  • 1,809
  • 3
  • 13
  • 12
3
votes
2 answers

error: unmappable character for encoding UTF8 after GIT merge

After yet another git pull my project stopped building with a bunch of messages: error: unmappable character for encoding UTF-8 The messages point to the copyright symbol found in some of the files' headers. There are many more files with the same symbol…
user656449
  • 2,950
  • 2
  • 30
  • 43
3
votes
1 answer

Bits of a Character in Java

As far as I know, a is an 8-bit character and â is a 16-bit character. How can I tell whether a character is 8 bits, 16 bits, or more? Why can't the character â be represented in 8 bits? a and â are just the UI form; what do they look like in bit form? 97 is the code of a, how to…
Hoang Nguyen
  • 61
  • 1
  • 13