Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and how much invalid data they can represent.

utf-8 (people sometimes write just UTF for this encoding) is an ascii superset. Its byte scheme can also carry invalid sequences from the other encodings, such as lone surrogates, although strict UTF-8 decoders reject those. If there is no compelling compatibility constraint, this encoding is preferred.
punycode is used only for internationalized domain names (historical contenders were utf-5 and utf-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
utf-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
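
As a rough illustration of the size and compatibility differences described above, the following Python sketch encodes a few sample strings with the standard library codecs. The sample strings and the helper name `encoded_sizes` are assumptions made for this example, not part of the tag description.

```python
# Compare how many bytes the same text needs in several Unicode encodings.
# The sample strings and the helper name are illustrative assumptions.
samples = {
    "ascii": "hello",
    "latin": "xóx",
    "cjk": "日本語",
}


def encoded_sizes(text):
    """Return the byte length of text in each encoding (explicit byte order, no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for name, text in samples.items():
    print(name, encoded_sizes(text))

# ASCII text is byte-for-byte identical in UTF-8 (the "ascii superset" property).
print("hello".encode("utf-8") == "hello".encode("ascii"))

# GB18030 also covers all of Unicode, so a round trip is lossless.
print("日本語".encode("gb18030").decode("gb18030") == "日本語")
```

Which encoding is smallest depends on the text: the ASCII sample is shortest in UTF-8, while the CJK sample takes fewer bytes in UTF-16.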

The following encodings come in three variants: big-endian, little-endian, and either byte order marked by a leading BOM.

utf-16 (utf-16le) Early adopters who embraced ucs2, back when 64k code points were thought to be enough, moved to this encoding. Apart from orphaned surrogates, one cannot encode bad utf-8 or utf-32 sequences as utf-16. It is also rarely more space-efficient than utf-8, and it is not fixed-width (strictly speaking, not even utf-32 is).
utf-32 (identical to ucs4, aka modern ucs) This is the encoding with one code unit per code point. Because combining code points negate this already questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
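
The byte-order variants, surrogate pairs, and combining code points mentioned above can be observed with Python's built-in codecs. This is a minimal sketch; the helper names `utf16_code_units` and `utf32_code_units` and the chosen sample characters are assumptions made for illustration.

```python
import unicodedata

text = "\U0001F600"  # one code point outside the Basic Multilingual Plane

# Explicit byte orders carry no BOM; the two results differ only in byte order.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# The generic "utf-16" codec prepends a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")


def utf16_code_units(s):
    """Number of 16-bit code units needed for s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


def utf32_code_units(s):
    """Number of 32-bit code units needed for s, i.e. its code point count."""
    return len(s.encode("utf-32-le")) // 4


# UTF-16 needs a surrogate pair here, so it is not fixed-width.
print(utf16_code_units(text), utf32_code_units(text))

# One user-perceived character can still be two code points, even in UTF-32:
# NFD splits "ö" into "o" followed by U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", "ö")
print(utf32_code_units(decomposed))

# A lone surrogate is rejected by the strict UTF-8 codec but can be forced
# through with the "surrogatepass" error handler.
lone = "\ud800"
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
forced = lone.encode("utf-8", "surrogatepass")
```

Counting code units therefore does not give the number of visible characters in either encoding; splitting text at user-perceived boundaries needs grapheme cluster segmentation on top.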

Resources

Wikipedia on Unicode

857 questions

1 answer

PHP utf8 intval

I am reading a UTF-8 file and storing the data read from the file in an array. However, when using that data in MySQL queries, I am getting problems. I thought I would convert all int values explicitly using intval(..) before using them. But…

php utf-8 character-encoding utf

asked Feb 18 '12 at 08:03

workwise

1 answer

ARC2 (PHP semantic web library) wrongly double-converts UTF-8 file to UTF-8

Using ARC2, textual data gets corrupted. My RDF input file is in UTF-8. It gets loaded in ARC2, which uses a MySQL backend, through a LOAD query. The MySQL database is in UTF-8 too, as a check with PHPMyAdmin makes sure. However,…

php mysql encoding rdf utf

asked Feb 17 '12 at 15:44

MattiSG

2 answers

Character encoding messing up Perl regex

Short version: here is a minimal failing example: $> echo xóx > /tmp/input $> hex /tmp/input 0x00000000: 78 c3 b3 78 0a $> perl -e 'open F, "<", "/tmp/input" or die $!; while(<F>) { if ($_=~/x(\w)x/) { print…

html perl utf-8 character-encoding utf

asked Feb 15 '12 at 15:22

spraff

2 answers

How do i disable utf-8 escaping in rails to_json output \u2013

JSON is supposed to be able to be parsed with UTF characters included. In particular I'm talking about –. Or as it seems to be getting encoded: \u2013 This is for a JSON API output, and there's no need to be escaping these &'s that are in text…

ruby-on-rails json escaping utf

asked Jan 13 '12 at 08:45

Ashley Raiteri

3 answers

Python saving string to file. Unicode error

I am extracting data from a Google spreadsheet using the Spreadsheet API in Python. I can print every row of my spreadsheet on the command line with a for loop, but some of the text contains symbols, e.g. the Celsius degree symbol (little circle). As I print…

python unicode ascii utf

asked Dec 17 '11 at 09:00

Tyler Durden

2 answers

Convert a string to a 'InvariantCulture'

I have the following strings: an-ca an-ca. If you look closely you will see that they are different! To compare two strings like this I found this solution: if (String.Compare(str1, str2, StringComparison.InvariantCulture) == 0) ... So I have 2…

c# encoding utf

asked Dec 11 '11 at 13:29

Yacov

4 answers

should I use utf-8 or utf-16 or utf-32 for my multilingual cms?

Besides the difference in how characters are stored, are there any special characters in any language utf-32 can display and utf-8 cannot?

utf

asked Nov 17 '11 at 08:45

user796443

2 answers

How to store a UTF-16 character as a string in c#?

How can I print a character whose UTF-16 representation is feff2031? When I try the following I get "?" as the result: String million = "\u2030"; The character I want is "per million". See PER MILLE for more information. UTF-8 (hex) 0xE2 0x80…

c# utf

asked Nov 14 '11 at 21:17

tequilaras

3 answers

Problem to process visually identical looking characters (umlauts)

This may probably be a more general issue related to character encoding, but since I came across the issue while coding an outer join of two dataframes, I post it with a Python code example. On the bottom line the question is: why is ö technically…

python character-encoding utf

asked Mar 25 '23 at 12:17

Madamadam

2 answers

Replace éàçè... with equivalent "eace" In GWT

I tried s=Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""); but it seems that the GWT API doesn't provide such a function. I also tried s=s.replace("é",e); but it doesn't work either. The scenario is that I am trying to generate…

gwt unicode normalization utf unicode-normalization

asked Sep 21 '11 at 13:35

Momo

1 answer

In Android Ndk programming getting UTF String

As you can see, I get jbyte *str from the UTF string. Does each character of the string then take two jbytes, or one byte? JNIEXPORT jstring JNICALL Java_Prompt_getLine(JNIEnv *env, jobject obj, jstring prompt) { char buf[128]; const jbyte *str; …

android string android-ndk utf

asked Sep 13 '11 at 02:03

smiler

0 answers

Arabic words aren't displayed properly in DrRacket

I work on Arabic-script texts in DrRacket, but the characters stand separate when they should be joined to each other. The second problem is that DrRacket reads them left-to-right, as in Latin script. When I am posting here in order to show how they…

string racket arabic utf text-direction

asked Oct 02 '22 at 11:14

Sandy

1 answer

Check if file/blob object is valid UTF-8

I need a function that can check if a file or blob object is valid UTF-8. I can get the text and check for � characters, but if the string has that character to begin with, the function would mark it as invalid. function isUTF8(blob) { return…

javascript utf-8 utf

asked Aug 13 '22 at 20:24

luek baja

2 answers

Maximum number of codepoints in a grapheme cluster

I am using the C++ ICU library. I wish to split a utf-8 string into approximately equal chunks. However, I want the chunks to be demarcated at grapheme cluster boundaries. I do not wish to convert my entire string into utf-16 to do this for both…

c++ utf icu breakiterator grapheme-cluster

asked Feb 06 '22 at 20:24

Nick Deguillaume

2 answers

C++ test for validation UTF-8

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++: TEST(validation, Tests) { std::string str = "hello"; EXPECT_TRUE(validate_utf8(str)); // I need incorrect UTF-8 cases } How…

c++ testing utf-8 utf

asked Jan 31 '22 at 13:31

QuickDzen
