Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings and is also an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
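
A minimal Python sketch of a few claims in the list above, using only standard-library codecs (`utf-8`, `idna`, `utf-7`, `gb18030`). The sample strings and the `bücher.example` domain are arbitrary choices for illustration, not taken from the tag description.

```python
# ASCII text is byte-for-byte identical in UTF-8 (UTF-8 is an ASCII superset).
ascii_text = "hello"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII code points take 2 to 4 bytes in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["é", "€", "😀"]}

# Internationalized domain names are carried as Punycode labels ("xn--...").
idn = "bücher.example".encode("idna")

# UTF-7 keeps its output within 7-bit ASCII.
utf7 = "é".encode("utf-7")

# GB18030 can round-trip code points outside the Basic Multilingual Plane.
gb_roundtrip = "😀".encode("gb18030").decode("gb18030")

print(utf8_lengths, idn, utf7, gb_roundtrip)
```

Comparing `len(text.encode(...))` across codecs like this is a quick way to see the size trade-offs for a given kind of text.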

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: early adopters who embraced UCS-2 when people thought 64k code points would be enough moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (strictly speaking, not even UTF-32 is).
  • UTF-32 (identical to UCS-4): the encoding with one code unit per code point. Because combining code points negate this one, questionable, benefit, and because of its huge storage demand, it is seldom used even for internal representation.
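
The points about surrogates, combining code points, and endianness can be sketched in Python as follows. The helper name `code_units` and the sample characters are assumptions made for illustration only.

```python
import unicodedata

def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count code units of `text` in a BOM-free encoding such as 'utf-16-le'."""
    return len(text.encode(encoding)) // unit_bytes

emoji = "😀"  # U+1F600, outside the Basic Multilingual Plane
utf16_units = code_units(emoji, "utf-16-le", 2)  # needs a surrogate pair
utf32_units = code_units(emoji, "utf-32-le", 4)  # one unit per code point

# One user-perceived character can still span several code points,
# so even UTF-32 is not one unit per character.
decomposed = unicodedata.normalize("NFD", "é")  # 'e' + U+0301 combining acute
decomposed_units = code_units(decomposed, "utf-32-le", 4)

# Endianness variants: the explicit BE/LE codecs write no BOM,
# while the generic 'utf-16' codec prepends one.
be = "A".encode("utf-16-be")
le = "A".encode("utf-16-le")
with_bom = "A".encode("utf-16")

print(utf16_units, utf32_units, decomposed_units, be, le, with_bom)
```

The byte order chosen by the generic `utf-16` codec depends on the platform, which is why the BOM is needed to decode its output reliably.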

857 questions
2 votes, 1 answer

PHP utf8 intval

I am reading a UTF8 file and storing data read from file in an array. However, when using that data in mysql queries, I am getting problems. I thought that I will convert all int values explicitly using intval(..) before using them. But…
workwise
2 votes, 1 answer

ARC2 (PHP semantic web library) wrongly double-converts UTF-8 file to UTF-8

Using ARC2, textual data gets corrupted. My RDF input file is in UTF-8. It gets loaded in ARC2, which uses a MySQL backend, through a LOAD query. The MySQL database is in UTF-8 too, as a check with PHPMyAdmin makes sure. However,…
MattiSG
2 votes, 2 answers

Character encoding messing up Perl regex

Short version: here is a minimal failing example: $> echo xóx > /tmp/input $> hex /tmp/input 0x00000000: 78 c3 b3 78 0a $> perl -e 'open F, "<", "/tmp/input" or die $!; while(<F>) { if ($_=~/x(\w)x/) { print…
spraff
2 votes, 2 answers

How do i disable utf-8 escaping in rails to_json output \u2013

JSON is supposed to be parseable with UTF characters included. In particular I'm talking about –. Or as it seems to be getting encoded: \u2013 This is for a JSON API output, and there's no need to be escaping these &'s that are in text…
Ashley Raiteri
2 votes, 3 answers

Python saving string to file. Unicode error

I am extracting data from a Google spreadsheet using Spreadsheet API in Python. I can print every row of my spreadsheet on the commandline with a for loop but some of the text contain symbols e.g. celsius degree symbol(little circle). As I print…
Tyler Durden
2 votes, 2 answers

Convert a string to a 'InvariantCulture'

I have the following string an-ca an-ca If you will look it closely you will see that they are different! To compare 2 string like this I found this solution: if (String.Compare(str1, str2, StringComparison.InvariantCulture) == 0) ... So I have 2…
Yacov
2 votes, 4 answers

should I use utf-8 or utf-16 or utf-32 for my multilingual cms?

Besides the difference in how characters are stored, are there any special characters in any language utf-32 can display and utf-8 cannot?
user796443
2 votes, 2 answers

How to store a UTF-16 character as a string in c#?

How can I print a character whose UTF-16 representation is feff2031? When I try the following I get "?" as the result: String million = "\u2030"; The character I want is "per million". See PER MILLE for more information. UTF-8 (hex) 0xE2 0x80…
tequilaras
2 votes, 3 answers

Problem to process visually identical looking characters (umlauts)

This may probably be a more general issue related to character encoding, but since I came across the issue while coding an outer join of two dataframes, I post it with a Python code example. On the bottom line the question is: why is ö technically…
Madamadam
2 votes, 2 answers

Replace éàçè... with equivalent "eace" In GWT

I tried s=Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""); But it seems that the GWT API doesn't provide such a function. I also tried: s=s.replace("é",e); But it doesn't work either. The scenario is that I am trying to generate…
Momo
2 votes, 1 answer

In Android Ndk programming getting UTF String

As you can see, I get jbyte *str from the UTF string. Does each character of the string then take two jbytes, or one byte? JNIEXPORT jstring JNICALL Java_Prompt_getLine(JNIEnv *env, jobject obj, jstring prompt) { char buf[128]; const jbyte *str; …
smiler
2 votes, 0 answers

Arabic words aren't displayed properly in DrRacket

I work on Arabic-script texts in DrRacket, but the characters stand separate when they should be joined to each other. The second problem is that DrRacket reads them left-to-right, as in Latin script. When I am posting here in order to show how they…
Sandy
2 votes, 1 answer

Check if file/blob object is valid UTF-8

I need a function that can check if a file or blob object is valid UTF-8. I can get the text and check for � characters, but if the string has that character to begin with, the function would mark it as invalid. function isUTF8(blob) { return…
luek baja
2 votes, 2 answers

Maximum number of codepoints in a grapheme cluster

I am using the C++ ICU library. I wish to split a utf-8 string into approximately equal chunks. However, I want the chunks to be demarcated at grapheme cluster boundaries. I do not wish to convert my entire string into utf-16 to do this for both…
2 votes, 2 answers

C++ test for validation UTF-8

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++: TEST(validation, Tests) { std::string str = "hello"; EXPECT_TRUE(validate_utf8(str)); // I need incorrect UTF-8 cases } How…
QuickDzen