Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract CodePoints and their interactions. It also defines multiple UTFs (Unicode Transformation Formats) for storage and exchange of those CodePoints. All of them can express all valid Unicode CodePoints, though they differ in size, compatibility, efficiency, and how well they can express invalid data.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings, and it is an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
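As a rough illustration of the list above, the following Python sketch encodes one sample string with the standard library's `utf-8`, `punycode`, `idna`, and `utf-7` codecs. The sample word and the variable names are chosen for illustration only; the `idna` codec is used here simply as a convenient way to see the Punycode form of a domain label.

```python
# ASCII text is byte-for-byte identical in UTF-8 (UTF-8 is an ASCII superset).
ascii_text = "plain ASCII"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

sample = "bücher"  # hypothetical sample label containing one non-ASCII letter

utf8_bytes = sample.encode("utf-8")         # 'ü' becomes the two bytes C3 BC
punycode_raw = sample.encode("punycode")    # bare Punycode, no "xn--" prefix
punycode_label = sample.encode("idna")      # ASCII-compatible DNS label
utf7_bytes = sample.encode("utf-7")         # uses only 7-bit bytes

print(same_as_ascii)
print(utf8_bytes)
print(punycode_raw)
print(punycode_label)
print(utf7_bytes)
```

All four results are plain `bytes`; only the UTF-8 form contains bytes above 127, which is the property UTF-7 and Punycode were designed to avoid.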

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16 (UCS-2): Early adopters who embraced UCS-2 when people thought 64K code points would be enough later moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): This is the one-CodeUnit-per-CodePoint encoding. Because combining CodePoints negate this one questionable benefit, and because of its large storage demand, it is seldom used even for internal representation.
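The size and byte-order points above can be checked with a small Python sketch. The sample string and variable names are illustrative assumptions; the codec names (`utf-16-be`, `utf-16-le`, `utf-16`, `utf-32-le`) are the standard library's.

```python
import codecs

# U+0061, U+20AC, and U+1F600 (the last one lies outside the 16-bit range).
text = "a\u20ac\U0001F600"

# Code units needed per encoding: 8-bit, 16-bit, and 32-bit units respectively.
utf8_units = len(text.encode("utf-8"))             # 1 + 3 + 4 bytes
utf16_units = len(text.encode("utf-16-le")) // 2   # 1 + 1 + 2 (a surrogate pair)
utf32_units = len(text.encode("utf-32-le")) // 4   # one unit per code point

# Explicit byte order: no BOM is written.
big_endian = "a".encode("utf-16-be")
little_endian = "a".encode("utf-16-le")

# The plain "utf-16" codec writes a BOM followed by native byte order.
with_bom = "a".encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# One user-perceived character can still span several code points,
# so even UTF-32 is not "one unit per character": e + COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
decomposed_units = len(decomposed.encode("utf-32-le")) // 4

print(utf8_units, utf16_units, utf32_units)
print(big_endian, little_endian, has_bom)
print(decomposed_units)
```

Here UTF-16 needs four code units for three code points, and the decomposed "é" needs two UTF-32 units for what a reader sees as one character.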

857 questions
3
votes
0 answers

dealing with raw vector from r getURLContent

I am using getURLContent(url, userpasswrd, httpauth=1L, binary=TRUE) to download a csv file from a server. The data I get after using getURLContent() is a raw vector with contents "4d" "43" "4e" "2c" "45" "4e". I know the dataset should have…
3
votes
3 answers

How do I check that string has only international letters and spaces in UTF8 in PHP?

In Python I could've converted it to Unicode and do '(?u)^[\w ]+$' regex search, but PHP doesn't seem to understand international \w, or does it?
Slava V
3
votes
1 answer

Python terminates process with exit code -1073741819

I am trying to read a csv file (~190MB in size) into a pandas dataframe, but I am getting this error. I am running the Pycharm IDE from JetBrains Process finished with exit code -1073741819 (0xC0000005) The code I am trying to run is below: from…
Nitin Kashyap
3
votes
1 answer

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function. First when I did this, the copy_from() was throwing an error: ERROR: invalid byte…
user3422637
3
votes
1 answer

php validate if string is alphabetic including cyrillic, greek or any unicode letter

I am trying to validate string is alphabetic including multiple character sets: function is_string($str){ return preg_match("/^[a-zA-Z\p{Cyrillic}\p{Cyrillic}]+$/u", $str) ? TRUE : FALSE; } but it fails if string contains some other characters…
Sabri Aziri
3
votes
2 answers

HTML numerical to UTF

How can i convert this $langClarContent = > &# 1059.,ч.,и.,&# 1090.,&# 1077.,&# 1083.,Dokeos &# 1077., &# 1089.,&# 1080.,&# 1089.,&# 1090.,&# 1077.,&# 1084., &# 1079.,&# 1072., &# 1091.,&# 1087.,&# 1088.,&# 1072.,&# 1074.,&# 1091.,&#…
Darwly
3
votes
1 answer

How to read UTF file char by char in Python

I have a UTF-8 file and I want to replace some characters that are 2 bytes with some HTML tags. I wanted to make a Python script for that. Just read the file, char by char, and put some if and so on. The problem that I have is the following: if I read char by…
WebOrCode
3
votes
3 answers

How to convert an ASCII string to an UTF8 string in C++?

How to convert an ASCII std::string to an UTF8 (Unicode) std::string in C++?
user301649
3
votes
1 answer

Python: UTF-8 german special chars

I'm searching for files in a Python script and storing the file paths. The problem is that in some cases there are special chars like ö ä ü inside (UTF-8 Table hex U+00C4 U+00D6 U+00DC etc.) When I print the path with "print" it is shown…
rainer
3
votes
1 answer

C++: String with multiple languages

This is my first attempt at dealing with multiple languages in a program. I would really appreciate if someone could provide me with some study material and how to approach this type of issue. The question is representing a string which has multiple…
madu
3
votes
1 answer

Can Unicode NFC normalization increase the length of a string?

If I apply Unicode Normalization Form C to a string, will the number of code points in the string ever increase?
Daniel Trebbien
3
votes
4 answers

Page with UTF-8 encoding sends data to MySQL with UTF-8 encoding but entry is scrambled

I realize there's a dozen similar questions, but none of the solutions suggested there work in this case. I have a PHP variable on a page, initialized as: $hometeam="Крылья Советов"; //Cyrrilic string When I print it out on the page, it prints…
sveti petar
3
votes
1 answer

how to make JSON.stringify encode UTF characters

I'm writing a JS script that runs using Windows cscript.exe. My JS loads a JSON object from a file, adds a parameter and saves it back to the file (using the json2.min.js implementation). I'm using JSON.parse(text) to parse the text into a JSON object, and then…
user1283002
3
votes
3 answers

UTF-8 issue in linux

String departmentName = request.getParameter("dept_name"); departmentName = new String(departmentName.getBytes(Charset.forName("UTF8")),"UTF8"); System.out.println(departmentName);//O/p: composés In windows, the displayed output is what I expected…
Gagan
3
votes
2 answers

Change Encoding in C#?

Theoretical question: Let's say there is one source which knows only how to transmit ASCII chars (0..127). And let's say there is an endpoint which receives these chars. Can the endpoint decode those chars as UTF-8? ascii chars ... …
Royi Namir