Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract CodePoints and their interactions. It also defines multiple Unicode Transformation Formats (UTFs) for storage and exchange of those CodePoints. All of them can express all valid Unicode CodePoints, though they differ in size, compatibility, efficiency, and ability to express invalid data.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings and is an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
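A few of the properties listed above can be checked with Python's built-in codecs. This is only a minimal sketch: the sample strings and variable names are illustrative, and Python's `idna` codec implements the older IDNA 2003 rules, which is an assumption about what a given application needs.

```python
# Pure-ASCII text has identical bytes in ASCII and UTF-8 (ASCII superset).
ascii_text = "plain text"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII characters become multi-byte sequences with the high bit set.
utf8_bytes = "ñ".encode("utf-8")          # b'\xc3\xb1'

# Python's built-in "idna" codec applies Punycode to each domain label.
idn = "bücher.example".encode("idna")     # b'xn--bcher-kva.example'

# UTF-7 keeps every byte below 0x80, for channels that are not 8-bit clean.
utf7_bytes = "ñ".encode("utf-7")
seven_bit = all(b < 0x80 for b in utf7_bytes)
```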

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: Early adopters who embraced UCS-2 when people thought 64K CodePoints were enough moved to this encoding. Apart from orphaned surrogates, bad UTF-8 or UTF-32 sequences cannot be encoded as UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): This is the one-CodeUnit-per-CodePoint encoding. Because combining CodePoints negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
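The size and endianness points above can be illustrated with a short Python sketch. The sample characters and the helper name `unit_sizes` are made up for illustration; the codecs themselves are part of the standard library.

```python
import codecs
import unicodedata

samples = "a\u00e9\u20ac\U0001F600"  # a, é, €, and one CodePoint outside the BMP

def unit_sizes(text, codec):
    """Return the number of bytes each CodePoint needs in the given codec."""
    return [len(ch.encode(codec)) for ch in text]

utf8_sizes = unit_sizes(samples, "utf-8")       # [1, 2, 3, 4]
utf16_sizes = unit_sizes(samples, "utf-16-le")  # [2, 2, 2, 4] (surrogate pair at the end)
utf32_sizes = unit_sizes(samples, "utf-32-le")  # [4, 4, 4, 4]

# The three variants: explicit big-endian, explicit little-endian, and BOM-prefixed.
big = "A".encode("utf-16-be")      # b'\x00A'
little = "A".encode("utf-16-le")   # b'A\x00'
with_bom = "A".encode("utf-16")    # BOM followed by native byte order
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Combining CodePoints: two CodePoints, one user-perceived character,
# so even UTF-32 is not "one unit per character" in practice.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```

Here `decomposed` occupies two UTF-32 code units while `composed` occupies one, although both render as the same character.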

857 questions
8
votes
1 answer

Char to UTF code in vbscript

I'd like to create a .properties file to be used in a Java program from a VBScript. I'm going to use some strings in languages that use characters outside the ASCII map. So, I need to replace these characters with their UTF codes. This would be \u0061…
Carlos Blanco
8
votes
1 answer

Git cant diff or merge .cs file in utf-16 encoding

A friend and I were working on the same .cs file at the same time, and when there's a merge conflict git points out there's a conflict, but the file isn't loaded with the usual "HEAD" ">>>" stuff because the .cs files were binary files. So we added…
user1879789
8
votes
4 answers

Spanish characters in Android Studio

I've got a problem with Android Studio: I'm trying to develop an application, but characters like "¿" or "ñ" and "á, é, ó, í, ú" don't appear correctly when I run the application. I've tried to solve the problem by changing the encoding to UTF-8 but…
Dv Apps
8
votes
2 answers

Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?

I spent a few angry hours looking for the problem with Unicode strings that was broken down to something that Python (2.7) hides from me and I still don't understand. First, I tried to use u".." strings consistently in my code, but that resulted in…
Aleksandar Savkov
7
votes
1 answer

MSBuild.exe output encoding

I use MSBuild.exe to build a solution on a machine set to the Russian language, but in the TeamCity build log all Russian characters appear in the wrong encoding. How do I set up MSBuild.exe to produce proper output (UTF-8, for example)?
Dmitriy Kudinov
7
votes
2 answers

Reading UTF-8 with BOM in ruby 2.5.0

Is there a way to read files encoded in UTF-8 with BOM (byte order mark) on Ruby v2.5.0? On Ruby 2.3.1 this used to work: csv = CSV.open(file_path, encoding: 'bom|utf-8') However, on 2.5.0 the following error occurs: ArgumentError: unknown…
romeu.hcf
7
votes
5 answers

UTF usage in C++ code

What is the difference between UTF and UCS? What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for: internal representation inside the code; for string manipulation…
Martin York
7
votes
4 answers

Do I need supplementary plane?

I think the question is pretty simple, do I need all the rest of the stuff in Unicode after the basic plane? What kind of stuff is included and is that really needed? (and for what purposes?) Thanks.
Tower
6
votes
2 answers

How to convert circled numbers to numbers ? (① to 1)

I would like to convert numbers from a string I receive after an OCR recognition over Japanese text. For example, when I extract a date: ③① 年 ⑫ 月 ①③ 日 I would like to convert it to: 31 年 12 月 13 日 What would be the best way to achieve it ?
Jonathan Muller
6
votes
3 answers

Persist UTF-8 as Default Encoding

I tried to persist UTF-8 as the default encoding in Python. I tried: >>> import sys >>> sys.getdefaultencoding() 'ascii' And I also tried: >>> import sys >>> reload(sys) >>> sys.setdefaultencoding('UTF8') >>>…
DenCowboy
6
votes
4 answers

PHP MySQL database strange characters

I'm trying to output product information stored in a MySQL database, but it's writing out some strange characters, like a diamond with a question mark inside of it. I think it may be an encoding/UTF8 issue, but I've specified the encoding I…
user231733
5
votes
1 answer

how strings are stored by python in computers?

I believe most of you who are familiar with Python have read Dive Into Python 3. In chapter 4.3, it says this: In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python…
endless
5
votes
1 answer

Response.WriteFile() Strange characters issue

Hello in my aspx page using MVC 3, I have the following code: <%Response.WriteFile("/Content/Bing.htm"); %> Which is an include file that contains BING search box code. At the top of the containing DIV, a strange character is appearing:  I…
Cyberdrew
5
votes
3 answers

idn_to_ascii() in 5.2.17

There's a very handy function idn_to_ascii() in PHP 5.3, but I'm running 5.2.17 and I can't change that. How do I encode Unicode domain names to ascii then?
donk
5
votes
2 answers

Syllabification of Devanagari

I am trying to syllabify devanagari words धर्मक्षेत्रे -> धर् मक् षेत् रे dharmakeshetre -> dhar mak shet re wd.split('्') I get the result as : ['धर', 'मक', 'षेत', 'रे'] Which is partially correct I try another word कुरुक्षेत्र -> कु रुक् षेत्…
Echchama Nayak