Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, efficiency, and ability to represent invalid data.

UTF-8 (people sometimes write just UTF for this encoding) can encode all valid and invalid sequences of the other encodings, and is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
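
A few of the properties listed above can be checked with Python's built-in codecs. This is only a minimal sketch: the sample strings are arbitrary illustrations, and the `idna` codec shown here follows the older IDNA 2003 rules.

```python
# ASCII text encodes to identical bytes in ASCII and in UTF-8 (ASCII superset).
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character becomes a multi-byte sequence in UTF-8.
utf8_bytes = "ü".encode("utf-8")

# Punycode represents a non-ASCII label using ASCII only.
# "bücher" is just an illustrative label, not taken from the text above.
puny = "bücher".encode("punycode")

# The idna codec applies Punycode and adds the "xn--" prefix used in domain names.
idna_label = "bücher".encode("idna")

print(same_bytes, utf8_bytes, puny, idna_label)
```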

The following encodings each come in three variants: big-endian, little-endian, and either byte order announced by a BOM.

UTF-16 (UTF-16LE): Early adopters who embraced UCS-2, back when 64K code points were thought to be enough, moved to this encoding. Apart from orphaned surrogates, invalid UTF-8 or UTF-32 sequences cannot be represented in UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (not even UTF-32 really is).
UTF-32 (identical to UCS-4, aka modern UCS): This is the encoding with one code unit per code point. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
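
The size and byte-order differences described above can be seen with a short Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not part of the tag description.

```python
import codecs

def code_units(text):
    """Count the code units needed for `text` in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),             # 1-byte code units
        "utf-16": len(text.encode("utf-16-le")) // 2,   # 2-byte code units
        "utf-32": len(text.encode("utf-32-le")) // 4,   # 4-byte code units
    }

# ASCII letter, Latin-1 letter, Japanese kana, emoji outside the BMP.
for sample in ["A", "\u00f8", "\u3060", "\U0001F600"]:
    print(repr(sample), code_units(sample))

# "e" plus a combining acute accent: one visible character, two code points,
# so even UTF-32 needs two code units for it.
print(code_units("e\u0301"))

# The three byte-order variants of UTF-16.
be = "A".encode("utf-16-be")      # explicit big-endian, no BOM
le = "A".encode("utf-16-le")      # explicit little-endian, no BOM
with_bom = "A".encode("utf-16")   # BOM first, then platform-native byte order
print(be, le, with_bom)
```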

Resources

Wikipedia on Unicode

857 questions

1 answer

How can I decode this string in python?

I downloaded a dataset of Facebook messages and it was formatted like this: f\u00c3\u00b8rste student It's supposed to be første student but I can't seem to decode it correctly. I tried: str = 'f\u00c3\u00b8rste student' print(str) # 'fÃ¸rste…

python unicode utf

asked Dec 03 '18 at 21:50

vhflat
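
The string in the question above looks like UTF-8 bytes that were decoded as Latin-1 somewhere along the way. A minimal sketch, assuming that is what happened to this dataset (the excerpt does not confirm it); `fix_mojibake` is a hypothetical helper name:

```python
def fix_mojibake(text):
    """Recover the original bytes via Latin-1, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")

broken = "f\u00c3\u00b8rste student"
print(fix_mojibake(broken))  # prints: første student
```

If the text contains characters outside Latin-1, or was not double-decoded in this way, the call raises a UnicodeError instead of silently producing wrong output.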

2 answers

Node js Convert from utf-8

I have product names in MySQL, but some of the names contain Ö Ə Ü etc. I have to convert these chars to O E U and write them to the JPEG file name. I tried to use the utf8 package, but it converts to Ã¼zlÃ¼k, for example. How can I do this?

node.js decode encode fs utf

asked Mar 28 '18 at 06:46

user8283671

1 answer

Java or Scala. How to convert characters like \x22 into String

I have a string that looks like this: {\x22documentReferer\x22:\x22http:\x5C/\x5C/pikabu.ru\x5C/freshitems.php\x22} How could I convert this into a readable JSON? I've found different slow solutions like here with regEx Have already…

java json scala decode utf

asked Oct 31 '17 at 09:34

Artem

2 answers

Detect charset of file dynamically in c++

I am trying to read a file which may have any charset/code page, but I don't know which locale to set in order to read the file correctly. Below is my code snippet in which I am trying to read a file having charset as windows-1256, but I want to get the…

c++ unicode character-encoding utf icu

asked May 11 '17 at 12:30

Saurabh Kathpalia

2 answers

javax mail: UTF-8 encoding issue

I have seen several questions about this, but none have solved my problem. I have a Chinese email with a pdf attachment. All the text is valid UTF-8 until it is included in the MultiPart email. Problem: The text in the email is garbage characters…

java jakarta-mail utf

asked Feb 24 '17 at 20:24

Jake

3 answers

Convert Unicode code points to UTF-8 and UTF-32

I can't think of a way to remove the leading zeros. My goal was in a for loop to then create the UTF-8 and UTF-32 versions of each number. For example, with UTF-8 wouldn't I have to remove the leading zeros? Does anyone have a solution for how to…

c utf-8 utf

asked Feb 02 '17 at 21:25

Joe Caraccio

1 answer

Character showing up as diamond question mark only at end of line (Python>Text)

I'm working on a Python file that inputs a text file with Japanese characters (UTF-8) in it, takes some of the text, and writes it into a new UTF-8 text file. The problem I'm coming across is that for some reason whenever the Japanese character だ…

python text character utf

asked Jan 23 '17 at 17:29

user3597545

1 answer

Is php trim mb safe

I know that there is no mb_trim version of trim. I have links to a dozen articles on how to implement one using preg_replace. My question is: is the usual trim with default chars mb safe? That is, is there any example of multibyte…

php unicode trim utf mbstring

asked Sep 21 '16 at 12:59

loshad vtapkah

3 answers

Python2.7, what does the special characters mean in the utf-32 encoding output of a unicode string?

I was playing around with Python's unicode and encoding methods. I used the special character "‽" and a Chinese character to see how different UTF encodings deal with these characters, and I got this output. >>> a = u"‽" >>> encoded_a =…

python unicode encoding utf

asked May 18 '16 at 19:30

David Zheng

1 answer

How decode string on PowerShell

I have a file with a string like this: \u0440\u043e How can I decode this string in PowerShell?

powershell utf

asked May 18 '16 at 14:57

Sergey B. Hizof

1 answer

Iconv is converting to UTF-16 instead of UTF-8 when invoked from powershell

I have a problem while trying to batch convert the encoding of some files from ISO-8859-1 to UTF-8 using iconv in a powershell script. I have this bat file, that works ok: for %%f in (*.txt) do ( echo %%f C:\"Program…

encoding powershell iconv utf

asked Aug 30 '10 at 22:37

fdediego

3 answers

Writing on text file, accents and special characters not displaying correctly

Here's what I'm doing, I'm web crawling for my personal use on a website to copy the text and put the chapters of a book on text format and then transform it with another program to pdf automatically to put it in my cloud. Everything is fine until…

python encoding utf-8 web-crawler utf

asked Nov 17 '15 at 16:22

Seraf

2 answers

Android count characters used by emojis

I am trying to get the number of characters the emojis in my EditText have used up. The reason for this is my EditText has a maxLength of 25 chars. I have looked at other examples of getting the count such as:…

android emoji utf

asked Oct 12 '15 at 14:32

Gooner

2 answers

delphi vs c# post returns different strings - utf problem?

I'm posting two forms - one in c# and one in delphi. But the result string seems to be different: c# returns: ¤@@1@@@@1@@@@1@@xśmË±Â0Đ... delphi returns: #$1E'@@1@@@@1@@@@1@@x'#$009C... and since both are compressed streams I'm getting errors while…

c# delphi post delphi-2010 utf

asked Jun 15 '10 at 10:37

argh

5 answers

MySql UTF encoding

java.sql.SQLException: Incorrect string value: '\xAC\xED\x00\x05sr...' for column 'xxxx' The column is a longtext in MYSQL with utf8 charset and utf8_general_ci collation. What is wrong?

mysql character-encoding utf

asked Apr 21 '10 at 23:08

user121196
