Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for the storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings and is also an ASCII superset. If there is no compelling compatibility constraint, this is the preferred encoding.
  • Punycode: used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7: designed for systems that are not 8-bit clean, such as old email, but it never gained much popularity even there.
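
The variable-width and ASCII-superset properties of UTF-8 mentioned in this list can be checked with a short Python sketch. It relies only on Python's built-in `utf-8` codec; the helper name `utf8_length` and the sample characters are chosen here for illustration and are not part of the tag description.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes Python's UTF-8 codec produces for one code point."""
    return len(ch.encode("utf-8"))

# Sample code points of increasing magnitude (illustrative choice).
samples = ["A", "\u00f8", "\u3060", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({utf8_length(ch)} bytes)")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
print("plain ascii".encode("ascii") == "plain ascii".encode("utf-8"))
```

Larger code points need more bytes, from one byte for ASCII up to four bytes for code points beyond the Basic Multilingual Plane.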

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16 (UCS-2): early adopters who embraced UCS-2 when people thought 64K code points would be enough moved to this encoding. Apart from orphaned surrogates, bad UTF-8 or UTF-32 sequences cannot be encoded as UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): the encoding with one code unit per code point. Because combining code points negate this already questionable benefit, and because of its large storage demand, it is seldom used even for internal representation.
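
As a rough illustration of the three byte-order variants and of the code-unit counts discussed in this list, here is a minimal Python sketch. It uses Python's standard codecs (`utf-16-be`, `utf-16-le`, `utf-16`, `utf-32-le`); the helper `code_units` and the sample strings are assumptions made for this example.

```python
import codecs

def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count code units of `text` in a BOM-less encoding such as 'utf-16-le'."""
    return len(text.encode(encoding)) // unit_size

emoji = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# The same code point in the three UTF-16 variants.
print(emoji.encode("utf-16-be").hex(" "))  # big-endian, no BOM
print(emoji.encode("utf-16-le").hex(" "))  # little-endian, no BOM
with_bom = emoji.encode("utf-16")          # platform byte order, BOM prepended
print(with_bom.startswith(codecs.BOM_UTF16))

# UTF-16 needs a surrogate pair here; UTF-32 needs a single code unit.
print(code_units(emoji, "utf-16-le", 2))
print(code_units(emoji, "utf-32-le", 4))

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one character
# but is still two code points, hence two UTF-32 code units.
print(code_units("e\u0301", "utf-32-le", 4))
```

The last line shows why one code unit per code point does not give one unit per displayed character.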

857 questions
4
votes
1 answer

How can I decode this string in python?

I downloaded a dataset of Facebook messages and it was formatted like this: f\u00c3\u00b8rste student. It's supposed to be første student, but I can't seem to decode it correctly. I tried: str = 'f\u00c3\u00b8rste student'; print(str) # 'første…
vhflat
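
The excerpt above is truncated, so the following is only a sketch under an assumption: that the text was UTF-8 whose bytes were at some point decoded as Latin-1, a common cause of this symptom. The helper name `repair_mojibake` is hypothetical and does not come from the question.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Latin-1' (hypothetical helper)."""
    # Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
    # so encoding recovers the original bytes, which are then decoded as UTF-8.
    return text.encode("latin-1").decode("utf-8")

garbled = "f\u00c3\u00b8rste student"
print(repair_mojibake(garbled))
```

If the input was not produced this way, the call raises `UnicodeEncodeError` or `UnicodeDecodeError` instead of returning altered text.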
4
votes
2 answers

Node js Convert from utf-8

I have product names in MySQL, but some names contain characters such as Ö, Ə, Ü. I have to convert these characters to O, E, U and write them to the JPEG file name. I tried to use the utf8 package, but it converts the text to üzlük, for example. How can I do this?
user8283671
4
votes
1 answer

Java or Scala. How to convert characters like \x22 into String

I have a string that looks like this: {\x22documentReferer\x22:\x22http:\x5C/\x5C/pikabu.ru\x5C/freshitems.php\x22} How can I convert this into readable JSON? I've found several slow solutions, like the one here using regex. Have already…
Artem
4
votes
2 answers

Detect charset of file dynamically in c++

I am trying to read a file which may have any charset/code page, but I don't know which locale to set in order to read the file correctly. Below is my code snippet, in which I am trying to read a file whose charset is windows-1256, but I want to get the…
Saurabh Kathpalia
4
votes
2 answers

javax mail: UTF-8 encoding issue

I have seen several questions about this, but none have solved my problem. I have a Chinese email with a pdf attachment. All the text is valid UTF-8 until it is included in the MultiPart email. Problem: The text in the email is garbage characters…
Jake
4
votes
3 answers

Convert Unicode code points to UTF-8 and UTF-32

I can't think of a way to remove the leading zeros. My goal was in a for loop to then create the UTF-8 and UTF-32 versions of each number. For example, with UTF-8 wouldn't I have to remove the leading zeros? Does anyone have a solution for how to…
Joe Caraccio
4
votes
1 answer

Character showing up as diamond question mark only at end of line (Python>Text)

I'm working on a Python file that inputs a text file with Japanese characters (UTF-8) in it, takes some of the text, and writes it into a new UTF-8 text file. The problem I'm coming across is that for some reason whenever the Japanese character だ…
user3597545
4
votes
1 answer

Is php trim mb safe

I know that there is no mb_trim version of trim. I have links to a dozen articles on how to implement one using preg_replace. My question is: is the usual trim with default chars multibyte-safe? That is, is there any example of multibyte…
loshad vtapkah
4
votes
3 answers

Python 2.7: what do the special characters mean in the UTF-32 encoding output of a unicode string?

I was playing around with Python's unicode and encoding methods. I used the special character "‽" and a Chinese character to see how the different UTF encodings deal with these characters, and I got this output. >>> a = u"‽" >>> encoded_a =…
David Zheng
4
votes
1 answer

How to decode a string in PowerShell

I have a file with a string like this: \u0440\u043e. How can I decode this string in PowerShell?
4
votes
1 answer

Iconv is converting to UTF-16 instead of UTF-8 when invoked from powershell

I have a problem while trying to batch convert the encoding of some files from ISO-8859-1 to UTF-8 using iconv in a powershell script. I have this bat file, that works ok: for %%f in (*.txt) do ( echo %%f C:\"Program…
fdediego
4
votes
3 answers

Writing on text file, accents and special characters not displaying correctly

Here's what I'm doing: I'm crawling a website for my personal use, copying the text to save the chapters of a book in text format, and then converting them to PDF automatically with another program to put them in my cloud. Everything is fine until…
Seraf
4
votes
2 answers

Android count characters used by emojis

I am trying to get the number of characters the emojis in my EditText have used up. The reason for this is my EditText has a maxLength of 25 chars. I have looked at other examples of getting the count such as:…
Gooner
4
votes
2 answers

delphi vs c# post returns different strings - utf problem?

I'm posting two forms, one in C# and one in Delphi, but the result strings seem to be different. C# returns: ¤@@1@@@@1@@@@1@@xśm˱Â0Đ... Delphi returns: #$1E'@@1@@@@1@@@@1@@x'#$009C... and since both are compressed streams I'm getting errors while…
argh
4
votes
5 answers

MySql UTF encoding

java.sql.SQLException: Incorrect string value: '\xAC\xED\x00\x05sr...' for column 'xxxx' The column is a longtext in MySQL with utf8 charset and utf8_general_ci collation. What is wrong?
user121196