Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and how much invalid data they can represent.

UTF-8 (people sometimes write just "UTF" for this encoding) is an ASCII superset. Its bit layout can also represent values that are invalid in the other encodings, such as lone surrogates, although strict UTF-8 decoders reject those. If there is no compelling compatibility constraint, this encoding is preferred.
Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
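
To make the differences above concrete, here is a minimal Python sketch. It relies only on codecs that ship with CPython ("utf-8", "idna", "utf-7", "gb18030"); the sample strings and variable names are made up for illustration and are not part of the tag description.

```python
# Illustrative sketch only; sample strings and variable names are assumptions.
samples = ["A", "ü", "€", "😀"]

# UTF-8 is variable-width: 1 to 4 bytes per code point.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# ASCII superset: pure ASCII text yields the same bytes in both encodings.
ascii_text = "plain text"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Punycode for an internationalized domain name, via Python's built-in
# "idna" codec (which implements the older IDNA 2003 rules).
idn = "bücher.example".encode("idna")
print(idn)

# UTF-7 output stays within 7 bits, the property it was designed for.
utf7 = "Grüße".encode("utf-7")
seven_bit_only = all(b < 0x80 for b in utf7)

# GB18030 round trip for a short Chinese string.
gb_round_trip = "中文".encode("gb18030").decode("gb18030") == "中文"
```

Note that Python's own `utf-8` codec is strict: it refuses lone surrogates unless an error handler such as `surrogatepass` is requested.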

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

UTF-16 (UTF-16LE): Early adopters who embraced UCS-2, back when 64K code points seemed to be enough, moved to this encoding. Apart from orphaned surrogates, it cannot represent malformed UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (strictly speaking, not even UTF-32 is).
UTF-32 (identical to UCS-4, the modern UCS): This is the encoding with one code unit per code point. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
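
The three variants and the width claims above can be checked with a short Python sketch. It assumes CPython's codec names ("utf-16", "utf-16-le", "utf-16-be" and their "utf-32" counterparts); the sample strings and variable names are illustrative only.

```python
# Illustrative sketch only; sample strings and variable names are assumptions.
text = "hi"

# "utf-16" / "utf-32": native byte order, prefixed with a BOM.
# "-le" / "-be": explicit byte order, no BOM.
sizes = {
    codec: len(text.encode(codec))
    for codec in ("utf-16", "utf-16-le", "utf-16-be",
                  "utf-32", "utf-32-le", "utf-32-be")
}
print(sizes)

# The BOM is emitted even for an empty string.
empty_with_bom = len("".encode("utf-16"))
empty_without_bom = len("".encode("utf-16-le"))

# UTF-16 is not fixed-width: code points beyond the BMP need a surrogate pair.
emoji = "\U0001F600"
surrogate_pair_hex = emoji.encode("utf-16-be").hex()

# UTF-32 uses one 4-byte code unit per code point, but one user-perceived
# character can still span several code points (base letter + combining mark).
composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
utf32_composed = len(composed.encode("utf-32-le"))
utf32_decomposed = len(decomposed.encode("utf-32-le"))

# Space comparison: UTF-16 wins for CJK text, UTF-8 wins for ASCII-heavy text.
cjk_sizes = (len("中文".encode("utf-8")), len("中文".encode("utf-16-le")))
latin_sizes = (len("hello".encode("utf-8")), len("hello".encode("utf-16-le")))
```

Which encoding is smaller depends on the text: markup and other ASCII-heavy content favors UTF-8, while text made up mostly of BMP characters above U+07FF favors UTF-16.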

Resources

Wikipedia on Unicode

857 questions

1 answer

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xbe in position 2: invalid start byte

Do you know how I can fix this problem in PyTorch 1.9? File "main.py", line 138, in main checkpoint = torch.load(args.resume) File "/scratch3/venv/fashcomp/lib/python3.8/site-packages/torch/serialization.py", line 608, in load return…

python utf-8 pytorch pickle utf

asked Aug 20 '21 at 21:00

Mona Jalal

2 answers

How to escape unicode special chars in string and write it to UTF encoded file

What I aim to achieve is to: string like: Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente. convert to: 'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich…

python python-3.x utf unicode-escapes

asked Jul 15 '21 at 10:05

PiWo

0 answers

Detect encoding of HUGE files

In Java, there are a couple of libraries for detecting the encoding of text files, like Google's juniversalchardet and TikaEncodingDetector. For huge files, however, this would take too much time. One approach is to use these libraries on a sample of the…

java csv encoding utf-16 utf

asked Apr 16 '21 at 12:02

Oz Zafar

1 answer

Conversion of Japanese "semi-voice" character

I was trying to compare two spark dataframe which contains Japanese characters and there's some characters that seem the same but actually different to the program, such as プ vs プ If you put them in utf-8 encoder: プ utf-8 = \xE3\x83\x97 プ utf-8 =…

java apache-spark-sql character-encoding cjk utf

asked Oct 09 '20 at 22:40

yihamz

3 answers

Need help understanding UTF encodings

Hello, I have noticed that when I save a text file using UTF-8 encoding (no BOM), I am able to read it perfectly using the UTF-16 encoding in C#. Now this got me a little confused, because UTF-8 only uses 8 bits, right? And UTF-16 takes, well, 16 bits…

c# encoding utf-8 utf-16 utf

asked Jun 11 '11 at 04:14

Delta

0 answers

Read local html files and convert to dataframe with python

I have a local directory on my machine with multiple html files, all with the following naming format > XXXXXXXX_XXXX-XX-XX.html with the X representing numeric characters (the number of numeric characters before the _ varies). I access all the…

python beautifulsoup html-parsing utf

asked Jul 01 '20 at 22:18

Simon

2 answers

R: How to deal with replacement character � that doesn't want to disappear

I have a big data frame main_df with company_names and several variables. Some of the company_names are misspelled, have typos, or need to be changed otherwise. Therefore, I am creating a vector of unique names, using: unique_names <-…

r dataframe special-characters utf

asked Jun 24 '20 at 03:01

questionmark

1 answer

Should we always use xml version="1.0" and encoding="utf-8" in XML of Android?

I have a basic question about XML in Android. Is the line shown at the top of XML files changeable? I mean, can we use, for example, utf-16 or another version of XML in our code?

android xml android-studio encoding utf

asked Apr 06 '20 at 05:31

MMG

0 answers

Why does my text file become unreadable on macOS after opening on WSL Vim?

I have a text file (refs.bib) in my Dropbox that was created using Vim on macOS. I open it on macOS Vim, the banner in the editor gives the details unix | utf-8 | bib and the file is legible. I do not make any changes and exit Vim. I then open the…

unix vim encoding windows-subsystem-for-linux utf

asked Mar 28 '20 at 15:19

rorty

1 answer

Django Decoding UTF characters - \\u0411\\u0435\\u0441\\u0435\\u0434\\u043a\\u0430 - to Cyrillic strings

I am using Django 1.3. Would you be so kind and answer me one question. I am reading data from my database, where encoding is set to untf8-unicode settings.py DEFAULT_CHARSET = 'utf-8' file.py # -*- coding: utf-8 -*- def get_gift(gift_id): gift…

django decoding utf

asked May 17 '11 at 08:56

Roman

0 answers

Merging xfdf into template pdf without losing some special characters (eg. ő,Ű,č)

I have an xfdf file, which is utf8 and may contain non-ASCII characters. I would like to merge it with the pdf that contains the form. I tried with pdftk, and although the merge works, in the sense that all fields are populated, some…

pdf character-encoding utf pdftk xfdf

asked Dec 11 '19 at 15:42

user5473535

5 answers

Java UTF-16 Encoding code

The function that encodes a Unicode Code Point (Integer) to a char array (Bytes) in java is basically this: return new char[] { (char) codePoint }; Which is just a cast from the integer value to a char. I would like to know how this cast is…

java encoding character-encoding utf-16 utf

asked May 03 '11 at 20:22

skiforfun

1 answer

Why is an empty string '' encoded into 2 bytes in utf-16 but 0 bytes in utf-8 or ascii?

I was just learning about encoding strings in Python and, after fiddling with it a little, I got confused by the fact that the size of an empty string ('') is 0 in UTF-8 and ASCII but somehow 2 in UTF-16. How come? print(len(''.encode('utf16'))) #…

python python-3.x utf-8 utf-16 utf

asked May 14 '19 at 02:01

Wooyoung Cho

1 answer

Comparing gender emojis in UTF-16

I made a program that reads an input string, checks whether it is a certain emoji, and returns a number depending on which emoji it is. The problem comes with emojis that have different genders. For example, the policeman emoji doesn't get detected.…

emoji utf-16 utf

asked Feb 22 '19 at 17:35

Jaime Fernández

1 answer

Why UTF-8 encoding does not use bytes of the form 11111xxx as the first byte?

According to https://en.wikipedia.org/wiki/UTF-8, the first byte of the encoding of a character never starts with either of the bit patterns 10xxxxxx or 11111xxx. The reason for the first one is obvious: auto-synchronization. But how about the second?…

utf-8 utf

asked Feb 22 '19 at 16:17

Junekey Jeon
