Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can represent all valid sequences of the other encodings and, in its generalized bit pattern, their invalid ones too (such as lone surrogates), although strict decoders reject those. It is also an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
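
As a rough illustration of the properties listed above, here is a minimal Python sketch. It relies only on codecs that ship with CPython (`ascii`, `utf-8`, `punycode`, `idna`, `utf-7`); the sample strings and variable names are made up for the example.

```python
# UTF-8 is an ASCII superset: pure ASCII text yields identical bytes.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Punycode is the building block of internationalized domain names.
# The "idna" codec adds the "xn--" prefix used in DNS labels.
label = "bücher"
puny = label.encode("punycode")
idna = label.encode("idna")

# UTF-7 output stays within 7-bit bytes, which was its design goal.
utf7 = "é".encode("utf-7")
seven_bit_clean = all(byte < 0x80 for byte in utf7)

print(same_bytes)       # True
print(puny, idna)       # b'bcher-kva' b'xn--bcher-kva'
print(seven_bit_clean)  # True
```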

The following encodings each come in three variants: big-endian, little-endian, and either byte order announced by a BOM.

  • UTF-16 (UCS-2): Early adopters who embraced UCS-2, back when people thought 64K code points would be enough, moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): This is the encoding with one code unit per code point. Because combining code points negate this single, questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
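
The size and byte-order points above can be checked with a short Python sketch. It uses only the standard `codecs` module and built-in codec names; the helper `encoded_sizes` and the sample characters are hypothetical, chosen just for illustration.

```python
import codecs

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII letter, Latin-1 letter, BMP symbol, and a character outside the BMP.
for ch in ("A", "\u00fc", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Outside the BMP, UTF-16 needs a surrogate pair (two 16-bit code units).
surrogate_pair = "\U0001F600".encode("utf-16-be")

# The same text differs by byte order; the plain "utf-16" codec prepends a BOM,
# so even an empty string encodes to two bytes.
big_endian = "A".encode("utf-16-be")     # b'\x00A'
little_endian = "A".encode("utf-16-le")  # b'A\x00'
bom_only = "".encode("utf-16")

# One user-perceived character can still be several code points,
# so even UTF-32 is not fixed-width per visible character.
combined = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(combined), len(combined.encode("utf-32-le")))  # 2 8
```

With this sketch, U+0041 takes 1, 2 and 4 bytes in UTF-8, UTF-16 and UTF-32 respectively, while U+1F600 takes 4 bytes in all three.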

857 questions
2
votes
1 answer

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xbe in position 2: invalid start byte

Do you know how I can fix this problem in PyTorch 1.9? File "main.py", line 138, in main checkpoint = torch.load(args.resume) File "/scratch3/venv/fashcomp/lib/python3.8/site-packages/torch/serialization.py", line 608, in load return…
Mona Jalal
  • 34,860
  • 64
  • 239
  • 408
2
votes
2 answers

How to escape unicode special chars in string and write it to UTF encoded file

What I aim to achieve is to: string like: Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente. convert to: 'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich…
PiWo
  • 590
  • 2
  • 8
  • 17
2
votes
0 answers

Detect encoding of HUGE files

In Java, there are a couple of libraries for detecting the encoding of text files, like Google's juniversalchardet and TikaEncodingDetector. For huge files, however, it would take too much time. One approach is to use these libraries on a sample of the…
Oz Zafar
  • 21
  • 3
2
votes
1 answer

Conversion of Japanese "semi-voice" character

I was trying to compare two spark dataframe which contains Japanese characters and there's some characters that seem the same but actually different to the program, such as プ vs プ If you put them in utf-8 encoder: プ utf-8 = \xE3\x83\x97 プ utf-8 =…
yihamz
  • 29
  • 4
2
votes
3 answers

Need help understanding UTF encodings

Hallo, I have noticed that when I save a text file using UTF-8 encoding (no BOM), I am able to read it perfectly using the UTF-16 encoding on C#. Now this got me a little confused cause UTF-8 only uses 8 bits, right? And utf-16 takes, well, 16 bits…
Delta
  • 4,308
  • 2
  • 29
  • 37
2
votes
0 answers

Read local html files and convert to dataframe with python

I have a local directory on my machine with multiple html files, all with the following naming format > XXXXXXXX_XXXX-XX-XX.html with the X representing numeric characters (the number of numeric characters before the _ varies). I access all the…
Simon
  • 21
  • 6
2
votes
2 answers

R: How to deal with replacement character � that doesn't want to disappear

I have a big data frame main_df with company_names and several variables. Some of the company_names are misspelled, have typos, or need to be changed otherwise. Therefore, I am creating a vector of unique names, using: unique_names <-…
questionmark
  • 335
  • 1
  • 13
2
votes
1 answer

Should we always use xml version="1.0" and encoding="utf-8" in XML of Android?

I have a basic question about XML in Android. Is the line shown at the top of XML files changeable? I mean, can we use, for example, utf-16 or another version of XML in our code?
MMG
  • 3,226
  • 5
  • 16
  • 43
2
votes
0 answers

Why does my text file become unreadable on macOS after opening on WSL Vim?

I have a text file (refs.bib) in my Dropbox that was created using Vim on macOS. I open it on macOS Vim, the banner in the editor gives the details unix | utf-8 | bib and the file is legible. I do not make any changes and exit Vim. I then open the…
rorty
  • 123
  • 3
2
votes
1 answer

Django Decoding UTF characters - \\u0411\\u0435\\u0441\\u0435\\u0434\\u043a\\u0430 - to Cyrillic strings

I am using Django 1.3. Would you be so kind and answer me one question. I am reading data from my database, where encoding is set to untf8-unicode settings.py DEFAULT_CHARSET = 'utf-8' file.py # -*- coding: utf-8 -*- def get_gift(gift_id): gift…
Roman
  • 21
  • 1
2
votes
0 answers

Merging xfdf into template pdf without losing some special characters (eg. ő,Ű,č)

I have an xfdf file, which is utf8 and may contain non ASCII characters. I would like to merge it with the pdf that contains the form. I tried with pdftk, and although merging happens correctly - in terms of all fields are being populated - some…
2
votes
5 answers

Java UTF-16 Encoding code

The function that encodes a Unicode Code Point (Integer) to a char array (Bytes) in java is basically this: return new char[] { (char) codePoint }; Which is just a cast from the integer value to a char. I would like to know how this cast is…
skiforfun
  • 21
  • 1
  • 2
2
votes
1 answer

Why is an empty string '' encoded into 2 bytes in utf-16 but 0 bytes in utf-8 or ascii?

I was just learning about encoding strings in python and after fidgeting with it a little, I got confused by the fact that the size of an empty string ('') is 0 in utf 8 and ascii but somehow 2 in utf 16? how come? print(len(''.encode('utf16'))) #…
2
votes
1 answer

Comparing gender emojis in UTF-16

I made a program that reads an input string, checks whether it is a certain emoji, and returns a number depending on which emoji it is. The problem comes with emojis that have different genders. For example, the policeman emoji doesn't get detected.…
2
votes
1 answer

Why UTF-8 encoding does not use bytes of the form 11111xxx as the first byte?

According to https://en.wikipedia.org/wiki/UTF-8, the first byte of the encoding of a character never starts with the bit patterns 10xxxxxx or 11111xxx. The reason for the first one is obvious: auto-synchronization. But how about the second?…
Junekey Jeon
  • 1,496
  • 1
  • 11
  • 18