Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract CodePoints and their interactions. It also defines multiple encodings for storage and exchange of those CodePoints. All of them can express all valid Unicode CodePoints, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • utf-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences in the other encodings, and is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
  • punycode: Used only for internationalized domain names (historical contenders were utf-5 and utf-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
  • utf-7: Designed for systems that are not 8-bit clean, such as old email, but it never gained much popularity even there.
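
As a rough illustration of the properties listed above, the following Python sketch uses only codecs that ship with CPython (`utf-8`, `punycode`, `idna`, `utf-7`, `gb18030`). The sample strings and the helper name `is_ascii_only` are made up for this example and are not part of the tag description.

```python
def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits."""
    return all(b < 0x80 for b in data)


# utf-8 is an ASCII superset: pure ASCII text encodes to the same bytes.
ascii_text = "plain text"
utf8_matches_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# utf-8 is variable width: 1 to 4 bytes per code point.
# Samples: A, e-acute (U+00E9), euro sign (U+20AC), G clef (U+1D11E).
samples = ("A", "\u00e9", "\u20ac", "\U0001d11e")
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# punycode squeezes a non-ASCII domain label into ASCII; the idna codec
# applies it per label and adds the "xn--" prefix.
punycode_label = "b\u00fccher".encode("punycode")
idna_name = "b\u00fccher.example".encode("idna")

# utf-7 keeps every byte within 7 bits, at the cost of readability.
utf7_bytes = "na\u00efve caf\u00e9".encode("utf-7")

# GB18030 also covers all of Unicode, so a round trip is lossless.
gb_round_trip = "\U0001d11e \u4e2d\u6587".encode("gb18030").decode("gb18030")

print(utf8_matches_ascii, utf8_lengths)
print(punycode_label, idna_name)
print(utf7_bytes, is_ascii_only(utf7_bytes))
```

The byte counts are the point here: the same text stays compact in utf-8, while punycode and utf-7 trade readability for a 7-bit-safe representation.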

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • utf-16: Early adopters, who embraced Unicode back when people thought 64K CodePoints would be enough, moved to this encoding. Apart from orphaned surrogates, it cannot encode invalid utf-8 or utf-32 sequences. It is also rarely more space-efficient than utf-8, and it is not fixed-width (strictly speaking, not even utf-32 is).
  • utf-32 (identical to ucs4): This is the one-CodeUnit-per-CodePoint encoding. Because combining CodePoints negate this single, questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
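
The width and byte-order claims above can be checked with a short Python sketch. It is only an illustration: the helper `code_units`, the `UNIT_SIZE` table, and the sample characters are assumptions made for this example, while the codec names are the ones CPython ships.

```python
import codecs
import unicodedata

# Bytes per code unit for the BOM-less variants of each encoding.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-16-be": 2,
             "utf-32-le": 4, "utf-32-be": 4}


def code_units(text: str, encoding: str) -> int:
    """Count the code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 64K BMP range
# utf-16 needs a surrogate pair for this code point, so it is not fixed width.
units = {enc: code_units(clef, enc)
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# The three variants: explicit big-endian, explicit little-endian,
# and "any-endian", where a leading BOM announces the byte order.
big = "A".encode("utf-16-be")      # b'\x00A'
little = "A".encode("utf-16-le")   # b'A\x00'
with_bom = "A".encode("utf-16")    # BOM first, then platform byte order

# Even in utf-32, one user-visible character may span several code points.
decomposed = "A\u0301"             # A followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(units)
print(big, little, with_bom.startswith(codecs.BOM_UTF16))
print(code_units(decomposed, "utf-32-le"), code_units(composed, "utf-32-le"))
```

The last line shows why one CodeUnit per CodePoint is a limited benefit: the decomposed and precomposed forms render the same but differ in length.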

857 questions
3 votes, 2 answers

How to combine two code points to get one?

I know that the Unicode code point for Á is U+00C1. I read on the internet, in many forums and articles, that I can also make an Á by combining the characters ´ (Unicode: U+00B4) and A (Unicode: U+0041). My question is simple: how do I do it? I tried something…
Filip
3 votes, 3 answers

Why does my html email appear differently when sent by AWS SES?

I've been working on some custom html email templates. I'm having some trouble with my emails appearing differently when they are sent by different email services. I'm using AWS SES to send these emails to clients. I've been using Postdrop to send…
Sam Sabin
3 votes, 0 answers

What type should I be using to store and print these unicode characters? (C++)

I'm writing a CLI version of the game Snake in C++, and I want to make the snake look cool using different unicode characters to represent different states of the snake. I am using the following characters : ╔, ╗, ╚, ╝, ═, ║, ⩓, ⩔, ⪡, ⪢. I…
Harry
3 votes, 1 answer

UTF8 to UTF16 conversion using std::filesystem::path

Starting from C++11 one can convert UTF8 to UTF16 wchar_t (at least on Windows, where wchar_t is 16 bit wide) using std::codecvt_utf8_utf16: std::wstring utf8ToWide( const char* utf8 ) { std::wstring_convert>…
Fedor
3 votes, 2 answers

How to represent this utf-8 encoded string in Rust?

On this RFC: https://www.rfc-editor.org/rfc/rfc7616#page-19 at page 19, there's this example of a text encoded in UTF-8: J U+00E4 s U+00F8 n D o e 4A C3A4 73 C3B8 6E 20 44 6F 65 How do I represent it in a Rust String? I tried…
Gatonito
3 votes, 2 answers

Why is '\u{1D11E}'.charAt(0) not equal to '\u{1D11E}'?

When I try to evaluate this expression in the console, I get false as the result. Why? console.log('\u{1D11E}'.charAt(0) === '\u{1D11E}')
Json Prime
3 votes, 2 answers

TypeError: Unicode-objects must be encoded before hashing in Hashlib Function

I have checked out all of the other solutions to the same problem on stackoverflow and also tried them, but nothing seemed to work. I am simply posting links here instead of the code as the code is huge and it would be less interactive. Link to the…
3 votes, 1 answer

PHP DOMDocument Japanese Character encoding issue

I have a file called: ニューヨーク・ヤンキース-チケット-200x225.jpg I am able to successfully do this with my PHP code: if (file_exists(ABSPATH . 'ニューヨーク・ヤンキース-チケット-200x225.jpg')) { echo 'yes'; } However, when I parse my content using DOMDocument, that…
3 votes, 1 answer

Xcode keeps guessing and interpreting with wrong encoding

I am using Xcode 4.0.2, the latest release of Xcode. All my projects and standalone source files are in UTF-8 encoding. But when I open some source files (C/C++/Objective-C), all text is interpreted in Mac OS Roman encoding and I don't know why. I've…
maskov1
3 votes, 1 answer

Python: bz2 and lzma in mode 'wt' don't write the BOM (while gzip does). Why?

The following code writes a compressed text file using gzip, bz2 and lzma, then reads and prints its binary content. import bz2 import gzip import lzma import os def test(encoding): print(encoding) for module in [gzip, bz2, lzma]: …
janluke
3 votes, 1 answer

How to create UTF characters dynamically with JavaScript

I'm trying to use a variable and \u to create a UTF character with Node.js. var code = '0045'; console.log('\u0045', '\u' + code); But the output becomes E u0045. How do I make it E E? How do I create the character and store it in a variable?
tirithen
3 votes, 1 answer

What is the list of python settings that affect encoding, decoding, and printing?

When I run into unicode printing problems, I want to know what I should check. In my particular case, I'm using an installed module that is printing unicode encoded characters using the wrong codec. There are several disparate places that affect…
JamesThomasMoon
3 votes, 1 answer

Unable to display a degree char (°) in a SVG

I'm trying to display a degree character (°) in a SVG manipulated by D3.js. So far I've tried different charsets but I'm always getting a � instead and the following character codes simply display themselves as regular text: ° or °. I'm…
VincentDM
3 votes, 1 answer

How to know number of bytes of a Binary File?

How to count number of bytes of this binary file (t.dat) without running this code (as a theoretical question) ? Assuming that you run the following program on Windows using the default ASCII encoding. public class Bin { public static void…
dave
3 votes, 5 answers

Invalid URI with Chinese characters (Java)

Having trouble setting up a URL connection with Chinese characters in the URL. It works with Latin characters: String xstr = "维也纳恩斯特哈佩尔球场" ; URI uri = new URI("http","ajax.googleapis.com","/ajax/services/language/detect","v=1.0&q="+xstr,null); …
Joe Knapp