Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text, intended to support all the characters required for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
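As a minimal sketch of the mapping above, Python's built-in ord() and chr() convert between a character and its code point. The helper name code_point_label is hypothetical and only used for illustration here.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# "\u039b" and "\u039c" are the Greek capitals lambda and mu from the list above.
labels = [code_point_label(ch) for ch in "ABC\u039b\u039c"]
print(labels)

# chr() goes the other way, from a code point back to the character.
round_trip = "".join(chr(int(label[2:], 16)) for label in labels)
print(round_trip)
```

The labels are padded to at least four hexadecimal digits, which matches the usual U+XXXX notation.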

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes each code point as a sequence of one to four bytes) and UTF-16 (which encodes each code point as one or two 16-bit code units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
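The table rows can be reproduced with a short Python sketch, assuming the standard "utf-8" and "utf-16-be" codecs. The helper encoded_hex is a hypothetical name, and U+1F600 is added here only as an illustration of a code point that needs four bytes in both encodings.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated hex pairs."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))

# One code point each from the 1-byte, 2-byte and 4-byte UTF-8 ranges.
samples = ["A", "\u039b", "\U0001F600"]

rows = [
    (f"U+{ord(ch):04X}", encoded_hex(ch, "utf-8"), encoded_hex(ch, "utf-16-be"))
    for ch in samples
]

for code_point, utf8, utf16 in rows:
    print(f"{code_point:<12}{utf8:<16}{utf16}")
```

Code points above U+FFFF are stored in UTF-16 as a surrogate pair, which is why the last row has four bytes in that column.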

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
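A small sketch of two of these operations, using only Python's standard library: unicodedata.normalize for normalization, and str.upper and str.casefold for case mapping. It does not cover locale-sensitive sorting, which the standard library does not implement in full.

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent) renders like U+00E9,
# but the two strings compare unequal until they are normalized.
decomposed = "e\u0301"
precomposed = "\u00e9"
equal_before = decomposed == precomposed
equal_after = unicodedata.normalize("NFC", decomposed) == precomposed
print(equal_before, equal_after)

# Case mapping can change the length of a string: U+00DF (sharp s)
# uppercases to "SS" and case-folds to "ss".
upper_sharp_s = "\u00df".upper()
folded = "Stra\u00dfe".casefold()
print(upper_sharp_s, folded)
```

Normalizing both sides to the same form (NFC here) before comparing is a common way to avoid mismatches between visually identical strings.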

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

Related Tags

24916 questions
12
votes
7 answers

How to solve UnicodeDecodeError in Python 3.6?

I have switched from Python 2.7 to Python 3.6. I have scripts that deal with some non-English content. I usually run scripts via Cron and also in Terminal. I had a UnicodeDecodeError in my Python 2.7 scripts and I solved it with this: # encoding=utf8 …
Umair Ayub
  • 19,358
  • 14
  • 72
  • 146
12
votes
2 answers

Encoding for Multilingual .py Files

I am writing a .py file that contains strings from multiple character sets, including English, Spanish, and Russian. For example, I have something like: string_en = "The quick brown fox jumped over the lazy dog." string_es = "El veloz murciélago…
Katrina
  • 409
  • 5
  • 16
12
votes
2 answers

Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Well, let me introduce the problem first. I've got some data via POST/GET requests. The data was a UTF-8 encoded string. Little did I know that, and I converted it with just the str() method. And now I have a database full of "nonsense data" and couldn't find…
darkless
  • 1,304
  • 11
  • 19
12
votes
3 answers

Does an nvarchar always store each character in two bytes?

I had (perhaps naively) assumed that in SQL Server, an nvarchar would store each character in two bytes. But this does not always seem to be the case. The documentation out there suggests that some characters might take more bytes. Does someone have…
Rich
  • 2,207
  • 1
  • 23
  • 27
12
votes
6 answers

Best way to convert between [Char] and [Word8]?

I'm new to Haskell and I'm trying to use a pure SHA1 implementation in my app (Data.Digest.Pure.SHA) with a JSON library (AttoJSON). AttoJSON uses Data.ByteString.Char8 bytestrings, SHA uses Data.ByteString.Lazy bytestrings, and some of my string…
cmars232
  • 121
  • 1
  • 3
12
votes
1 answer

Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?

I have a text file which the publisher (the US Securities Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code: def tags(filename): """Yield Tag…
MikeRand
  • 4,788
  • 9
  • 41
  • 70
12
votes
2 answers

Loading special characters with PyYaml

I'm working on loading a list of emoji characters in a simple python 3.6 script. The YAML structure is essentially as follows: - - - My python script looks like this: import yaml f = open('emojis.yml') EMOJIS = yaml.load(f) f.close() I'm…
Quinn Stearns
  • 162
  • 1
  • 8
12
votes
4 answers

How to convert emoticons to its UTF-32/escaped unicode?

I am working on a chat application in WPF and I want to use emoticons in it. I want to read emoticons coming from Android/iOS devices and show the respective images. On WPF, I am getting a black Emoticon looking…
Joker_37
  • 839
  • 2
  • 8
  • 20
12
votes
1 answer

Pytesseract foreign language extraction using python

I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine. I tried to extract text for the Korean and Russian languages, and I am positive that I extracted it. And now I need to compare with the string and string got…
Deepan Raj
  • 385
  • 1
  • 5
  • 16
12
votes
6 answers

Spring-Boot, Can't save unicode string in MySql using spring-data JPA

I have my application.properties set up like this : spring.datasource.username = root spring.datasource.password = root spring.datasource.url =…
erluxman
  • 18,155
  • 20
  • 92
  • 126
12
votes
6 answers

Internationalization in MFC

It's finally time (after years of postponing) to localize my app into a few languages other than English. The first challenge is to design the integration into my C++ / MFC application that has dozens of dialogs and countless strings. I came…
Cosmin
  • 6,623
  • 3
  • 26
  • 28
12
votes
1 answer

python unicode rendering: how to know if a unicode character is missing from the font

In Python when I render a unicode character, e.g. a Chinese character, with a selected font, sometimes the font is incomplete regarding the common unicode characters, and can't render the unicode character in question. In those cases, if I call the…
MichM
  • 886
  • 1
  • 12
  • 28
12
votes
2 answers

What are all the Japanese whitespace characters?

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character…
Mason
  • 5,071
  • 4
  • 25
  • 24
12
votes
3 answers

What is the difference between "UTF-16" and "std::wstring"?

Is there any difference between these two string storage formats?
hkBattousai
  • 10,583
  • 18
  • 76
  • 124
12
votes
5 answers

Unicode characters not showing in System.Windows.Forms.TextBox

These characters show fine when I cut-and-paste them here from the VisualStudio debugger, but both in the debugger, and in the TextBox where I am trying to display this text, it just shows squares. 说明\r\n海流受季风影响,3-9 月份其流向主要向北,流速为2 节,有时达3 节;10 月至次年4…
Sean
  • 1,373
  • 2
  • 14
  • 26