Questions tagged [utf-16]

UTF-16 is a variable-width character encoding that represents each Unicode code point as one or two 16-bit code units, that is, as a byte sequence of either two or four bytes.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
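As an illustration of that algorithm, here is a minimal Python sketch that turns one code point into its UTF-16 code units. The function name `utf16_code_units` is hypothetical and chosen only for this example; it is not part of any library.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Code points in the Basic Multilingual Plane use one 16-bit unit.
        return [code_point]
    # Supplementary code points are split into a surrogate pair:
    # subtract 0x10000, then spread the remaining 20 bits over two units.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x61)])     # one code unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # a surrogate pair
```

A code point below U+10000 takes one code unit (two bytes), and anything above takes two code units (four bytes), which is why the encoding is variable-width.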

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM).
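The difference between these flavors is easiest to see in the raw bytes. The sketch below uses Python's built-in `utf-16-be`, `utf-16-le` and `utf-16` codecs; the sample string is an arbitrary choice for illustration.

```python
# "a" is in the BMP; U+1F600 needs a surrogate pair.
text = "a\U0001F600"

big_endian = text.encode("utf-16-be")     # most significant byte first, no BOM
little_endian = text.encode("utf-16-le")  # least significant byte first, no BOM
with_bom = text.encode("utf-16")          # BOM first, then the platform's byte order

print(big_endian.hex(" "))
print(little_endian.hex(" "))
print(with_bom.hex(" "))

# When decoding with the plain "utf-16" codec, the BOM selects the byte order.
print(with_bom.decode("utf-16") == text)
```

The BOM is the code point U+FEFF written at the start of the stream; a decoder that reads it as FE FF treats the data as big-endian, and as FF FE as little-endian.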

1193 questions
5
votes
4 answers

Displaying UTF-16 characters on web browser

I printed some UTF-16 encoded characters and tried to display it in Firefox and it displayed it as �. So I went to Tools->Encoding and changed the encoding from UTF-8 to UTF-16 (I also tried changing charset directly in the HTML) However, when I…
allenylzhou
  • 1,431
  • 4
  • 19
  • 36
5
votes
4 answers

wchar_t for UTF-16 on Linux?

Does it make any sense to store UTF-16 encoded text using wchar_t* on Linux? The obvious problem is that wchar_t is four bytes on Linux and UTF-16 takes usually two (or sometimes two groups of two) bytes per character. I'm trying to use a…
user708549
5
votes
3 answers

Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows

I have encountered an interesting issue on Windows 8. I tested I can represent Unicode characters which are out of the BMP with wchar_t* strings. The following test code produced unexpected results for me: const wchar_t* s1 = L"a"; const wchar_t* s2…
Mark Vincze
  • 7,737
  • 8
  • 42
  • 81
4
votes
7 answers

Are there any dangers to working internally in UTF-8 and then converting to UTF-16 only when needed in Windows?

Visual studio tries to insist on using tchars, which when compiled with the UNICODE option then basically ends up using the wide versions of the Windows and other API. Is there then any danger to using UTF-8 internally in the application (which…
Carl
  • 43,122
  • 10
  • 80
  • 104
4
votes
1 answer

UTF-16 Encoding

Jani ALOK AshuTosh I have the XML parser which supports UTF-8 encoding only else it gives SAX parser exception. How can …
Alok Chaudhary
  • 3,481
  • 1
  • 16
  • 19
4
votes
2 answers

RE2 and UTF16 (or UCS-2)

RE2 is great. Fast and deterministic. However, it supports only UTF8. My strings are natively UTF16, and converting back and forth would kill performance. How difficult would it be to implement native UTF16 capability in RE2? How difficult would it…
MustafaM
  • 493
  • 1
  • 4
  • 14
4
votes
2 answers

Can Character represent all unicode code point?

Since Java char is 16 bit long, I am wondering how can it represent the full unicode code point? It can only represent 65536 code points, is that right?
user705414
  • 20,472
  • 39
  • 112
  • 155
4
votes
1 answer

How to Convert UTF-16 to UTF-32 and Print the Resulting wchar_t in C?

i'm trying to print out a string of UTF-16 characters. i posted this question a while back and the advice given was to convert to UTF-32 using iconv and print it as a string of wchar_t. i've done some research, and managed to code the following: //…
Edwin Lee
  • 3,540
  • 6
  • 29
  • 36
4
votes
3 answers

How can I store UTF-16 characters in a Postgres database?

I am trying to store some text (e.g. č) in a Postgres database, however when retrieving this value, it appears on screen as ?. I'm not sure why it does this, I was under the impression that it was a character that wasn't supported in UTF-8, but was…
Mr Shoubs
  • 14,629
  • 17
  • 68
  • 107
4
votes
1 answer

Wide character Windows

Windows defines the wchar_t symbol to be 16 bits long. However, the UTF-16 encoding used tells us that some symbols may actually be encoded with 4 bytes (32 bits). Does this mean that if I'm developing an application for Windows, the following…
Yippie-Ki-Yay
  • 22,026
  • 26
  • 90
  • 148
4
votes
2 answers

Firefox and UTF-16 encoding

I'm building a website with the encoding UTF-16. It means that every file (html, jsp) is encoded in UTF-16 and I set in the head of every HTML page: My index page is correctly…
user376112
  • 859
  • 5
  • 15
  • 24
4
votes
1 answer

Why was the Python Unicode internal format implemented as described in PEP 100?

http://www.python.org/dev/peps/pep-0100/ PEP 100 states that the internal format, Python Unicode, holds UTF-16 encodings, but addresses the values as UCS-2 (or UCS-4 when compiled with flag --enable-unicode=ucs4). Why wasn't UTF-16 chosen (a…
mkelley33
  • 5,323
  • 10
  • 47
  • 71
4
votes
2 answers

Error which "shouldn't happen" caused by MalformedInputException when reading file to string with UTF-16

Path file = Paths.get("New Text Document.txt"); try { System.out.println(Files.readString(file, StandardCharsets.UTF_8)); System.out.println(Files.readString(file, StandardCharsets.UTF_16)); } catch (Exception e) { …
H.v.M.
  • 1,348
  • 3
  • 16
  • 42
4
votes
3 answers

Detect (or best guess of) incoming string encoding in Java

I was wondering if there are known methods to detect (or give a best guess of) the encoding of a particular string in Java. I know that you always need some additional meta-data to tell what the encoding is, and there are best practices etc., but…
SuPra
  • 8,488
  • 4
  • 37
  • 30
4
votes
2 answers

Get UTF-16 code unit at a given index in ABAP

I want to get the UTF-16 code unit at a given index in ABAP. Same can be done in JavaScript with charCodeAt(). For example "d".charCodeAt(); will give back 100. Is there a similar functionality in ABAP?
schmelto
  • 427
  • 5
  • 18