Questions tagged [utf-16]

UTF-16 is a variable-width character encoding that represents each Unicode code point as one or two 16-bit code units, that is, as a byte sequence of either two or four bytes.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
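
As a rough illustration of that algorithm, here is a minimal Python sketch of the encoding step. The function name `encode_utf16_code_units` is hypothetical (it comes from neither the RFC nor any library). It returns 16-bit code units and leaves byte order aside.

```python
def encode_utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit code unit (two bytes).
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


# U+10428 lies above U+FFFF, so it needs two code units (four bytes).
print([hex(u) for u in encode_utf16_code_units(0x10428)])  # ['0xd801', '0xdc28']
```

Python's built-in codec can serve as a cross-check: `chr(0x10428).encode("utf-16-be")` should yield the same two code units in big-endian byte order.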

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM).
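
To make the difference between the flavors concrete, the following Python sketch encodes the same text in each of them using the standard `codecs` module and built-in codecs. The helper `detect_utf16_bom` is a hypothetical name introduced here for illustration only.

```python
import codecs

text = "A\U00010428"  # one BMP character plus one supplementary character

big = text.encode("utf-16-be")           # big-endian, no BOM
little = text.encode("utf-16-le")        # little-endian, no BOM
with_bom = codecs.BOM_UTF16_LE + little  # BOM (FF FE) followed by little-endian data


def detect_utf16_bom(data):
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "no BOM"


print(big.hex())                   # 0041d801dc28
print(little.hex())                # 410001d828dc
print(detect_utf16_bom(with_bom))  # little-endian
# The generic "utf-16" codec reads the BOM, picks the byte order, and strips it.
print(with_bom.decode("utf-16") == text)  # True
```

Data without a BOM carries no indication of its byte order, so the reader has to know in advance whether it is UTF-16LE or UTF-16BE.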

1193 questions
12
votes
1 answer

How to write 3 bytes unicode literal in Java?

I'd like to write the Unicode literal U+10428 in Java. http://www.marathon-studios.com/unicode/U10428/Deseret_Small_Letter_Long_I I tried '\u10428', but it doesn't compile.
kawty
  • 1,656
  • 15
  • 22
12
votes
6 answers

Writing utf16 to file in binary mode

I'm trying to write a wstring to file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried: ofstream outFile("test.txt", std::ios::out | std::ios::binary); wstring hello = L"hello"; outFile.write((char *)…
Cactuar
  • 395
  • 2
  • 6
  • 14
12
votes
3 answers

Encode/Decode std::string to UTF-16

I have to handle a file format (both read from and write to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters out of the ASCII table are rarely used in the application domain, all of the strings in my C++ model…
Peter
  • 237
  • 1
  • 2
  • 8
11
votes
7 answers

Is there a standard technique for packing binary data into a UTF-16 string?

(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now, I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By…
Ðаn
  • 10,934
  • 11
  • 59
  • 95
11
votes
2 answers

How was the position of the Surrogates Area (UTF-16) chosen?

Was the position of the UTF-16 surrogates area (U+D800..U+DFFF) chosen at random, or is there a logical reason for its location?
sid_com
  • 24,137
  • 26
  • 96
  • 187
11
votes
3 answers

UTF-16 Encoding in Java versus C#

I am trying to read a String in UTF-16 encoding scheme and perform MD5 hashing on it. But strangely, Java and C# are returning different results when I try to do it. The following is the piece of code in Java: public static void main(String[] args)…
rkg
  • 5,559
  • 8
  • 37
  • 50
11
votes
4 answers

std::wstring length

What is the result of std::wstring.length() function, the length in wchar_t(s) or the length in symbols? And why? TCHAR r2[3]; r2[0] = 0xD834; // D834, DD1E - musical G clef r2[1] = 0xDD1E; // r2[2] = 0x0000; // '\0' std::wstring r =…
Julian Popov
  • 17,401
  • 12
  • 55
  • 81
11
votes
5 answers

How does Microsoft handle the fact that UTF-16 is a variable length encoding in their C++ standard library implementation

Having a variable length encoding is indirectly forbidden in the standard. So I have several questions: How is the following part of the standard handled? 17.3.2.1.3.3 Wide-character sequences A wide-character sequence is an array object (8.3.4) A…
Šimon Tóth
  • 35,456
  • 20
  • 106
  • 151
11
votes
4 answers

Is there a drastic difference between UTF-8 and UTF-16

I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method. Now, in my Java code, I take that response and then do some processing on it. And later, pass it on to a different…
Kraken
  • 23,393
  • 37
  • 102
  • 162
11
votes
2 answers

Why does Powershell file concatenation convert UTF8 to UTF16?

I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run. $metadataPath =…
dwwilson66
  • 6,806
  • 27
  • 72
  • 117
11
votes
7 answers

Dummy's guide to Unicode

Could anyone give me concise definitions of: Unicode, UTF7, UTF8, UTF16, UTF32, Codepages, and how they differ from Ascii/Ansi/Windows 1252? I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations…
Arec Barrwin
  • 61,343
  • 9
  • 29
  • 25
10
votes
4 answers

Does the Unicode Consortium Intend to make UTF-16 run out of characters?

The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0-0x10FFFF. Does the Unicode Consortium intend to make UTF-16 run out of characters, i.e. define a code point > 0x10FFFF? If not, why would anyone…
GlassGhost
  • 16,906
  • 5
  • 32
  • 45
10
votes
2 answers

Is the XML declaration tag case sensitive?

I have what is probably a really simple, stupid question, but I can't find an answer to it anywhere and I need to be pretty sure about this. I have various XML files from various vendors. One of the vendors provides me an XML file with Japanese…
Frank V
  • 25,141
  • 34
  • 106
  • 144
10
votes
4 answers

Python UTF-16 CSV reader

I have a UTF-16 CSV file which I have to read. The Python csv module does not seem to support UTF-16. I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GB of data. Answers to John Machin's questions below: print…
venky
  • 125
  • 1
  • 2
  • 9
10
votes
1 answer

How can I match emoji with an R regex?

I want to determine which elements of my vector contain emoji: x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사') x # [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사" Related posts only cover other…
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198