Questions tagged [utf-16]

UTF-16 is a variable-width character encoding that represents each Unicode code point as one or two 16-bit code units, that is, as a byte sequence of either two or four bytes.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
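
As a rough illustration of that algorithm, here is a minimal Python sketch that turns one code point into its UTF-16 code units. The function name `encode_utf16_code_units` is made up for this example, and rejecting surrogate code points is a choice of this sketch rather than something the tag description spells out.

```python
def encode_utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit code units that encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption of this sketch: surrogate code points are rejected.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit (two bytes).
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in encode_utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in encode_utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The second call shows the four-byte case: U+1F600 lies above U+FFFF, so it needs a surrogate pair.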

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM).
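
The difference between the flavors is easiest to see in the bytes themselves. The following sketch uses Python's built-in codecs; the sample string is arbitrary, and the byte order of the plain `utf-16` codec's output depends on the machine it runs on.

```python
import codecs

text = "A\U0001F600"  # one BMP character and one supplementary character

le = text.encode("utf-16-le")     # little-endian, no BOM
be = text.encode("utf-16-be")     # big-endian, no BOM
with_bom = text.encode("utf-16")  # BOM first, then native byte order

print(le.hex())  # 41003dd800de
print(be.hex())  # 0041d83dde00

# The BOM is U+FEFF written in the chosen byte order.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# When decoding, the "utf-16" codec reads the BOM to pick the byte order.
print(with_bom.decode("utf-16") == text)  # True
```

Note that decoding BOM-less data requires knowing the byte order in advance, which is why the explicit `utf-16-le` and `utf-16-be` codecs exist.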

1193 questions
14
votes
3 answers

How to convert Rust strings to UTF-16?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not valid Rust 1.0 code, but the answers still contain valuable information. I want to pass a string literal to a Windows API. Many Windows functions use UTF-16 as the…
Gigih Aji Ibrahim
  • 405
  • 1
  • 3
  • 10
14
votes
2 answers

java string.getBytes("UTF-8") javascript equivalent

I have this string in java: "test.message" byte[] bytes = plaintext.getBytes("UTF-8"); //result: [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101] If I do the same thing in javascript: stringToByteArray: function (str) { …
user429620
13
votes
5 answers

Why doesn't Git natively support UTF-16?

Git supports several different encoding schemes, UTF-7, UTF-8, and UTF-32, as well as non-UTF ones. Given this, why doesn't it support UTF-16? There are a lot of questions that ask how to get Git to support UTF-16, but I don't think that this has been…
Zac Faragher
  • 963
  • 13
  • 26
13
votes
1 answer

Why can I not read a UTF-16 file longer than 4094 characters?

Some information: I've only tried this on Linux. I've tried both with GCC (7.2.0) and Clang (3.8.1). It requires C++11 or higher, to my understanding. What happens when I run it: I get the expected string "abcd" repeated until it hits the position of…
13
votes
7 answers

findstr or grep that autodetects character encoding (UTF-16)

I want to do this: findstr /s /c:some-symbol * or the grep equivalent grep -R some-symbol * but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-ordering mark…
David Martin
  • 181
  • 1
  • 2
  • 7
13
votes
2 answers

Should I change from UTF-8 to UTF-16 to accommodate Chinese characters in my HTML?

I am using ASP.NET MVC, MS SQL and IIS. I have a few users that have used Chinese characters in their profile info. However, when I display this information it shows up as æŽå¼·è¯, but the characters are correct in my database. …
Aaron Salazar
  • 4,467
  • 10
  • 39
  • 54
13
votes
4 answers

Using unicode characters bigger than 2 bytes with .Net

I'm using this code to generate U+10FFFC: var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC}); I know it's for private use and such, but it does display a single character as I'd expect when displaying it. The problems come when…
Earlz
  • 62,085
  • 98
  • 303
  • 499
13
votes
3 answers

Pandas read_csv and UTF-16

I have a CSV text file encoded in UTF-16 (so as to preserve Unicode characters when others use Excel) but when doing a read_csv with Pandas 0.9.0, I get this cryptic error: df =…
Brian Keegan
  • 2,208
  • 4
  • 24
  • 31
13
votes
1 answer

Using iconv to convert from UTF-16BE to UTF-8 without BOM

I'm trying to convert a UTF-16BE encoded file (byte order mark: 0xFE 0xFF) to UTF-8 using iconv like so: iconv -f UTF-16BE -t UTF-8 myfile.txt The resulting output, however, has the UTF-8 byte order mark (0xEF 0xBB 0xBF) and that is not what I…
Edward Samson
  • 2,395
  • 2
  • 26
  • 39
12
votes
2 answers

R write.csv with UTF-16 encoding

I'm having trouble outputting a data.frame using write.csv using UTF-16 character encoding. Background: I am trying to write out a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text…
Daniel Dickison
  • 21,832
  • 13
  • 69
  • 89
12
votes
3 answers

What is the difference between "UTF-16" and "std::wstring"?

Is there any difference between these two string storage formats?
hkBattousai
  • 10,583
  • 18
  • 76
  • 124
12
votes
3 answers

Why were the code points in the range of U+D800 to U+DFFF removed from the Unicode character set?

I am learning about UTF-16 encoding, and I have read that if you want to represent code points in the range of U+10000 to U+10FFFF, then you have to use surrogate pairs, which are in the range of U+D800 to U+DFFF. So let's say I want to encode the…
paul
  • 695
  • 7
  • 17
12
votes
3 answers

In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?

UTF-16 is a two-byte character encoding. Exchanging the two bytes' addresses will produce UTF-16BE and UTF-16LE. But I find the name UTF-16 encoding exists in the Ubuntu gedit text editor, as well as UTF-16BE and UTF-16LE. With a C test program I…
hao.zhou
  • 131
  • 1
  • 1
  • 4
12
votes
3 answers

dos2unix: Binary symbol 0x04 found at line 1703

I download a file from the OECD http://stats.oecd.org/Index.aspx?datasetcode=CRS1 ('CRS 2013 data.txt') by selecting Export-> Related files. I want to work with this file in Ubuntu (14.04 LTS). When I run: dos2unix CRS\ 2013\ data.txt I…
dw8547
  • 258
  • 1
  • 2
  • 11
12
votes
1 answer

JSON.stringify() to UTF-8

As far as I know, JavaScript fundamentally uses UTF-16 as its standard for strings. With JSON.stringify() I can create a JSON string from an object. Is that JSON string UTF-16 encoded? Can I convert (hopefully fast) that string to UTF-8 to save…
Sebastian Barth
  • 4,079
  • 7
  • 40
  • 59