Questions tagged [utf-16]

UTF-16 is a variable-width character encoding that represents each Unicode code point as one or two 16-bit code units, that is, as either two or four bytes.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
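
As a minimal sketch of that encoding step, the function below maps one code point to its 16-bit code units. The name `encode_utf16_units` is hypothetical and chosen only for illustration. Code points below U+10000 map to a single unit, and higher ones are split into a high and a low surrogate.

```python
from typing import List


def encode_utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit code units for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    # Code points in the Basic Multilingual Plane fit in one code unit.
    if code_point < 0x10000:
        return [code_point]

    # Otherwise subtract 0x10000 and split the remaining 20 bits:
    # the top 10 bits go into the high surrogate, the bottom 10 into the low one.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


units = encode_utf16_units(0x1F604)
print([hex(u) for u in units])  # ['0xd83d', '0xde04']
```

Python's built-in codec can be used as a cross-check: `"\U0001F604".encode("utf-16-be")` yields the same two units as four bytes.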

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM).
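
The difference between the flavors can be seen with Python's standard codecs, shown here as a rough illustration rather than a complete treatment. The sample string `"Dev"` is an arbitrary choice. The plain `utf-16` codec writes a BOM when encoding and reads it to pick the byte order when decoding.

```python
import codecs

text = "Dev"

# Explicit byte order, no BOM is written.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le)  # b'D\x00e\x00v\x00'
print(be)  # b'\x00D\x00e\x00v'

# The generic codec prepends a BOM and uses the platform's native byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# When decoding, the BOM selects the byte order and is removed from the result.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
```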

1193 questions
5 votes, 2 answers

Converting a UTF-16LE Elixir bitstring into an Elixir String

Given an Elixir bitstring encoded in UTF-16LE: <<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0, 0, 0>> how can I get this converted into a readable Elixir String (it spells out "Devastator")? The closest I've gotten is…
user701847
  • 337
  • 3
  • 15
5 votes, 2 answers

UTF-16 decoder not working as expected

I have a part of my Unicode library that decodes UTF-16 into raw Unicode code points. However, it isn't working as expected. Here's the relevant part of the code (omitting UTF-8 and string manipulation stuff): typedef struct string { unsigned…
Delan Azabani
  • 79,602
  • 28
  • 170
  • 210
5 votes, 2 answers

How to best deal with Windows' 16-bit wchar_t ugliness?

I'm writing a wrapper layer to be used with mingw which provides the application with a virtual UTF-8 environment. Functions which deal with filenames are wrappers which convert from UTF-8 and call the corresponding "_w" functions, and so on. The…
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
5 votes, 1 answer

fatal error: high- and low-surrogate code points are not valid Unicode scalar values

Sometimes initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values. What is this error, why does it occur, and how can I prevent it in…
Vatsal Manot
  • 17,695
  • 9
  • 44
  • 80
5 votes, 3 answers

Javascript: unicode character to BYTE based hex escape sequence (NOT surrogates)

In JavaScript I am trying to convert Unicode characters into byte-based hex escape sequences that are compatible with C, i.e. 😄 becomes \xF0\x9F\x98\x84 (correct), not the JavaScript surrogates \uD83D\uDE04 (wrong). I cannot figure out the math relationship…
ck_
  • 3,353
  • 5
  • 31
  • 33
5 votes, 1 answer

Opening and reading UTF-16 files in Python

Recently I have been having trouble opening specific UTF-16 encoded files in Python. I have tried the following: import codecs f = codecs.open('filename.data', 'r', 'utf-16-be') contents = f.read() but I get the following error: UnicodeDecodeError:…
DJMcCarthy12
  • 3,819
  • 8
  • 28
  • 34
5 votes, 3 answers

grep and tail -f for a UTF-16 binary file - trying to use simple awk

How can I achieve the equivalent of: tail -f file.txt | grep 'regexp' to only output the buffered lines that match a regular expression such as 'Result' from the file type: $ file file.txt file.txt:Little-endian UTF-16 Unicode text, with CRLF line…
Alexander McFarlane
  • 10,643
  • 9
  • 59
  • 100
5 votes, 2 answers

C#: how to get first character of a string?

We already have a question about getting the first 16-bit char of a string. This includes the question code: MyString.ToCharArray[0] and accepted answer code: MyString[0] I guess there are some uses for that, but when the string contains text we…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
5 votes, 4 answers

How to reverse a string that contains surrogate pairs

I have written this method to reverse a string public string Reverse(string s) { if(string.IsNullOrEmpty(s)) return s; TextElementEnumerator enumerator = …
Sachin Kainth
  • 45,256
  • 81
  • 201
  • 304
5 votes, 2 answers

Why does JQuery only display HTML characters when enclosed in other tags?

I'm curious about why this JQuery renders the full block HTML character: var html = $('
'); $("body").append(html) But this doesn't: var html = $('█'); $("body").append(html) Is there a way to render one single special…
user2950747
  • 695
  • 1
  • 6
  • 19
5 votes, 1 answer

UnicodeEncodeError: 'charmap' codec can't encode character: character maps to <undefined>

I have a problem writing Unicode text to a file. I am using Python 2.7.3. It gives me this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 1006: character maps to <undefined>. Here is a sample of my code:…
yozhik
  • 4,644
  • 14
  • 65
  • 98
5 votes, 2 answers

Will UTF-8 strings always be shorter than UTF-16?

If I have two strings of the same text, one UTF-8 and the other UTF-16, is it safe to assume the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (byte-wise)?
Josh
  • 6,046
  • 11
  • 52
  • 83
5 votes, 2 answers

VS 2012 Encoding in the declaration 'utf-16' does not match document 'utf-8'

When I open Visual Studio 2012, I am greeted with the message "Visual Studio The encoding in the declaration 'utf-16' does not match the encoding of the document 'utf-8'". Does anyone know why this might be happening? Or what troubleshooting I…
Ryan Gates
  • 4,501
  • 6
  • 50
  • 90
5 votes, 2 answers

How to convert UTF8 string to UTF16

I'm getting a UTF8 string by processing a request sent by a client application. But the string is really UTF16. What can I do to get it into my local string is a letter followed by \0 character? I need to convert that String into UTF16. Sample…
dinesh707
  • 12,106
  • 22
  • 84
  • 134
5 votes, 3 answers

How can I check for the existence of UTF-16 filenames in Perl?

I have a textfile encoded in UTF-16. Each line contains a number of columns separated by tabs. For those who care, the file is a playlist TXT export from iTunes. Column #27 contains a filename. I am reading it using Perl 5.8.8 in Linux using code…
blt04
  • 692
  • 5
  • 12