Questions tagged [utf-16]

UTF-16 is a character encoding that represents each Unicode code point using either 2 or 4 bytes.

UTF-16 is a character encoding that represents each code point as one or two 16-bit code units, that is, as a byte sequence of either two or four bytes. It is therefore a variable-width character encoding.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
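As a rough illustration of that algorithm, here is a minimal Python sketch. The function name `encode_utf16_code_units` is hypothetical, and rejecting surrogate code points is a choice made for this sketch rather than something the tag description states. It returns 16-bit code units only; byte order is a separate concern.

```python
from typing import List


def encode_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption of this sketch: lone surrogates are rejected.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit (two bytes).
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        units = encode_utf16_code_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

The result can be checked against Python's built-in codec, for example by comparing with `chr(cp).encode("utf-16-be")`.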

There are three flavors of UTF-16: little-endian, big-endian, and with a BOM (byte order mark).
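To make the three flavors concrete, the following Python sketch encodes the same text with the standard codecs `utf-16-le`, `utf-16-be` and `utf-16`. The helper names `utf16_flavors` and `detect_bom` are hypothetical. Note that Python's plain `utf-16` codec writes a BOM followed by data in the machine's native byte order, so its output differs between platforms.

```python
import codecs
from typing import Dict, Optional


def utf16_flavors(text: str) -> Dict[str, bytes]:
    """Encode text in the three UTF-16 flavors."""
    return {
        "little-endian": text.encode("utf-16-le"),  # no BOM
        "big-endian": text.encode("utf-16-be"),     # no BOM
        "with BOM": text.encode("utf-16"),          # BOM + native byte order
    }


def detect_bom(data: bytes) -> Optional[str]:
    """Return the codec name implied by a leading UTF-16 BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


if __name__ == "__main__":
    for name, data in utf16_flavors("A\U0001F600").items():
        print(f"{name:14} {data.hex(' ')}  BOM: {detect_bom(data)}")
```

Decoding with the plain `utf-16` codec consumes the BOM, while `utf-16-le` and `utf-16-be` treat the data as having a fixed byte order.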

1193 questions
19 votes, 3 answers

Converting xml from UTF-16 to UTF-8 using PowerShell

What's the easiest way to convert XML from UTF16 to a UTF8 encoded file?
David Gardiner
19 votes, 3 answers

How to convert string to unicode(UTF-8) string in Swift?

How do I convert a string to a Unicode (UTF-8) string in Swift? In Objective-C I could write something like this: NSString *str = [[NSString alloc] initWithUTF8String:[strToDecode cStringUsingEncoding:NSUTF8StringEncoding]]; How do I do something similar in Swift?
Dirder
18 votes, 5 answers

javascript and string manipulation w/ utf-16 surrogate pairs

I'm working on a twitter app and just stumbled into the world of utf-8(16). It seems the majority of javascript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide character aware. I've got this…
BentFX
18 votes, 2 answers

Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road. I see references for UTF-8, UTF-16 and UTF-32. My question is two parts: What languages does…
James Oravec
17 votes, 1 answer

utf-16 file seeking in python. how?

For some reason I cannot seek in my UTF-16 file. It produces 'UnicodeException: UTF-16 stream does not start with BOM'. My code: f = codecs.open(ai_file, 'r', 'utf-16') seek = self.ai_map[self._cbClass.Text] #seek is valid int f.seek(seek) while…
marrat
17 votes, 5 answers

Unicode string normalization in C/C++

I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function String.Normalize. I used UTF8-CPP in the past but it does not provide such a function. ICU and Qt provide string normalization but I prefer…
Ghassen Hamrouni
17 votes, 2 answers

Which encoding does Java uses UTF-8 or UTF-16?

I've already read the following posts: What is the Java's internal represention for String? Modified UTF-8? UTF-16? https://docs.oracle.com/javase/8/docs/api/java/lang/String.html Now consider the code given below: public static void main(String[]…
Nitin Bhardwaj
17 votes, 4 answers

Java charAt used with characters that have two code units

From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the…
Patrick Brinich-Langlois
16 votes, 3 answers

Sorting the characters in a UTF-16 string in Java

TLDR Java uses two characters to represent UTF-16. Using Arrays.sort (unstable sort) messes with character sequencing. Should I convert char[] to int[] or is there a better way? Details Java represents a character as UTF-16. But the Character class…
dingy
16 votes, 4 answers

How do I compare each character of a String while accounting for characters with length > 1?

I have a variable string that might contain any unicode character. One of these unicode characters is the han . The thing is that this "han" character has "".length() == 2 but is written in the string as a single character. Considering the code…
Fagner Brack
15 votes, 3 answers

UTF-16 string terminator

What is the string terminator sequence for a UTF-16 string? EDIT: Let me rephrase the question in an attempt to clarify. How does the call to wcslen() work?
Ray
15 votes, 2 answers

Can wprintf output be properly redirected to UTF-16 on Windows?

In a C program I'm using wprintf to print Unicode (UTF-16) text in a Windows console. This works fine, but when the output of the program is redirected to a log file, the log file has a corrupted UTF-16 encoding. When redirection is done in a…
15 votes, 6 answers

UTF-16 to UTF-8 conversion (for scripting in Windows)

What is the best way to convert UTF-16 files to UTF-8? I need to use this in a cmd script.
Grzenio
15 votes, 4 answers

Strange unicode characters when reading in file in node.js app

I am attempting to write a node app that reads in a set of files, splits them into lines, and puts the lines into an array. Pretty simple. It works on quite a few files except some SQL files that I am working with. For some reason I seem to be…
d512
15 votes, 2 answers

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair…
michael