Questions tagged [utf-16]

UTF-16 is a character encoding that represents each Unicode code point using either 2 or 4 bytes.

UTF-16 is a character encoding that represents each code point as one or two 16-bit code units, that is, as a sequence of two or four bytes. It is therefore a variable-width character encoding.

The algorithm for encoding code points as UTF-16 is described in RFC 2781.
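As a rough sketch of that encoding step (the helper name `encode_code_point` is made up for illustration and is not taken from the RFC), a code point below U+10000 becomes a single 16-bit code unit, while a higher one is split into a high and a low surrogate:

```python
def encode_code_point(cp: int) -> list:
    """Return the UTF-16 code units for one code point (hypothetical helper)."""
    # Surrogates and values above U+10FFFF cannot be encoded.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        # Basic Multilingual Plane: one code unit, value unchanged.
        return [cp]
    # Supplementary planes: subtract 0x10000, leaving a 20-bit value.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in encode_code_point(0x41)])      # ['0x41']
print([hex(u) for u in encode_code_point(0x1D10C)])   # ['0xd834', '0xdd0c']
```

In practice a language's built-in codec (for example `str.encode("utf-16-be")` in Python) does this for you; the sketch only makes the arithmetic visible.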

There are three flavors of UTF-16: little-endian (UTF-16LE), big-endian (UTF-16BE), and UTF-16 with a byte order mark (BOM).
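To make the difference concrete, here is a small sketch using Python's standard codecs; the sample string is an arbitrary choice for illustration. The same text yields different byte sequences depending on byte order, and a leading BOM lets a decoder work out which order was used:

```python
import codecs

text = "A\U0001D10C"  # one BMP character and one supplementary character

le = text.encode("utf-16-le")  # little-endian, no BOM
be = text.encode("utf-16-be")  # big-endian, no BOM

print(le.hex())  # 410034d80cdd
print(be.hex())  # 0041d834dd0c

# With a BOM prepended, the generic "utf-16" codec detects the byte order
# from the first two bytes and strips the BOM while decoding.
with_bom_le = codecs.BOM_UTF16_LE + le  # starts with FF FE
with_bom_be = codecs.BOM_UTF16_BE + be  # starts with FE FF

print(with_bom_le.decode("utf-16") == text)  # True
print(with_bom_be.decode("utf-16") == text)  # True
```

Without a BOM, the byte order has to be known from context, which is why the explicit `utf-16-le` and `utf-16-be` codec names exist.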

1193 questions
9
votes
5 answers

Java implicit conversion of int to byte

I am about to start working on something that requires reading bytes and creating strings. The bytes being read represent UTF-16 strings. So, just to test things out, I wanted to convert a simple byte array in UTF-16 encoding to a string. The first…
DaveJohnston
  • 10,031
  • 10
  • 54
  • 83
9
votes
3 answers

Unicode BOM for UTF-16LE vs UTF32-LE

It seems like there's an ambiguity between the Byte Order Marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes: FF FE 00 00 00 00 00 00. How can I tell if this file contains: The UTF-16LE BOM…
Edward Loper
  • 15,374
  • 7
  • 43
  • 52
9
votes
8 answers

Detect UTF-16 file content

Is it possible to know whether a file has Unicode (16 bits per char) or 8-bit ASCII content?
Franck Freiburger
  • 26,310
  • 20
  • 70
  • 95
9
votes
1 answer

UTF-16 to UTF-8 conversion in JavaScript

I have Base64-encoded data that is in UTF-16. I am trying to decode the data, but most libraries only support UTF-8. I believe I have to drop the null bytes, but I am unsure how. Currently I am using David Chambers' polyfill for Base64, but I have also…
Don P
  • 570
  • 1
  • 5
  • 12
8
votes
2 answers

How to get a reliable unicode character count in Python?

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get '\U0001d10c' (length 1). I'm trying to count…
Travis
  • 2,961
  • 4
  • 22
  • 29
8
votes
1 answer

Conversion from wstring to u16string and back (standard conform) in C++17 / C++20

My main platform is Windows, which is why I use UTF-16 internally (mostly BMP strings). I would like to use console output for these strings. Unfortunately there is no std::u16cout or std::u8cout, so I need to use std::wcout. Therefore I…
Bernd
  • 2,113
  • 8
  • 22
8
votes
1 answer

Python3 reading mixed text/binary data line-by-line

I need to parse a file which has a UTF-16 text header followed directly by binary data. To be able to read the binary data, I open the file in "rb" mode, then, for reading the header, wrap it in an io.TextIOWrapper(). The problem is that when I…
itecMemory
  • 301
  • 2
  • 8
8
votes
0 answers

SonarQube - Unable to analyse xml and xsd file, with UTF-16 encoding

I'm using SonarQube (version 5.6.7) and sonar-scanner (version 3.0.3.778) for analysing some documents. Among these documents there are also .xml and .xsd files with UTF-16 encoding. When I launch my sonar-scanner command from the command line, with…
Nicomedes E.
  • 1,326
  • 5
  • 18
  • 27
8
votes
2 answers

How to force UTF-8 in node js with exec process?

I know the solution is very simple, but I've been banging my head against it for an hour. In Windows 10, if I launch the command "dir", I get this result: Il volume nell'unità D non ha etichetta. In Node.js I try to exec the dir command in this way: var child =…
Janka
  • 1,908
  • 5
  • 20
  • 41
8
votes
2 answers

Converting wstring to lower case

I want to convert a wstring to lower case. I found a lot of answers that use locale info. Is there any function like ToLower() for wstring as well?
msing
  • 81
  • 1
  • 1
  • 2
8
votes
1 answer

Truncated Read With UTF-16-Encoded Text in C++

My goal is to convert external input sources to a common, UTF-8 internal encoding, since it is compatible with many libraries I use (such as RE2) and is compact. Since I do not need to do string slicing except with pure ASCII, UTF-8 is an ideal…
Alex Huszagh
  • 13,272
  • 3
  • 39
  • 67
8
votes
4 answers

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'. The paths are null-terminated, root-relative Windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys". I want to hash this path into a 64-bit unsigned…
Dominik Weber
  • 711
  • 5
  • 13
8
votes
1 answer

How to convert from utf-16 to utf-32 on Linux with std library?

On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But in GCC (gcc (Debian 4.7.2-5) 4.7.2) this new feature seemingly hasn't been implemented yet. Is there a way to perform such a conversion on Linux without iconv…
Al Berger
  • 1,048
  • 14
  • 35
8
votes
5 answers

Is it possible to reliably auto-decode user files to Unicode? [C#]

I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files. Since I'd…
NVRAM
  • 6,947
  • 10
  • 41
  • 44
8
votes
5 answers

Convert Short Array to String C#

Is it possible to convert a short array to a string, then show the text? short[] a = new short[] {0x33, 0x65, 0x66, 0xE62, 0xE63}; The array contains UTF-16 code units (including Thai characters). How can I output it and show the Thai and English words? Thank…
Fusionmate
  • 679
  • 1
  • 8
  • 18