Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple UTFs for storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, ability to represent invalid data, and efficiency.

  • UTF-8 (people sometimes write only "UTF" for this encoding) can encode all valid and invalid sequences in the other encodings, and it is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
  • Punycode: Used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7: This encoding was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
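As a rough illustration of the list above, the following Python sketch encodes one sample string with several of these encodings using the codecs that ship with CPython. The sample word and the variable names are arbitrary choices for this example and are not part of the tag description.

```python
# Encode one sample string with several of the encodings listed above.
sample = "bücher"  # arbitrary sample text, chosen only for illustration

encoded = {
    "utf-8": sample.encode("utf-8"),
    "punycode": sample.encode("punycode"),
    "gb18030": sample.encode("gb18030"),
    "utf-7": sample.encode("utf-7"),
}

# UTF-8 is an ASCII superset: pure ASCII text yields identical bytes.
ascii_is_subset = "plain".encode("utf-8") == "plain".encode("ascii")

# Every encoding can express the same code points, so each one round-trips.
round_trips = {name: data.decode(name) == sample for name, data in encoded.items()}

for name, data in encoded.items():
    print(f"{name:9} {len(data):2d} bytes  {data!r}")
```

The byte counts differ per encoding, which is the "size" trade-off mentioned above; the decoded text is the same in every case.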

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: Early adopters who embraced UCS-2 back when people thought 64k code points were enough moved to this encoding. Apart from orphaned surrogates, invalid UTF-8 or UTF-32 sequences cannot be represented in UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (strictly speaking, not even UTF-32 is).
  • UTF-32 (identical to UCS-4): This is the encoding with one code unit per code point. Because combining code points negate this only, questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
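The points above about endianness variants, surrogates, and combining code points can be checked with a short Python sketch. It assumes CPython's built-in codecs; the chosen characters (U+1F600 and "é") are only examples.

```python
import codecs
import unicodedata

emoji = "\U0001F600"  # a code point outside the 64k range of UCS-2

# The three variants: explicit big-endian, explicit little-endian,
# and "any-endian" where a BOM announces the byte order.
utf16_be = emoji.encode("utf-16-be")
utf16_le = emoji.encode("utf-16-le")
utf16_bom = emoji.encode("utf-16")  # BOM followed by native byte order

# UTF-16 is not fixed-width: this code point needs a surrogate pair,
# i.e. two 16-bit code units (4 bytes).
needs_surrogate_pair = len(utf16_be) == 4
starts_with_bom = utf16_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# UTF-32 uses one code unit per code point ...
utf32_be = emoji.encode("utf-32-be")

# ... but one perceived character may still span several code points,
# e.g. "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "\u00e9")
code_points = len(decomposed)
utf32_bytes = len(decomposed.encode("utf-32-be"))

# For ASCII-heavy text, UTF-8 is the most compact of the three.
text = "hello"
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)
```

Here the decomposed "é" occupies two UTF-32 code units, which is why a fixed-width encoding of code points does not give fixed-width user-perceived characters.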

857 questions
2
votes
1 answer

UTF-8 string to ordinal value: Java equivalent for Python output

I have the feeling this is most likely a duplicate, but I'm unable to find it. NOTE: My Python knowledge is very limited, so I'm not 100% sure how strings, bytes, and encodings are done in Python. My knowledge about encodings in general is also not…
Kevin Cruijssen
  • 9,153
  • 9
  • 61
  • 135
2
votes
1 answer

Extra character appearing in email subject £ in front of pound symbol

I am using a class system php file to send an HTML email from mysql database, an extra character appears in front of the £pound symbol in the subject title, but the main content of the email is fine. I have tried using a UTF charset for the…
Daniel
  • 33
  • 5
2
votes
1 answer

List of BOM characters

Is there a list of possible BOM characters that are used? So far I have encountered: \x00\x00\xfe\xff UTF-32, big-endian \xff\xfe\x00\x00 UTF-32, little-endian \xfe\xff UTF-16, big-endian \xff\xfe UTF-16,…
user10332687
2
votes
0 answers

How to truly count UTF-8 characters, and emoji's and special characters with different character lengths?

I just want to ask a really confusing question and get a really basic answer to how it all works, basically my problem is when I count character lengths in JavaScript and PHP for symbols and emoji's like ‍❤️‍‍ it comes up 11 characters instead of…
Lol Boi
  • 33
  • 8
2
votes
0 answers

Using UTF-8 in SQL Server 2016

We are installing a new application, the pre-requisite of which says that your database must be configured to use the UTF-8 character set. We are currently using SQL Server 2016, enterprise edition. Our database team mentioned to us that SQL Server…
Newbie
  • 713
  • 2
  • 10
  • 19
2
votes
1 answer

AWS RDS Oracle Standard edition seems to ignore NLS_LENGTH_SEMANTICS

Given the following table: SQL> DESC MM02.MMRZET01; Name Null? Type ----------------------------------------- -------- ---------------------------- LPT_ID NUMBER(19) COU_ISO_ID …
favoretti
  • 29,299
  • 4
  • 48
  • 61
2
votes
1 answer

Scanner.nextInt() NoSuchElementException

I got this code (sorry for the German inside): public void eingabewerte(){ int steuerInt; steuerInt=-1; Scanner myScanner = new Scanner(System.in); System.out.println("Bitte geben Sie die maximal Augenzahl des Wuerfels an…
2
votes
2 answers

boost locale incomplete type boundary_indexing

I am first converting a UTF-8 string to UTF-32 and then I want unique words to be mapped with their positions. I started with boost locale. #include #include #include #include #include…
Neel Basu
  • 12,638
  • 12
  • 82
  • 146
2
votes
1 answer

Saving Keras Model: UTF - 8 Error

I've built a convolutional neural network in keras that looks like this: model = Sequential() model.add(Convolution2D(nb_filters, nb_conv, nb_conv, border_mode='valid', …
Palash Shah
  • 43
  • 1
  • 5
2
votes
0 answers

mbstring functions - php 7.0 - conversion to utf-8

I am using php 7.0. To make the site fully utf-8 compatible, there are many steps we have to take as explained here. I have doubt about mbstring encoding. The following is the ideal mbstring settings, as I understand, to be placed at beginning of…
Kiran
  • 896
  • 1
  • 6
  • 25
2
votes
1 answer

Difference between combining acute accent and combining acute tone mark and how to normalize

So I have one application (let's call it the client) which uses strings with Diacritic/Accents. This application needs to make a request to another application (let's call it web service) using these strings with a diacritic. This other application…
dade
  • 3,340
  • 4
  • 32
  • 53
2
votes
0 answers

mysql data corrupted after changing encoding

I accidentally changed the encoding of a field from UTF-8 to macroman. After switching back to UTF-8, all Chinese characters were scrambled. Is there any chance I can reverse the process, or is the change permanent?
user3500286
  • 73
  • 1
  • 3
2
votes
1 answer

Printing Chinese Characters in C++

I've been trying to print Chinese Characters in C++. I've already searched around in the Internet, some said that you have to use wcout, others have suggested other methods. I've also stumbled on this post, where someone uses a piece of…
El3ctroGh0st
  • 61
  • 2
  • 5
2
votes
1 answer

String returns only numbers after separatedBy

I'm trying to separate a string like the following: let path = "/Users/user/Downloads/history.csv" do { let contents = try NSString(contentsOfFile: path, encoding: String.Encoding.utf8.rawValue ) let rows =…
Josch Hazard
  • 323
  • 3
  • 20
2
votes
1 answer

Encoding issue: how to let console print "ć" instead of "c"?

I am working with data from all possible European languages. R does not recognize special characters correctly, e.g. "ć" instead of "c". > "ć" [1] "c" I have come across this various times and found workarounds (read.csv, and other functions have…
Doctor G
  • 163
  • 9