http://www.python.org/dev/peps/pep-0100/

PEP 100 states that the internal format, Python Unicode, holds UTF-16 encoded data, but addresses the values as UCS-2 (or as UCS-4 when compiled with the `--enable-unicode=ucs4` flag).

Why was UCS-2 (a fixed-length format) chosen over UTF-16 (a variable-length format)?

Though the two encodings are largely the same, UTF-16 was already four years old when PEP 100 was published (March 2000). Was Python Unicode meant to address backwards-compatibility issues?

I'm really just curious as to why Python's internal format was implemented using this (seemingly) hybrid approach to storing encoded data.

A better way to ask my question might be: does anyone have a citation or link, with a quote from an official document, that specifically states why PEP 100 chose to treat UTF-16 as UCS-2 instead of using UTF-16 proper?
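
For concreteness, here is a minimal sketch, runnable on any modern Python 3, of where the two encodings agree and where they diverge:

```python
# UCS-2 and UTF-16 are byte-for-byte identical for BMP code points;
# they diverge only beyond U+FFFF, where UTF-16 needs a surrogate pair.

bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001f600"  # U+1F600, outside the BMP

# One 16-bit code unit -- exactly what UCS-2 would store.
print(bmp_char.encode("utf-16-be").hex())     # 00e9

# Two 16-bit code units (a surrogate pair) -- inexpressible in plain UCS-2.
print(astral_char.encode("utf-16-be").hex())  # d83dde00
```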

mkelley33
  • Better yet, why not use UTF-8 or UTF-32? – Keith Thompson Nov 05 '11 at 21:17
  • I would've liked to have seen UTF-8 too, but my guess is that UTF-8 was probably a little too bleeding edge at the time, since RFC 2279, http://www.ietf.org/rfc/rfc2279.txt, wasn't published until January 1998. I don't know much about UTF-32, but I suspect it wasn't chosen due to storage concerns. Nice comment :) – mkelley33 Nov 05 '11 at 21:31
  • Note: Working in character terms with length, indexing, and slicing is much more difficult and inefficient with UTF-8 than UTF-16. Using UTF-8 as an **internal** format (as opposed to **external** format) is **not** a Good Idea. – John Machin Nov 05 '11 at 22:26
  • @eryksun No. I'm asking why UCS-2 was chosen over UTF-16. Though I'd be curious to learn more as to "why it wasn't written to handle UTF-16 surrogate pairs properly". – mkelley33 Nov 05 '11 at 22:40
  • @JohnMachin Why is working in character terms with length, indexing, and slicing much more difficult and inefficient with UTF-8? – mkelley33 Nov 05 '11 at 22:43
  • @mkelley33: Because, starting from a known point, often the beginning, you need to step through the bytestring, at each iteration doing `next_byte_pos = current_byte_pos + length_table[bytestring[current_byte_pos]]` (see the sketch after this comment thread) – John Machin Nov 05 '11 at 23:20
  • UTF-16 has all the disadvantages of both UTF-8 and UTF-32 combined, yet partakes of none of the advantages of either of them. It’s a bastard of an encoding, and that’s putting it nicely. – tchrist Nov 05 '11 at 23:27
  • @John: And how do you think you step through UTF-16? You can’t do it a code unit at a time any more than you can do so with UTF-8. – tchrist Nov 05 '11 at 23:27
  • @Keith: Great question. You’ll notice that the languages that chose UTF-8 have vastly superior Unicode support compared with Python: Perl way back in 2000, and Go much more recently. UCS-2 was a really bad move, and this is necessary to even start to play the catch-up game. – tchrist Nov 05 '11 at 23:47
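
To make @JohnMachin's stepping loop above concrete, here is a small sketch of the linear scan that character indexing into UTF-8 requires (the helper names are mine, not from any library, and well-formed input is assumed):

```python
def utf8_seq_length(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence starting at lead_byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead_byte < 0xE0:
        return 2  # 110xxxxx
    if lead_byte < 0xF0:
        return 3  # 1110xxxx
    return 4      # 11110xxx

def utf8_byte_offset(data: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th character: O(n), not O(1)."""
    pos = 0
    for _ in range(char_index):
        pos += utf8_seq_length(data[pos])
    return pos

s = "héllo!".encode("utf-8")
print(utf8_byte_offset(s, 5))  # 6, because 'é' occupies two bytes
```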

1 Answer

Read on a little further: "UCS-2 and UTF-16 are the same for all currently defined Unicode character points" ... and that was true in the year 2000 when the PEP was written. The initial implementation covered only the BMP (the first 65,536 code points).
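
For illustration, the surrogate mechanism the initial implementation sidestepped looks like this (a sketch following the standard UTF-16 algorithm, not CPython's actual source):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)   # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)  # low (trail) surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
# On a narrow (UCS-2) build, such a character occupied two code units,
# so len() reported 2 for it.
```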

John Machin
  • I read that and understand that they were essentially the same insofar as code points were concerned, but why choose the older UCS-2 instead of the newer UTF-16 if they were both the same for all code points at the time of the writing? What was the advantage of the fixed-length format over the variable-length format? – mkelley33 Nov 05 '11 at 21:26
  • Fixed-width is just easier to process (see the indexing sketch after these comments). Also, Unicode is and has been a moving target; it makes sense to adopt Unicode features that have been around for a few years. – ObscureRobot Nov 05 '11 at 21:50
  • @mkelley: It was easier. No surrogates to worry about. – John Machin Nov 05 '11 at 22:29
  • @ObscureRobot: are you speculating that the reason was "fixed-width is just easier to process", or do you have a citation that I might be able to reference? I don't disagree with you, and I appreciate your comment, but I'd like to know if there exists some document or official explanation somewhere that corroborates it. Thank you! – mkelley33 Nov 05 '11 at 22:51
  • @JohnMachin I really appreciate you taking the time to explain so much in the answer and comments. Is there any chance you might be willing to include a citation in your answer? I trust that your answer is correct, but it would be nice to know that someone working on Python Unicode said "we knew it wouldn't be worth the effort to deal with the complexities of surrogates at the time PEP 100 was written, and so we chose the fixed-width UCS-2 that had been in use for several years (> 4) throughout the industry". – mkelley33 Nov 05 '11 at 23:31
  • @John: That is untrue. The surrogate mechanism was invented in 1996 for Unicode 2.0, so by 2000 it was well-understood. There really was no excuse for choosing a dead-end internal representation at that point. Neither UTF-8 nor UTF-32 even *have* surrogates, and even with UTF-16 you can account for them if you are careful. Instead they made the double mistake of choosing a dead-end encoding, and revealing the internal representation to the world. The second is just as stupid as Java, but the first is even worse. – tchrist Nov 05 '11 at 23:34
  • 1
    @tchrist my intention here wasn't to discuss or critique the merits of the implementation. I agree with your first two sentences as well as your other statements about surrogates, but your attack on Java and Python doesn't contribute anything useful to answering my question. The negativity in your comments also have the effect of derailing the credibility of the things you said that actually might be true. Too bad. – mkelley33 Nov 05 '11 at 23:49
  • @mkelley33: If you are looking for a citation, I suggest that you rummage through the archives of the python-dev mailing list. – John Machin Nov 05 '11 at 23:55
  • @mkelley33: I’ve just spent almost two years dealing with hundreds of thousands of lines of broken Java code and tens of thousands of lines of broken Python code, that are all broken in the same way: they think Unicode means UCS-2. That means they cannot correctly handle the text corpora they were written for. For example, the massive PubMed Open Access collection is in XML, and so of course does not limit code points to the BMP. All that code breaks on that corpus and many others. Because you can’t guarantee a wide build, we use Perl for new code, not Python. The Java is a way bigger problem. – tchrist Nov 05 '11 at 23:58
  • 1
    @JohnMachin thanks for the tip. I'll take a look and see if I'm able to bring back any useful information to add here. – mkelley33 Nov 06 '11 at 00:01
  • 2
    @tchrist: perhaps got burned by a bad coder(s) that used Python and/or Java in the wrong way to deal with XML. Both Python and Java **can and do handle** the full range of code points outside of the BMP. Many Linux systems come with Python prepared to do so. I've compiled Python on Mac OS X to deal with code points outside the BMP. Please stop trolling. – mkelley33 Nov 06 '11 at 00:10
  • @mkelley33: Yes, I know they can, if you’re careful. I have a wide build. The problem is that the people who wrote the code didn’t understand that Unicode isn’t UCS-2, and so you can’t fix it with a magic bullet. You have to rewrite their code. It is really frustrating. – tchrist Nov 06 '11 at 00:48
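
As promised above, a sketch of why fixed-width is "just easier": with UCS-2 every character is one 16-bit unit, so indexing is plain arithmetic, whereas UTF-16 indexing must scan for surrogate pairs. The helper names are mine; this is an illustration, not how CPython implements it.

```python
import struct

def ucs2_code_point_at(buf: bytes, i: int) -> int:
    """Fixed width: character i sits at byte offset 2*i -- O(1)."""
    return struct.unpack_from(">H", buf, 2 * i)[0]

def utf16_code_point_at(buf: bytes, i: int) -> int:
    """Variable width: must walk past surrogate pairs -- O(n)."""
    pos = 0
    for _ in range(i):
        unit = struct.unpack_from(">H", buf, pos)[0]
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2
    unit = struct.unpack_from(">H", buf, pos)[0]
    if 0xD800 <= unit <= 0xDBFF:  # combine a surrogate pair
        low = struct.unpack_from(">H", buf, pos + 2)[0]
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit

data = "ab\U0001f600c".encode("utf-16-be")
print(hex(utf16_code_point_at(data, 2)))  # 0x1f600
print(hex(utf16_code_point_at(data, 3)))  # 0x63 ('c')
```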