
The Python 2.x documentation says:

Unicode string is a sequence of code points

Unicode strings are expressed as instances of the unicode type

>>> ThisisNotUnicodeString = 'a정정💛' # What is the memory representation?
>>> ThisisNotUnicodeString
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> type(ThisisNotUnicodeString)
<type 'str'>
>>> a = u'a정정💛' # Which encoding technique used to represent in memory? utf-8?
>>> a
u'a\uc815\uc815\U0001f49b'
>>> type(a)
<type 'unicode'>
>>> b = unicode('a정정💛', 'utf-8')
>>> b
u'a\uc815\uc815\U0001f49b'
>>> c = unicode('a정정💛', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
>>> 

Question:

1) ThisisNotUnicodeString is a string literal. Given that ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding used to represent the 정 or 💛 characters in memory.

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

3) Why can't c be represented in memory using the UTF-16 encoding?

overexchange
  • What do you mean by "memory representation"? – Maroun Jun 04 '17 at 06:31
  • This might be saner not typed into some console but in a source file with a specified encoding which you then use. – pvg Jun 04 '17 at 07:01
  • `a = u'a정정💛'` is decoded from whatever the terminal encoding is. See `sys.stdin.encoding`. We know the terminal encoding is UTF-8 because subsequently `b = unicode('a정정💛', 'utf-8')` succeeds. `c = unicode('a정정💛', 'utf-16')` thus fails for the obvious reason that a UTF-8 byte string can't be decoded as UTF-16. The two encodings are nothing alike. – Eryk Sun Jun 04 '17 at 07:04
  • The internal format for `unicode` depends on the build. Python 2 on Windows and some Unix systems uses a narrow build that's internally something like UTF-16, but broken for non-BMP strings because it counts a surrogate pair as two characters in the string length. Most Unix systems use a wide build, which stores each Unicode ordinal as a 4-byte integer. – Eryk Sun Jun 04 '17 at 07:09
  • @eryksun it's never UTF-16. UCS-2 or UCS-4. – pvg Jun 04 '17 at 07:10
  • @pvg, I said a narrow build uses *something* like UTF-16. It does encode non-BMP characters as surrogate pairs. UCS-2, in contrast, has no support for non-BMP characters. – Eryk Sun Jun 04 '17 at 07:12
  • @eryksun "*a UTF-8 byte string can't be decoded as UTF-16*" - Can you please explain more about this? For example, "a" can be represented in both schemes.. so I don't get this part of your comment. – Maroun Jun 04 '17 at 07:14
  • @MarounMaroun, "a" in UTF-16LE is `b'a\x00'`. You can't decode `b'a'` as UTF-16; it's only 1 byte. – Eryk Sun Jun 04 '17 at 07:16
  • @MarounMaroun, do you have a problem with saying "it can't be decoded" instead of more precisely saying it can't be correctly decoded? If it's an even number of bytes it can usually be decoded, but it'll probably result in mostly CJK characters. – Eryk Sun Jun 04 '17 at 07:20
  • @overexchange, in Python 2, `str` is a byte string, and internally in CPython it's stored as a C `char *` null-terminated string (i.e. an array of bytes terminated by `\0`), but it's also a counted Pascal-style string, which allows embedded nulls. An `str` object doesn't necessarily have to be any character-set encoding (e.g. UTF-8 or Windows 1252). It could be binary data as far as Python is concerned. However, you're working in the interactive REPL, and `ThisisNotUnicodeString` was read in from `stdin`, so we know it's in your terminal's encoding, which we know is UTF-8. – Eryk Sun Jun 04 '17 at 07:42
  • @eryksun What if I write this in .py file? – overexchange Jun 04 '17 at 07:43
  • @overexchange, ASCII is the default encoding used to decode source files in Python 2. If you've saved the file using a different encoding, such as as UTF-8, you need to declare it in a coding spec at the top of the file, e.g. `# coding=utf-8`. See [PEP 263](https://www.python.org/dev/peps/pep-0263). – Eryk Sun Jun 04 '17 at 07:46
  • @eryksun Trust me, I saved file using `gedit` with `utf-8` encoding, but I get `'str' object has no attribute 'decode` error – overexchange Jun 04 '17 at 08:05
  • @overexchange, it seems you're running the script in Python 3, in which `str` is a Unicode string, and byte strings are implemented as `bytes`. – Eryk Sun Jun 04 '17 at 08:07
  • @eryksun u are right. I was running in python3, so in python3, `decode()` can be applied to `bytes` type object but not `str` type object – overexchange Jun 04 '17 at 08:11
  • Python 3 supports `u''` literals to aid with writing libraries that have to support Python 2 and 3 from a single source code base, but it's not necessary. `a = 'a정정'` defines a Unicode string in Python 3. Also, the default source encoding is UTF-8 in Python 3, so `# coding=utf-8` is also optional. – Eryk Sun Jun 04 '17 at 08:12
  • Also, as you've noticed, unlike Python 2, you can't `encode` bytes or `decode` a Unicode string in Python 3, as that was a common source of confusion in Python 2. bytes-to-bytes decoding and str-str encoding operations have to explicitly use `codecs.decode` and `codecs.encode`. – Eryk Sun Jun 04 '17 at 08:18
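The UTF-16 point made in the comments above can be checked directly. A minimal sketch (written in Python 3 spelling so it runs on current interpreters; the codec behaviour being demonstrated is the same one the comments describe):

```python
# Sketch (Python 3) of the UTF-16 point from the comments above:
# "a" encoded as UTF-16-LE occupies two bytes, so the single byte
# b'a' cannot be a complete UTF-16 code unit.
assert 'a'.encode('utf-16-le') == b'a\x00'

# Decoding the lone byte b'a' as UTF-16-LE fails ("truncated data").
failed = False
try:
    b'a'.decode('utf-16-le')
except UnicodeDecodeError:
    failed = True
assert failed
```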

2 Answers


1) ThisisNotUnicodeString is a string literal. Given that ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding used to represent the 정 or 💛 characters in memory.

In the interactive prompt, the encoding used for Python 2.X's str type depends on your terminal's encoding. For example, if you run a terminal under a Linux system with the terminal's encoding set to UTF-8:

>>> s = "a정정"
>>> s
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b' 

Now change the encoding of your terminal window to something else; in this case I've changed the terminal's encoding from UTF-8 to WINDOWS-1250:

 >>> s = "a???"

If you try this with a tty session you get a diamonds instead of ? at least under Ubuntu you may get different characters.

As you can see, the encoding of an str entered at the interactive prompt is terminal-dependent. This applies only to code run interactively under the Python interpreter; a source file containing non-ASCII bytes without an encoding declaration raises an exception instead:

#main.py
s = "a정정"

Running this code raises a SyntaxError:

$ python main.py
SyntaxError: Non-ASCII character '\xec' in file main.py...

This is because Python 2.X uses ASCII by default:

>>> sys.getdefaultencoding()
'ascii'

Then you have to specify the encoding explicitly in your code, with a PEP 263 coding declaration at the top of the file:

#main.py
# -*- coding: utf-8 -*-
s = "a정정💛"

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

Keep in mind that the encoding scheme can differ if you run your code under different terminals. I have tested this under Linux; it could be slightly different on Windows, so check your operating system's documentation.

To know the number of bytes occupied, use len:

>>> s = "a정정"
>>> len(s)
11

s occupies exactly 11 bytes.
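The byte count can be broken down per character. A quick cross-check (Python 3 spelling, where the UTF-8 bytes are obtained with an explicit encode; Python 2's str gives you these bytes directly):

```python
text = u'a\uc815\uc815\U0001f49b'        # a, 정, 정, 💛
data = text.encode('utf-8')

# UTF-8 widths: 'a' takes 1 byte, each Hangul syllable 3, the emoji 4.
assert [len(c.encode('utf-8')) for c in text] == [1, 3, 3, 4]
assert len(data) == 11   # bytes, which is what len() reports for a Python 2 str
assert len(text) == 4    # code points, which is what len() reports for unicode
```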

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

Actually, this is a common confusion: the unicode type has no encoding. It is simply a sequence of Unicode code points (e.g. U+0040, COMMERCIAL AT).
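That a unicode string is just a sequence of code points can be made visible with ord(). A quick sketch (Python 3 spelling; the same ord() calls work on a Python 2 unicode object):

```python
# Each element of a unicode string is a code point, not an encoded byte.
text = u'a\uc815\uc815\U0001f49b'
assert [hex(ord(c)) for c in text] == ['0x61', '0xc815', '0xc815', '0x1f49b']
```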

3) Why can't c be represented in memory using the UTF-16 encoding?

UTF-8 and UTF-16 are different encoding schemes: they map the same characters to different byte sequences. Here:

>>> c = unicode('a정정💛', 'utf-16')

You're essentially doing this:

>>> "a정정"
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> unicode('a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b', 'utf-16')
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data

This is because you're trying to decode UTF-8 bytes with the UTF-16 codec. The two encodings use different numbers of bytes to represent the same characters; they are simply two different ways of mapping characters to bytes.
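The same failure reproduces on modern interpreters. A sketch in Python 3 (where the Python 2 str/unicode pair becomes bytes/str):

```python
# 11 bytes of UTF-8 for a, 정, 정, 💛
data = u'a\uc815\uc815\U0001f49b'.encode('utf-8')
assert len(data) == 11       # an odd byte count already rules out UTF-16

failed = False
try:
    data.decode('utf-16')    # wrong codec for these bytes
except UnicodeDecodeError:
    failed = True
assert failed
```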

For your reference: Python str vs unicode types

GIZ
  • Regarding the question of internal memory representation of Unicode, many programming environments do use UTF-8 and UTF-16 internally. It's not necessarily a sign of confusion to ask that question. There are trade-offs in time and space for various representations, depending on the distribution of characters in a string. Internally, Python 3 tries to balance this by using a hybrid of UCS-1, UCS-2, and UCS-4 depending on the max ordinal value in each string -- and caches UTF-8 and UTF-16 encodings depending on API requests. – Eryk Sun Jun 04 '17 at 15:32
  • "shell-dependent" should be "console- or terminal-dependent". The shell is just another program using the console or terminal. Windows users are typically unclear about this. Many assume incorrectly that cmd.exe is the console. Generally Unix users have a clearer understanding. – Eryk Sun Jun 04 '17 at 15:39
  • @eryksun Without seeing above answer, let me say, There is less clarity on the reason to failure, in my third question(above). My understanding of the error in third question is, in python2, If I say, `c = unicode('a정정', 'utf-16')`, then, `a정정`is first stored in memory by encoding with `sys.stdin.encoding`(which is utf-8 in my case) and then immediately verify decoding with `utf-16` that gives above error. Is that correct? – overexchange Jun 04 '17 at 16:32
  • @overexchange, the literal gets read as a sequence of already encoded bytes from stdin or a source file. For example, for UTF-8, the decimal values of the bytes are the following list: `[39, 97, 236, 160, 149, 236, 160, 149, 240, 159, 146, 155, 39]`, where 39 is the ordinal for a single quote. It's not a `u''` literal, so the compiler creates an `str` object with this sequence of bytes (without quotes). This `str` object is passed to the `unicode` constructor, which is told to decode it as UTF-16, which is the wrong encoding and in this case fails since it's not an even number of bytes. – Eryk Sun Jun 04 '17 at 16:40
  • @eryksun Now, my question is, if I say `# coding = utf-8` in `abc.py` and save that file(`abc.py`) using utf-8 source code encoding, then, Is it not enough to use `s="a정정"` in my code instead of saying `s = unicode("a정정", 'utf-8')` or `s = u'a정정'`? Assume am happy with `utf-8` codec for my application. `>>> s = 'a정정' >>> s.decode('utf-8')` works fine – overexchange Jun 04 '17 at 16:49
  • @overexchange, normally in Python 2 `s = "a정정"` is an `str` literal. You haven't asked the compiler to decode the bytes as a `unicode` object using the declared source-file encoding. For that you need a `u''` literal, unless you're using `from __future__ import unicode_literals`. – Eryk Sun Jun 04 '17 at 16:54
  • @overexchange If you need a sequence of bytes rather than actual `unicode` strings, you'll technically be using `str` as `str` of 2.X is the equivalent of `bytes` in Python 3.X. The correct way to represent Unicode strings is to use the `unicode` type. `s = "a정정"` will be encoded by the encoding scheme that you specified with `#*-*coding:encoding_name*-*`, then you should prefix he string with `u` to get `unicode` objects: `s = u'a정정'` this gives you a sequences of Unicode code points, really characters, instead of encoded bytes. – GIZ Jun 04 '17 at 17:00
  • @eryksun In the case of, `s=u'a정정'`, 'a정정' gets read as a sequence of already encoded bytes from stdin(say utf-8) or a source file and gets passed to `unicode()` constructuor, which is told to decode it with utf-8. Is that correct? So, `s=u'a정정'` is similar to saying `s=unicode('a정정', 'utf-8')`. Is that correct? – overexchange Jun 04 '17 at 17:05
  • @overexchange if you try `>>> u'a정정'` you should get : `u'a\uc815\uc815\U0001f49b`' as you can see there's no encoding, but we get a sequence of Unicode characters prefixed with `\u` or `\U`. And if you try `>>> 'a정정'` you'll get `'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'` assuming your terminal\console is UTF-8 by default. I think the encoding of cmd in Windows is set according to your local settings/language or something like that. – GIZ Jun 04 '17 at 17:20
  • @overexchange, it's similar assuming `sys.stdin.encoding` or the declared source-file encoding is UTF-8. But it's not completely the same because `s = unicode('a정정', 'utf-8')` has the compiler create a temporary `str` object. There's no intermediate `str` object for `u'a\uc815\uc815\U0001f49b'`. – Eryk Sun Jun 04 '17 at 17:29
  • @direprobs, the Windows console (conhost.exe) defaults to system's OEM encoding (e.g. codepage 437 in the U.S. or 850 in Western Europe) for its bytes API (e.g. `ReadConsoleA`, `ReadFile`). Python uses the console API directly; the cmd shell isn't involved. cmd itself actually uses the console's wide-character API (e.g. `ReadConsoleW`), which is UTF-16. Windows Python 3.6+ also uses the console's wide-character API, but the new `_WindowsConsoleIO` class transcodes between UTF-16 and UTF-8, so to Python programmers it looks like the encoding is UTF-8, which is easier to work with. – Eryk Sun Jun 04 '17 at 17:42
  • @eryksun So, If my python project has `.py` files saved with source code encoding(utf-8) and has `# coding = utf-8` in every python file,**then**, do I need, `unicode()` type object or `u'UnicodeLiteral'` in my python project for unicode support? Such string literals(`s='a정정'`) should be able to handle unicode support and should be able to read unicode range data coming from network as string literals. Am I correct? – overexchange Jun 04 '17 at 21:30
  • @overexchange, once your text data is all `unicode`, you don't have to worry about mixing up differently encoded data. It's simpler, and you also get access to Python's Unicode character database for string methods, regular expressions, and the `unicodedata` module. – Eryk Sun Jun 04 '17 at 22:05
  • @eryksun So, my understanding of your [comment](https://stackoverflow.com/questions/44351350/string-literal-vs-unicode-literal-vs-unicode-type-object-memory-representation#comment75720352_44352644) is, you would agree to be ok to practice creating python projects, the way, I mentioned [here](https://stackoverflow.com/questions/44351350/string-literal-vs-unicode-literal-vs-unicode-type-object-memory-representation#comment75719808_44352644), rather than having ascii files(.py) in python project. Unless, there are any implications to consider for long/short term – overexchange Jun 05 '17 at 00:10
  • @overexchange, try to do all text processing with Unicode instead of byte strings. Use `u''` string literals and decode byte strings as `unicode` ASAP, which includes decoding UTF-8. When writing to files and on the wire, prefer encoding text as UTF-8. For working with text files, you can use `io.open`, which was backported to Python 2, e.g. `io.open(u'spam.txt', 'w', encoding='utf-8')`; it requires writing `unicode` strings, which will be encoded using the file `encoding`. Note the use of a `unicode` filename. This is required in general to support non-ANSI filenames on Windows. – Eryk Sun Jun 05 '17 at 00:11
  • @direprobs **1)** Hey, answer to first question is, `ThisisNotUnicodeString` will be stored in memory in ascii encoding scheme, irrespective of terminal encoding scheme(given). Isn't it? How you display `ThisisNotUnicodeString` on stdout by changing terminal encoding scheme is a different thing. **2)** Your second answer is incomplete. [Comment](https://stackoverflow.com/questions/44351350/string-literal-vs-unicode-literal-vs-unicode-type-object-memory-representation#comment75705286_44351350) says, How unicode literals get stored in memory? depends on terminal encoding scheme. – overexchange Jun 06 '17 at 21:38
  • @overexchange make a distinction between two things: `str` which is a string of bytes and `unicode` which is a string of Unicode characters. For answer number 1, because `ThisisNotUnicodeString` is `str` then it's stored as bytes integers between (0..255) inclusive. For the second answer: `unicode` types doesn't have encoding. When you say a string is encoded, what is encoding? _"The rules for translating a Unicode string into a sequence of bytes are called an encoding"_. – GIZ Jun 06 '17 at 21:49
  • @overexchange regarding the internal memory representation of Unicode strings in Python, please refer to this question which has an excellent answer: [How is unicode represented internally in Python?](https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python). when you print this variable `ThisisNotUnicodeString` you see the `a` this is because `str` prints as ASCII only when possible, but it's itself not ASCII. It's just a sequence of bytes. If you're familiar with Java for example or similar languages, you can think of `str` in 2.X as a byte array. – GIZ Jun 06 '17 at 21:51
  • @direprobs **1)** If unicode literal(`a`) is not encoded, then, how would you store unicode literal in memory? Because there should be a corresponding decoder used to retrieve that data back for display/whatever. **2)** What does it mean to say unicode literal is string of unicode code points? I know about code points, but it mean nothing to me, here. It is just the way you display using `print repr()` and nothing more than that. What is more important for me, is to understand memory representation, otherwise I can't play around with the data – overexchange Jun 06 '17 at 21:55
  • @overexchange In terms of the internal representation for Unicode strings, this is an internal Python issue, which is handled by Python. For a Python programmer, it's just Unicode. Ok, if you're really curious and want to dig deeper into memory representation of strings you can read the answers of the question which I provided in my previous comment. Also refer to https://www.python.org/, there's a great deal of information there. You can also read Python source code which is also available at github. – GIZ Jun 06 '17 at 22:04
  • @direprobs Python documentation does not say that, unicode literal will be stored using utf-8 encoding scheme because terminal encoding scheme is utf-8. string literal is stored using ascii encoding scheme in memory, irrespective of terminal encoding scheme. These are the things that I was seeking from this thread – overexchange Jun 06 '17 at 22:07
  • @overexchange Yes, you're right the terminal window has nothing to do with the internal representation of Unicode strings. What if there's no terminal? Just think of `str` as being 8-bit text which is essentially a sequence of bytes and in my answer I meant when you paste Unicode characters inside `"..."` interactively in terminal without `u"..." `then because `"..."` is `str` your Unicode characters get encoded according to the terminal's encoding scheme. This has nothing to do with memory and applies only when you're working interactively, not when running scripts. This is merely a feature. – GIZ Jun 06 '17 at 22:15

Which encoding is used to represent it in memory? UTF-8?

You can try the following:

ThisisNotUnicodeString.decode('utf-8')

If decoding succeeds, the bytes are valid UTF-8; if it raises a UnicodeDecodeError, they are not.
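In other words, this is a trial decode. A sketch of wrapping it in a helper (Python 3 spelling, where byte strings are bytes; the helper name is illustrative, not a library API):

```python
def is_valid_utf8(data):
    """Trial decode: return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

assert is_valid_utf8(b'a\xec\xa0\x95')     # UTF-8 for 'a정'
assert not is_valid_utf8(b'\xff\xfe\x00')  # 0xff can never appear in UTF-8
```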

If you want the UTF-16 representation of the string, you must first decode it, then encode the result with the UTF-16 scheme:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16')

So basically, you can decode and encode the given string from/to UTF-8/UTF-16, because all characters can be represented in both schemes.

ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
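The chain above round-trips because encoding and decoding with the same codec are inverses. A compact check (Python 3 spelling of the same chain):

```python
text = u'a\uc815\uc815\U0001f49b'
utf8 = text.encode('utf-8')

# UTF-8 bytes -> text -> UTF-16 bytes -> text -> UTF-8 bytes again
round_trip = utf8.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
assert round_trip == utf8
```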
Maroun
  • I have a catch here. If I encode using utf-8 then I can decode using latin-1. Do you think it is not possible? Based on the code points that you pick. Because utf-8 is backward compatible with latin-1 & ascii & cp-1252 & ... – overexchange Jun 08 '17 at 01:17