I've got a list of strings, along the lines of list=[a,b,c,d,e].

When I call list[2], the string c is displayed as ASCII; when I call print list[2], however, it's displayed as unicode. Why does this discrepancy exist?

scrollex
  • For similar reasons to why `"123"` displays differently than `print "123"`. – Scott Hunter Feb 09 '16 at 17:47
  • Could you show an *unedited* transcript of the phenomenon, please? We don't know what you mean by "calling" - neither strings nor `print` statements are "callable" in Python jargon - and we also don't know what you mean by "ascii" and "unicode". – zwol Feb 09 '16 at 17:47

2 Answers

This is mainly because strings in Python 2 are not text strings but byte strings.

I suppose you are in a REPL environment (a Python console). When you evaluate an expression in the console, it displays the value's representation, which is the same as calling print repr() on the expression:

l = ['ñ']
l[0] # should output '\xc3\xb1'
print repr(l[0]) # should output the same

This is because your console is in UTF-8 mode, so when you type ñ you are actually entering two bytes, 0xc3 and 0xb1. (If you get a different representation for ñ, it is because your console uses some other encoding.)

repr() is a built-in Python function that always returns a string. For primitive types, this string is valid source code that rebuilds the value passed as the argument. In this case it returns a string containing the escape sequences that recreate another string, the one with ñ encoded as UTF-8. To see this:

repr(l[0]) # should print a string within a string: "'\\xc3\\xb1'"

So when you print that string (which is what the console does when you just evaluate the expression), you get the same text without the outer quotes and with the doubled backslashes collapsed into single ones. That is:

print repr(l[0]) # should output '\xc3\xb1'

But when you print the value itself, i.e. print l[0], you send those two bytes to the console. As the console is in UTF-8 mode, it decodes the sequence and translates it into a single character: ñ. So:

print l[0] # should output ñ
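The claim that repr() returns source text that rebuilds the value can be checked directly. This is only a small sketch, written so that it runs on both Python 2.7 and Python 3; the names value, source and rebuilt are made up for the illustration, and the bytes are spelled with escapes so the result does not depend on your console's encoding.

```python
from __future__ import print_function

# The two UTF-8 bytes for the character n-with-tilde, written as escapes.
value = b'\xc3\xb1'

# repr() gives back a string of Python source text describing the value.
source = repr(value)

# Evaluating that source text yields a value equal to the original.
rebuilt = eval(source)

# What the REPL would show for `value`. Python 3 adds a b prefix here,
# Python 2 does not, but in both cases it is escapes rather than a glyph.
print(source)
print(rebuilt == value)
```

eval() is used here purely to demonstrate the round trip; it is not something to run on untrusted input.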

If you want to store text strings, you must put the prefix u before the string literal, like this:

text = u'ñ'

Now, when evaluating text you will see its Unicode codepoint:

text # should output u'\xf1'

And printing it should recreate the ñ glyph:

print text # should output `ñ`

If you want to convert text into a byte string representation, you need an encoding scheme (such as UTF-8):

text.encode('utf-8') == l[0] # should output True

Similarly, if you want the Unicode representation for l[0], you'll need to decode those bytes:

l[0].decode('utf-8') == text # should output True
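To make the role of the encoding concrete, here is a rough sketch (it should run on Python 2.7 and Python 3.3+) that encodes the same text with two different encodings. The variable names are illustrative only, and Latin-1 is chosen just as an example of "some other" encoding a console might use.

```python
from __future__ import print_function

# U+00F1 (n-with-tilde) written as an escape, so the file encoding is irrelevant.
text = u'\xf1'

utf8_bytes = text.encode('utf-8')      # two bytes: 0xc3 0xb1
latin1_bytes = text.encode('latin-1')  # one byte:  0xf1

print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding gets the original text back.
print(utf8_bytes.decode('utf-8') == text)
print(latin1_bytes.decode('latin-1') == text)

# Decoding with the wrong encoding does not fail here, but it yields two
# unrelated characters (U+00C3 and U+00B1) instead of the original one.
wrong = utf8_bytes.decode('latin-1')
print(wrong == text)
```

This is why the same key press can show up as different byte sequences on differently configured consoles.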

All this said, note that in Python 3 the default strings are Unicode strings, and you need to prefix the literal with b to produce byte strings.
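For comparison, a minimal Python 3 sketch of the same idea follows. It is not Python 2 code; the names text and data are placeholders for the illustration.

```python
# Python 3 only: a plain literal is text (str), a b-prefixed literal is bytes.
text = '\xf1'                 # one code point, U+00F1
data = text.encode('utf-8')   # the two bytes 0xc3 0xb1

print(type(text).__name__)    # str
print(type(data).__name__)    # bytes

# Text and bytes are distinct types and never compare equal.
print(text == data)           # False

# The byte literal spelled out by hand is equal to the encoded text.
print(data == b'\xc3\xb1')    # True
```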

Salva

It's because those two ways of displaying a string use different routes to get to the final result. x by itself in the REPL will invoke repr(x) and display that, but print(x) will invoke str(x) and display that instead. Classes are allowed to define __repr__ and __str__ separately, so they don't always return the same value.

>>> x = u"a"
>>> x
u'a'
>>> print x
a
>>> repr(x)
"u'a'"
>>> str(x)
'a'
>>>
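
As an illustration of a class defining the two methods separately, here is a small sketch that runs on Python 2.7 and Python 3. The Temperature class is invented for the example and does not come from any library.

```python
from __future__ import print_function


class Temperature(object):
    """Hypothetical example class with distinct repr and str forms."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __repr__(self):
        # Unambiguous form, shown when the REPL evaluates the object.
        return 'Temperature(%r)' % (self.degrees,)

    def __str__(self):
        # Readable form, used by print and str().
        return '%s degrees' % (self.degrees,)


t = Temperature(21)
print(repr(t))  # Temperature(21)
print(str(t))   # 21 degrees
print(t)        # 21 degrees, because print goes through str()
```

Typing t on its own at the console would show Temperature(21), matching the repr(x) route described above.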
Kevin