How can I convert a string type with existing unicode characters?

Question

Using Python 2.7, I have an endpoint that returns strings containing the characters '\u2019', '\u2018', and '\u2026'. I haven't been able to resolve these with any combination of encoding and decoding.

The actual strings are something like the following: "\u2018Ralph Breaks the Internet\u2019 and \u2018Creed II\u2019 Are Thanksgiving Hits"

Here is a code snippet

#!/usr/bin/python
# -*- coding: utf-8 -*-
...
>>> '\u2019'.encode('ascii')
'\\u2019'
>>> '\u2019'.encode('utf-8')
'\\u2019'
>>> '\u2019'.decode('utf-8')
u'\\u2019'
>>> '\u2019'.decode('ascii')
u'\\u2019'

I am running this from the command line, but I have also tried writing the output to files, to no avail. There are many similar threads on these types of issues, but I haven't found one that works for this case. I think I could do some sort of regex character lookup and substitution, but that seems clunky.

It’s not clear if you have a Unicode string with the single character represented by the escape code `u'\u2018'` or a byte string with the six-character text `'\u2018'`. The former you `print` as explained in my answer. The latter you `.decode('unicode-escape')`. — Mark Tolonen, Nov 27 '18 at 17:01
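To make the two cases in this comment concrete, here is a minimal sketch. It assumes the endpoint delivers a byte string that is pure ASCII apart from the literal backslash-u sequences; the sample value is made up for illustration.

```python
# Case 1: a byte string holding the six literal characters \ u 2 0 1 8
# (what '\u2018' without a u prefix means on Python 2).
raw = b'\\u2018Creed II\\u2019'

# Case 2: a Unicode string in which each escape is one real character.
expected = u'\u2018Creed II\u2019'

# The 'unicode-escape' codec interprets the backslash sequences in the bytes.
decoded = raw.decode('unicode-escape')

print(len(raw))             # length counts every backslash, 'u' and digit
print(len(decoded))         # length counts each quote mark as one character
print(decoded == expected)
```

Note that `unicode-escape` treats any non-ASCII bytes as Latin-1, so it is only a safe choice when the input contains no raw UTF-8 data.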

score 1 · Answer 1 · answered Nov 26 '18 at 01:12

Have you checked this thread: "Removing \u2018 and \u2019 character"?

These are the Unicode code points for the left and right single quotation marks.

print u"\u2018Ralph Breaks the Internet\u2019 and \u2018Creed II\u2019 Are Thanksgiving Hits"

displays:
‘Ralph Breaks the Internet’ and ‘Creed II’ Are Thanksgiving Hits

Hope this helps.

score 0 · Answer 2 · edilio

I have upvoted @Ying Cai's answer, but here are some hints. If you add `from __future__ import unicode_literals` on Python 2.7, every string literal in that file is treated as Unicode, as in Python 3.x. Alternatively, without that import, prefix the literal with u, as in u"\u2018Ralph Breaks the Internet\u2019 and \u2018Creed II\u2019 Are Thanksgiving Hits"; the string is then Unicode and should work as you expected.

@Mark I just updated my answer because I was really thinking of `from __future__ import unicode_literals` instead of `# -*- coding: utf-8 -*-`. Thanks for your comment.

edited Nov 27 '18 at 00:10

answered Nov 26 '18 at 01:27

edilio


Adding `#coding` only declares the encoding of the source file. For these examples with only ASCII characters this has no effect. – Mark Tolonen Nov 26 '18 at 21:51
`unicode_literals` only affects string literals typed in source code. It doesn’t affect how the data was received. – Mark Tolonen Nov 27 '18 at 07:57

score 0 · Answer 3 · answered Nov 26 '18 at 21:56

You need 3 things to print non-ASCII characters on Python 2.

1. Use `print`.
2. The terminal encoding must support the characters.
3. The font must support the characters.

Example (Windows console using code page 437):

C:\>py -2
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u2018Ralph\u2019'     # Didn't use `print`
u'\u2018Ralph\u2019'
>>> print u'\u2018Ralph\u2019'  # cp437 doesn't support these characters.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2018' in position 0: character maps to <undefined>
>>> ^Z

Changing code page to one that supports the characters:

C:\>chcp 1252
Active code page: 1252

C:\>py -2
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'\u2018Ralph\u2019'
‘Ralph’
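The code page difference shown above can also be checked from Python before printing. This is a minimal sketch; `can_encode` is a helper name invented here, not part of the standard library.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


title = u'\u2018Ralph\u2019'

# The encoding may be missing or None (for example when output is piped on
# Python 2), so fall back to ASCII in that case.
terminal_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

print(can_encode(title, 'cp437'))            # the code page that failed above
print(can_encode(title, 'cp1252'))           # the code page that worked above
print(can_encode(title, terminal_encoding))  # whatever this terminal reports
```

A check like this only covers the encoding; whether the glyphs actually display still depends on the font.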

Note that Python 3.6 and later work differently on Windows: the console is written through Unicode APIs, so the code page doesn't matter (but the font still does):

C:\>py -3
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2018Ralph\u2019'
'‘Ralph’'
>>> print('\u2018Ralph\u2019')
‘Ralph’
>>> print(ascii('\u2018Ralph\u2019'))  # Old behavior to see escape codes.
'\u2018Ralph\u2019'
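For a view of the escape codes that behaves the same on Python 2.7 and Python 3, one option (a sketch, not the only way) is to encode to ASCII with the `backslashreplace` error handler:

```python
text = u'\u2018Ralph\u2019'

# Characters outside ASCII are replaced by their \uXXXX escape sequences.
escaped = text.encode('ascii', 'backslashreplace')

# The result is plain ASCII, so it prints on any terminal.
print(escaped.decode('ascii'))

# Decoding with 'unicode-escape' restores the original string.
print(escaped.decode('unicode-escape') == text)
```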
