
I have a fairly large python code base to go through. It's got an issue where some string literals are strings and others are unicode. And this causes bugs. I am trying to convert everything to unicode. I was wondering if there is a tool that can convert all literals to unicode. I.e. if it found something like this:

print "result code %d" % result['code']

to:

print u"result code %d" % result[u'code']

If it helps, I use PyCharm (in case there is an extension that does this), but I would be happy to use a command line tool as well. Hopefully such a tool exists.

hippietrail
mmopy
  • Why not `u"result code %d"` as well? – unutbu Mar 16 '13 at 14:10
  • you could always use Python 3 :) – MattDMo Mar 16 '13 at 14:12
  • @unutbu You are absolutely right. I edited the question to include that. Silly me. – mmopy Mar 16 '13 at 14:12
  • `from __future__ import unicode_literals`? But it's entirely possible that the problem isn't string literals but other sources of byte strings (e.g. "wrong" choice of API, or missing `encode`/`decode` calls). – delnan Mar 16 '13 at 14:13
  • @MattDMo sadly we are using some 3rd party libraries that are only supported for Python 2 – mmopy Mar 16 '13 at 14:14
  • @delnan That worked great. I'd love to run a script over the source and make everything consistent and then tell people to always do unicode strings. But this seems to fix some of the places I found as bugs. – mmopy Mar 16 '13 at 14:22
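
The `__future__` import delnan mentions works per module: placing it at the top of a file makes every unprefixed literal in that file unicode under Python 2, with no source rewriting needed (on Python 3 it is a no-op). A minimal sketch:

```python
from __future__ import unicode_literals  # must precede all other statements

s = "result code %d"  # no u prefix, but still a unicode literal
# Under Python 2 this prints <type 'unicode'> instead of <type 'str'>
print(type(s))
```

The downside noted in the comments still applies: this only changes literals, not byte strings coming from APIs or missing `encode`/`decode` calls.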

2 Answers


You can use `tokenize.generate_tokens` to break the string representation of Python code into tokens. `tokenize` also classifies the tokens for you, so you can identify string literals in Python code.

It is then not hard to manipulate the tokens, adding 'u' where desired:


import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

def change_str_to_unicode(text):
    result = text.splitlines()
    # Insert a dummy line into result so indexing result
    # matches tokenize's 1-based row numbers
    result.insert(0, '')
    changes = []
    # Python 2: text is a bytestring, so io.BytesIO works here
    for tok in tokenize.generate_tokens(io.BytesIO(text).readline):
        tok = Token(*tok)
        # Record the (row, col) of every string literal that does not
        # already carry a u/U prefix
        if tok.name == 'STRING' and not tok.val.lower().startswith('u'):
            changes.append(tok.start)

    # Walk the positions in reverse so earlier insertions on a line
    # do not shift the columns of later ones
    for linenum, s in reversed(changes):
        line = result[linenum]
        result[linenum] = line[:s] + 'u' + line[s:]
    return '\n'.join(result[1:])

text = '''print "result code %d" % result['code']
# doesn't touch 'strings' in comments
'handles multilines' + \
'okay'
u'Unicode is not touched'
'''

print(change_str_to_unicode(text))

yields

print u"result code %d" % result[u'code']
# doesn't touch 'strings' in comments
u'handles multilines' + u'okay'
u'Unicode is not touched'
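
An alternative to splicing lines by column is to edit the token stream itself and let `tokenize.untokenize` rebuild the source. Below is a sketch (the helper name `add_unicode_prefix` is made up) written against Python 3's `tokenize`, where `u''` literals are legal again since 3.3. Passing 2-tuples puts `untokenize` into its compatibility mode, which guarantees the result tokenizes back to the same stream but normalizes whitespace:

```python
import io
import token
import tokenize

def add_unicode_prefix(source):
    """Prefix every bare string literal in source with u."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        val = tok.string
        # A literal without a prefix starts directly with a quote;
        # u'', b'', r'' literals start with a letter and are skipped
        if tok.type == token.STRING and val[0] in '\'"':
            val = 'u' + val
        out.append((tok.type, val))
    # 2-tuples select untokenize's compatibility mode
    return tokenize.untokenize(out)

print(add_unicode_prefix("print('result code %d' % result['code'])\n"))
```

The trade-off: you lose the original spacing, but you never have to track line/column offsets yourself.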
unutbu

Try this (it uses a regex), and it's shorter than @unutbu's solution.
But there's a loophole: strings containing `#` won't work with this.

import re
scode = '''
print "'Hello World'" # prints 'Hello World'
u'Unicode is unchanged'
# so are "comments"'''
x1 = re.compile('''(?P<unicode>u?)(?P<c>'|")(?P<data>.*?)(?P=c)''')

def repl(m):
    return "u%(c)s%(data)s%(c)s" % m.groupdict()

fcode = '\n'.join(
      [re.sub(x1, repl, i)
       if '#' not in i
       else re.sub(x1, repl, i[:i.find('#')]) + i[i.find('#'):]
       for i in scode.splitlines()])
print fcode

Outputs:

print u"'Hello World'" # prints 'Hello World'
u'Unicode is unchanged'
# so are "comments"

For handling `#` inside strings I have this (and it's longer than @unutbu's solution :| )

import re
scode = '''print "'Hello World'"  # prints 'Hello World'
u'Unicode is unchanged'
# so are "comments"
'#### Hi' # 'Hi' '''

x1 = re.compile('''(?P<unicode>u?)(?P<c>'|")(?P<data>.*?)(?P=c)''')

def in_string(text, index):
    # Walk the text up to `index`, tracking how many quote levels are
    # open, to decide whether position `index` lies inside a string
    curr, in_l, in_str, level = '', 0, False, []

    for c in text[:index+1]:
        if c == '"' or c == "'":
            if in_str and curr == c:
                in_str = False
                curr = ''
                in_l -= 1
            else:
                in_str = True
                curr = c
                in_l += 1
        level.append(in_l)
    return bool(level[index])

def repl(m):
    return "u%(c)s%(data)s%(c)s" % m.groupdict()

def handle_hashes(i):
    if i.count('#') == 1:
        n = i.find('#')
    else:
        n = get_hash_out_of_string(i)
    return re.sub(x1,repl,i[:n]) + i[n:]

def get_hash_out_of_string(i):
    n = i.find('#')
    curr = i[:]
    last = i.rfind('#')  # position of the last '#'
    # Blank out each '#' that sits inside a string literal until we
    # reach one that is a real comment (or run out of candidates)
    while in_string(curr, n) and n < last:
        curr = curr[:n] + ' ' + curr[n+1:]
        n = curr.find('#')
    return n

fcode = '\n'.join(
    [re.sub(x1, repl, i)
     if '#' not in i
     else handle_hashes(i)
     for i in scode.splitlines()])

print fcode

Output:

print u"'Hello World'"  # prints 'Hello World'
u'Unicode is unchanged'
# so are "comments"
u'#### Hi' # 'Hi' 
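
Beyond `#`, the single-character-quote regex also mishandles triple-quoted strings: it pairs the first two quotes as an empty literal. A quick demonstration, reusing the same `x1` pattern and `repl` from above:

```python
import re

x1 = re.compile('''(?P<unicode>u?)(?P<c>'|")(?P<data>.*?)(?P=c)''')

def repl(m):
    return "u%(c)s%(data)s%(c)s" % m.groupdict()

# The opening quotes of '''abc''' are consumed as an empty string
# literal, so the result comes out mangled:
print(re.sub(x1, repl, "'''abc'''"))   # u''u'abc'u''
```

This is the kind of case where the `tokenize`-based approach wins, since the real tokenizer knows the full literal grammar.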
pradyunsg
  • I discourage the use of regular expressions to parse/manipulate irregular languages like Python, especially since there's a perfectly fine Python parser included in the language's standard library. Hence -1. – David Foerster Jan 29 '19 at 14:12