
As part of an effort to write code that works consistently on both Python 2 and 3, I would like to test for any unadorned string literals (any opening " or ' not preceded by a b or u).

I'm fine with writing test cases, so I just need a function that returns all unadorned string literals across my .py files.

As an example, say I have Python code containing the following:

example_byte_string = b'This is a string of ASCII text or bytes'

example_unicode_string = u"This is a Unicode string"

example_unadorned_string = 'This string was not marked either way and would be treated as bytes in Python 2, but Unicode in Python 3'

example_unadorned_string2 = "This is what they call a 'string'!"

example_unadorned_string3 = 'John said "Is it really?" very loudly'

I want to find all of the strings that are not explicitly marked, like example_unadorned_string, so that I can mark them properly and therefore make them behave the same way under Python 2 and 3. It would also be good to accommodate quotes within strings, like example_unadorned_string2 and example_unadorned_string3, where the u/b should be added only to the outer quotes, not the internal ones. Obviously, long term we will drop Python 2 support, and only byte strings will need explicit marking. This aligns with the approach recommended by python-future.org: http://python-future.org/automatic_conversion.html#separating-text-from-bytes

I can think of ways to do this with grep that are pretty nasty. AST looks potentially helpful, too. But I feel like somebody must have already solved this problem before, so thought I'd ask.

Aaron

1 Answer


You might want to explore the tokenize module (python2, python3). A rough Python 3 example would be something like this:

import tokenize
import token

def iter_unadorned_strings(f):
    """Yield STRING tokens whose literal text starts with a bare
    quote, i.e. has no b/u prefix."""
    tokens = tokenize.tokenize(f.readline)
    for t in tokens:
        if t.type == token.STRING and t.string[0] in ('"', "'"):
            yield t

fname = 'code_file.py'
if __name__ == '__main__':
    # tokenize.tokenize expects a readline that returns bytes,
    # so the file must be opened in binary mode.
    with open(fname, 'rb') as f:
        for s in iter_unadorned_strings(f):
            print(s.start, s.end, s.string)
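Since you want to check every .py file in a project, the same tokenize test can be driven by a directory walk. A minimal sketch (Python 3 only; the `scan_tree` name and the current-directory root are my own choices, not anything standard):

```python
import pathlib
import token
import tokenize

def scan_tree(root):
    """Walk every .py file under `root` and collect unadorned string
    literals as (path, line, column, literal) tuples."""
    hits = []
    for path in pathlib.Path(root).rglob('*.py'):
        with open(path, 'rb') as f:  # tokenize.tokenize wants bytes
            for t in tokenize.tokenize(f.readline):
                if t.type == token.STRING and t.string[0] in ('"', "'"):
                    hits.append((str(path), t.start[0], t.start[1], t.string))
    return hits

if __name__ == '__main__':
    for hit in scan_tree('.'):
        print(*hit)
```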
ChrisD
  • I think OP is searching for an inspection solution, not a parsing one – bobrobbob Jun 02 '18 at 12:41
  • But this is a parsing task. Once compiled, nothing in the `bytes`, `str` or `unicode` object holds information about how the literal was shaped. For instance, `dis.dis(compile('"hello" " " "world"', '', 'exec'))` shows only one constant string, not three. – Yann Vernier Jun 02 '18 at 13:12
  • Brilliant, thanks Chris. I had not used tokenize before, but your code seems to do exactly what I need. As you obviously realise, the .string attribute of the string token includes the b/u adornment, if there is one, as the first character. That means your t.string[0] will return "u" for a string literal marked Unicode, "b" for a Bytes string, and " or ' for any that have not been marked. Many thanks. – Aaron Jun 05 '18 at 08:31
  • Note that in Python 2 you need to use `tokenize.generate_tokens` instead of `tokenize.tokenize`, and it returns a plain tuple instead of a named one. You then also need `token.tok_name[t[0]]` to give you the type. – Aaron Jun 08 '18 at 12:07
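Building on Aaron's note above, a version that indexes tokens positionally should run under both interpreters, since `tokenize.generate_tokens` exists in Python 2 and 3 (a sketch only, verified on Python 3; the `iter_unadorned_compat` name is mine):

```python
import io
import token
import tokenize

def iter_unadorned_compat(readline):
    # generate_tokens exists on both Python 2 and 3; on Python 2 it
    # yields plain tuples, so index positionally instead of by name:
    # tok[0] is the token type, tok[1] the literal text, tok[2] the
    # (row, col) start position.
    for tok in tokenize.generate_tokens(readline):
        if tok[0] == token.STRING and tok[1][0] in ('"', "'"):
            yield tok

if __name__ == '__main__':
    src = u"a = b'bytes'\nb = 'plain'\n"
    for t in iter_unadorned_compat(io.StringIO(src).readline):
        print(token.tok_name[t[0]], t[2], t[1])
```

Note that `generate_tokens` takes a readline returning text, not bytes, so no encoding detection is done; on Python 3 the newer `tokenize.tokenize` is preferred when reading files from disk.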