
I am making a class that relies heavily on regular expressions.

Let's say my class looks like this:

class Example:
    def __init__(self, regex):
        self.regex = regex

    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))

And let's say I use it like this:

import re

example = Example(re.compile(r'\d+'))

If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. Note the extra backslashes: these are the values as the interactive interpreter displays them, and they appear correctly once printed. I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether this always works, and I can't think of an edge case off the top of my head. Is there a more formal way of doing this?

EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.

Tyler Crompton

1 Answer


The problem with a raw-string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if you had a line break in your string, you would have to literally break the string onto the next line, because a line break cannot be written as an escape in a raw string.

That said, the actual way to get rawstring representation is what you already gave:

"r'{}'".format(regex.pattern)

The definition of raw strings is that no rules are applied except that they end at the quotation character they start with and that you can escape said quotation character using a backslash (the backslash itself then remains part of the string). Thus, for example, you cannot store a string consisting of a single backslash in raw-string representation (r"\" yields a SyntaxError, and r"\\" yields "\\\\", i.e. two backslashes).
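The rules above can be checked directly. The following is a minimal sketch, not part of the original answer; `parses` is a hypothetical helper that only reports whether a piece of source text is a valid literal.

```python
import ast

# A backslash before the quote keeps the literal open, but the backslash
# itself stays in the resulting string.
assert r"\"" == '\\"'
assert len(r"\"") == 2

# Two backslashes in a raw literal are two characters, not one.
assert r"\\" == "\\\\"
assert len(r"\\") == 2


def parses(source):
    """Hypothetical helper: return True if *source* is a valid literal."""
    try:
        ast.literal_eval(source)
    except SyntaxError:
        return False
    return True


# A raw literal cannot end in an odd number of backslashes.
assert not parses('r"\\"')    # source text: r"\"
assert parses('r"\\\\"')      # source text: r"\\"

# A real line break inside a single-quoted raw literal is a syntax error.
assert not parses("r'a\nb'")
```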

If you really want to do this, you should use a wrapper like:

def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    # Control characters (including line breaks) are left to repr().
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)

    # An odd number of trailing backslashes would escape the closing quote.
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)

    # Pick a quote character that does not occur in s; if both kinds
    # occur, fall back to repr().
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'

    return pattern.format(s)

Tests:

>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
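One way to gain some confidence in the wrapper is to check that whatever it returns evaluates back to the original string. The sketch below is an illustration rather than part of the original answer: it repeats `rawstr` so it runs on its own, plugs it into the `Example` class from the question, and uses a hypothetical helper `round_trips` built on `ast.literal_eval`. The sample strings are arbitrary, so passing them is no proof that every input round-trips.

```python
import ast
import re


def rawstr(s):
    """Copy of the wrapper above, repeated so this sketch runs on its own."""
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)


class Example:
    def __init__(self, regex):
        self.regex = regex

    def __repr__(self):
        return 'Example({})'.format(rawstr(self.regex.pattern))


def round_trips(s):
    """Hypothetical helper: does the literal from rawstr() evaluate back to s?"""
    return ast.literal_eval(rawstr(s)) == s


# Arbitrary samples, including the test strings used above.
samples = ["\\", "foobar \n", r"a \valid rawstring", "foo \\\\\\",
           r"foo \\", "'", '"', r"\d+", "both ' and \""]
assert all(round_trips(s) for s in samples)

print(repr(Example(re.compile(r'\d+'))))  # Example(r'\d+')
```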
Jonas Schäfer
  • +1 Though the implementation is flawed (assumes ASCII, does not catch *all* instances of an odd number of backslashes at the end of the string) and the rest is ugly (how about `if any( for c in s)`?). –  Dec 08 '12 at 15:11
  • good point, didn't think about the general problem of an odd number of backslashes, I'll try to extend that. – Jonas Schäfer Dec 08 '12 at 15:11
  • Just got done playing around with your code. This is impressive! I didn't even think about the control characters. I see that your function falls back to the normal string representation in the event of a control character. By the way, `filter` returns an iterator, so there's no need to call `iter`. :) Thank you. – Tyler Crompton Dec 08 '12 at 15:33
  • @TylerCrompton Thanks for thanking! ``filter``: That's dependent on the python version. In Python2, it'll be a list. – Jonas Schäfer Dec 08 '12 at 15:58
  • @delnan Oh, didn't even think about ``any``. Thanks for the suggestion. Cannot fix the other condition without using itertools though. With itertools, i'd do a ``sum(map(lambda x: 1, takewhile(lambda x: x == "\\", reversed(s))))`` off the top of my head. – Jonas Schäfer Dec 08 '12 at 16:02
  • @JonasWielicki, that's probably the best way. A similar, more readable way: `len(tuple(takewhile(lambda x: x == '\\', reversed(s))))`. – Tyler Crompton Dec 08 '12 at 17:39
  • I thought about using a list and taking the length too, but I preferred to go without construction of a list, at least in Py3. OT: tuples are actually more expensive to construct (did some benchmarking in an often-called function inside some GUI framework once) – Jonas Schäfer Dec 08 '12 at 17:41
  • Interesting. I had assumed it was the other way around since they are immutable. Anyway, I don't think either are proper Python and should be broken up across lines into an equivalent suite. – Tyler Crompton Dec 08 '12 at 17:44
  • Exactly the same surprise I had. Going through the relevant commit logs, it might've been negligible though, even in that routine (just like a 2% speedup). Maybe because they have to set up the hashing infrastructure? – Jonas Schäfer Dec 08 '12 at 18:43
  • One issue with this: it won't work if the string contains the `'` character. – interjay Aug 29 '13 at 11:53
  • Why do you exclude characters in the 0-32 interval? I think all of those are valid in a raw string, and [I know tabs and line feeds are definitely okay in a raw string](https://ideone.com/VN2qgk). – user2357112 Oct 11 '18 at 05:21
  • Aside from that, this function also has problems with raw strings that contain both apostrophes and double quotes, [which can happen when backslashes are used](https://ideone.com/I8AsXz). – user2357112 Oct 11 '18 at 05:24
  • This complex answer indirectly taught me a much simpler lesson. When I'm using Python interactively or a debugger and I want to look at a string variable, I don't just enter its name any more. Instead I: `print(string_var1)` – MarcH Nov 26 '19 at 18:44
  • @MarcH That may conceal things, try printing ``string_var1 = "foo\rbar"`` for example. Will often not matter, but it may in some cases (which is why stuff like repr() exists) – Jonas Schäfer Nov 26 '19 at 19:38
  • Thanks @JonasSchäfer you're right: for tricky strings you want to use _both_ `string_var1` and `print(string_var1)` in a debugger. For merely counting backslashes though, `print(string_var1)` is enough :-) – MarcH Nov 27 '19 at 18:16
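The comments above discuss two ways of counting the backslashes at the end of a string: the `rstrip`-based difference used in `rawstr` and an `itertools.takewhile` variant. As a rough sketch (the function names are hypothetical, and the generator form is one possible spelling of the `takewhile` idea), the two can be compared on a few arbitrary inputs:

```python
from itertools import takewhile


def trailing_backslashes_rstrip(s):
    """Count trailing backslashes by comparing lengths before and after rstrip."""
    return len(s) - len(s.rstrip("\\"))


def trailing_backslashes_takewhile(s):
    """Count trailing backslashes by walking the string from its end."""
    return sum(1 for _ in takewhile(lambda ch: ch == "\\", reversed(s)))


# Arbitrary sample strings; both approaches should agree on each of them.
for sample in ["", "\\", "foo \\\\\\", r"foo \\", r"a\b", "\\\\x"]:
    assert (trailing_backslashes_rstrip(sample)
            == trailing_backslashes_takewhile(sample))
```

Neither version builds an intermediate list or tuple of the matched characters, which was the concern raised about the `len(tuple(...))` spelling.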