I have a piece of code that contains strings with UTF-8 escape sequences written in decimal, such as

my_string = "Hello\035"

which should then be interpreted as

Hello#

I don't mind parsing the decimal value myself. So far I've run the entire string through the following, which seems to work best (it raises no error and does decode something):

import codecs

print(codecs.escape_decode(my_string)[0].decode("utf-8"))

But the numbering seems off: I have to use the \043 escape sequence to get the hash sign (#) decoded properly, and the same is true for all the other characters.

Saeko
  • Where are these escape sequences coming from? The sender may be unaware that Python interprets a backslash-and-three-digits in a string as an _octal_ value, not a decimal one. Maybe if you clear up this confusion, then they'll be happy to use 043 (which is the octal form of 35) to represent a hash sign. – Kevin Feb 25 '19 at 13:10
  • This is a school project and we must be able to handle input that is written such as: \xyz, and the range is from 000 to 999, but it is decimal, not octal – Saeko Feb 25 '19 at 13:13
  • @Sandeep yes, I'm aware of that, but I get the escape sequence written in decimal. Since it's a school project, I can't really change the input to something more reasonable even if I wanted to – Saeko Feb 25 '19 at 13:18
  • @Sandeep: The OP knows that I think. The question is how to handle it when it's written in decimal. – martineau Feb 25 '19 at 13:18
  • The problem is that it is not possible if it is Python source. chr(35), "\043", "#" and "\x23" are representations of the exact same character in decimal, octal, ASCII and hexadecimal, respectively. All of them are equal, so the Python interpreter cannot guess which one was used. AFAIK, only the inspect module can, but it is hard to use it to find the source of a variable value... – Serge Ballesta Feb 25 '19 at 13:28
  • ... Things are different if it is an *input* value. In that case you get the individual characters and it is possible to parse that input. – Serge Ballesta Feb 25 '19 at 13:30
  • The idea that `my_string = "Hello\035"` involves something called "UTF-8 escape sequences" is fundamentally flawed. Execution as Python source also does not result in the value of the my_string variable containing any kind of escape sequence. – Tom Blodget Feb 26 '19 at 00:07

1 Answer

You can't unambiguously detect and replace all \ooo escape sequences from a string literal, because those escape sequences are irretrievably replaced with their corresponding character values before your first line of code ever runs. As far as Python is concerned, "foo\041" and "foo!" are 100% identical, and there's no way to determine that the former object was defined with an escape sequence and the latter wasn't.
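A small sketch of that point, using only built-ins: the escape in a normal literal is resolved as octal when the source is compiled, so nothing of the original spelling survives in the string object.

```python
# "\041" in a normal literal is octal: 0o41 == 33 == ord("!")
escaped = "foo\041"
plain = "foo!"

print(escaped == plain)   # True
print(len(escaped))       # 4
print(repr(escaped))      # 'foo!'

# Decimal 35 is "#", which is octal 043; "\035" is octal 35, i.e. decimal 29
print(int("043", 8))      # 35
print("\043" == chr(35))  # True
print(ord("\035"))        # 29
```

This is also why \043 rather than \035 produces the hash sign when the string passes through Python's own escape handling.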

If you have some flexibility with regard to the form of the input data, then you might still be able to do what you want. For example, if you're allowed to use raw strings instead of regular strings, then r"Hello\035" won't get interpreted as "Hello, followed by a hash sign" before run time. It will be interpreted as "Hello, followed by a backslash, followed by 0, 3 and 5". Since the digit characters are still accessible, you can manipulate them in your code. For example,

import re

def replace_decimal_escapes(s):
    return re.sub(
        # locate every backslash followed by exactly three digits
        r"\\(\d\d\d)",
        # fetch the digit group, interpret it as a decimal integer, then get the corresponding character
        lambda x: chr(int(x.group(1), 10)),
        s
    )

test_strings = [
    r"Hello\035",
    r"foo\041",
    r"The \040quick\041 brown fox jumps over the \035lazy dog"
]

for s in test_strings:
    result = replace_decimal_escapes(s)
    print("input:  ", s)
    print("output: ", result)

Result:

input:   Hello\035
output:  Hello#
input:   foo\041
output:  foo)
input:   The \040quick\041 brown fox jumps over the \035lazy dog
output:  The (quick) brown fox jumps over the #lazy dog
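A few boundary cases may be worth checking against the stated 000 to 999 range. The sketch below repeats the same substitution in compact form and assumes each escape denotes a Unicode code point (which is what chr() produces), not an individual UTF-8 byte; that interpretation is an assumption, not something the question confirms.

```python
import re

def replace_decimal_escapes(s):
    # Same idea as above: backslash + exactly three digits -> chr(decimal value)
    return re.sub(r"\\(\d{3})", lambda m: chr(int(m.group(1), 10)), s)

# Only the first three digits after the backslash are consumed
print(replace_decimal_escapes(r"\0351"))             # #1
# Fewer than three digits: no match, so the text is left unchanged
print(replace_decimal_escapes(r"\35"))               # \35
# Values above 127 map to non-ASCII code points (233 == 0xE9)
print(replace_decimal_escapes(r"\233") == "\u00e9")  # True
# The top of the range is still a valid code point for chr()
print(ord(replace_decimal_escapes(r"\999")))         # 999
```

If the escapes were instead meant as raw UTF-8 byte values, the digits would have to be collected into a bytes object and decoded, and values above 255 would not be representable.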

As a bonus, this approach also works if you get your input strings via input(), since backslashes typed in that prompt by the user aren't interpreted as escape sequences. If you do print(replace_decimal_escapes(input())) and the user types "Hello\035", then the output will be "Hello#" as desired.
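The same holds for any text that arrives at run time rather than as a literal. As a rough illustration, io.StringIO stands in here for a file or standard input (the name fake_input is made up for this sketch); the doubled backslashes are needed only because the sample data is itself written as a source literal.

```python
import io
import re

def replace_decimal_escapes(s):
    # Backslash + exactly three digits -> chr(decimal value)
    return re.sub(r"\\(\d{3})", lambda m: chr(int(m.group(1), 10)), s)

# Hypothetical stand-in for a file or stdin containing Hello\035 and foo\041
fake_input = io.StringIO("Hello\\035\nfoo\\041\n")

# Lines read at run time still contain the literal backslash and digits
decoded = [replace_decimal_escapes(line.rstrip("\n")) for line in fake_input]
print(decoded)  # ['Hello#', 'foo)']
```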

Kevin