I am working with a ctypes function that requires a byte string, which is read in from a file. If I put the string into the script, it works, as long as I denote the string as a string literal (i.e. with 'r') and then convert it to a byte string. But if I just read it in as a byte string, it does not work. Is there a way to read in a file as a string literal?

if __name__ == '__main__':
    a = r"\x00hello"
    with open('some_file', 'rb') as f: # some file contains only "\x00hello"
        b = f.read()
    c = b"\x00hello"

    x = CtypeObj.Function(a.encode('utf-8', errors='ignore')) # success!
    y = CtypeObj.Function(b)                                  # failure!
    z = CtypeObj.Function(c)                                  # failure!
meowcat
  • The `rb` flag is used to read as bytes. Use `r` to read as a string, and pass `encoding='utf-8'`. It might fail if the file contains bytes that are not valid UTF-8. – Kris Jul 12 '22 at 03:55
  • "string literal" means a string written into source code. The "r" makes a string literal also a "raw" string that does not treat the backslash specially. What does `b` look like? Can you print it for us to see? – tdelaney Jul 12 '22 at 03:56
  • `a` starts with a backslash while `c` starts with a NUL character. Your API seems to not like that NUL. – tdelaney Jul 12 '22 at 04:01
  • Are you aware that `r"\x00" == "\\x00"`? That might be the problem: you might be misunderstanding how data is represented in a string. In that case, there are existing questions about that, like [Why do backslashes appear twice?](/q/24085680/4518341) If the file literally contains quotation marks and a backslash, you might want to read [How to convert string representation of list to a list](/q/1894269/4518341), but I'd question why you have a file like that in the first place. Maybe it's actually JSON? – wjandrea Jul 12 '22 at 04:06
  • If the goal is to encode the NUL in the file, you may be able to `import codecs` and then do `b = codecs.escape_encode(b)[0]` (the call returns a tuple of the escaped bytes and the number of bytes consumed). – tdelaney Jul 12 '22 at 04:08
  • In the above, run `print(ascii(a.encode('utf-8', errors='ignore')))` as well as `print(ascii(b))` and `print(ascii(c))`, and post the results. This will unambiguously show us the *exact* content of each string. – Mark Tolonen Jul 12 '22 at 05:45

2 Answers


The line that you point at as a success likely isn't doing what you think it is doing either:

a = r"\x00hello"

That defines a string of 9 characters: a backslash, x, 0, 0, and the five letters of hello. Calling a.encode('utf-8', errors='ignore') encodes those characters as UTF-8 and returns a bytes value of the same 9 characters, which CtypeObj.Function() accepts.

I would assume that you don't really want that \x00 part passed to the function?
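
To make the difference concrete, here is a minimal sketch that compares the raw string from the question with the bytes literal. The variable name `encoded` is only for illustration; everything else is taken from the question.

```python
# Compare the raw string from the question with the bytes literal.
a = r"\x00hello"   # raw string: backslash, 'x', '0', '0', then "hello"
c = b"\x00hello"   # bytes literal: one NUL byte, then "hello"

encoded = a.encode("utf-8")

print(len(a), len(c))     # 9 6
print(encoded)            # b'\\x00hello'
print(encoded == c)       # False
print(encoded[0], c[0])   # 92 0  (92 is the backslash character)
```

The encoded value starts with a literal backslash, so the function never receives a 0 byte in the call that succeeds.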

Reading the file in 'rb' mode gets you a bytes value as well, but those bytes are in whatever encoding the file was saved in. If you need the value in UTF-8 (and the file might not be), then you should instead open the file in 'r' mode, read the value as a string, and encode it with b.encode('utf-8').

And finally this line:

c = b"\x00hello"

This just creates a bytes value of length 6, with the first byte being the 0 byte and the rest the values for the 5 letters. That is certainly not the same as the 9 bytes you had before. If the function expects a C-style string, a 0 byte marks the end of the string, so a value that starts with one would look empty to it, which may be why this call fails. Again, it would seem you don't want that \x00 at the start, since it's very unusual for a string to start with a null character like that.
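
As a rough illustration of why a leading 0 byte can be a problem, the sketch below uses only the standard ctypes module. It assumes the function takes a NUL-terminated `char *`, which the question does not confirm; `CtypeObj.Function` itself is not called here.

```python
import ctypes

data = b"\x00hello"

# c_char_p models a NUL-terminated C string (char *): reading it back
# stops at the first 0 byte, so nothing after the leading NUL is seen.
as_c_string = ctypes.c_char_p(data)
print(as_c_string.value)   # b''

# A char buffer keeps every byte; .raw ignores NUL termination.
buf = ctypes.create_string_buffer(data)
print(len(buf))            # 7: six data bytes plus a terminating NUL
print(buf.raw)             # b'\x00hello\x00'
```

If the function really does need embedded 0 bytes, it would usually also need an explicit length argument, since a C string alone cannot carry them.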

As indicated in the comments, r"\x00hello" and 'hello' are both string literals, but that's only meaningful in the context of code. In terms of data, you only have strings of characters (str) and bytes values (sometimes called a string of bytes). A "literal" is a way to write either in code directly:

s = 'hello'   # a string literal
b = b'hello'  # a bytes literal for the same text (under most encodings)

s == b.decode()  # True
b == s.encode()  # True

If you read a file using mode 'r', you get strings. If you read a file using mode 'rb', you get bytes.
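
A small sketch of the two modes side by side. It writes the nine characters \x00hello to a temporary file as a stand-in for the asker's 'some_file'; the temporary file and the names `text` and `data` are assumptions made for the example.

```python
import os
import tempfile

# Write the nine characters \x00hello to a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(r"\x00hello")
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    text = f.read()    # str
with open(path, "rb") as f:
    data = f.read()    # bytes
os.remove(path)

print(type(text).__name__, repr(text))   # str '\\x00hello'
print(type(data).__name__, data)         # bytes b'\\x00hello'
print(text.encode("utf-8") == data)      # True
```

Either way the content is the same nine characters; reading the file does not interpret the backslash escape.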

Grismar

Try this:

if __name__ == '__main__':
    with open('./file.txt', 'rb') as f:
        # read `\x00hello` from file, remove trailing newline
        line = f.read().rstrip()
        # decode the unicode escapes, then re-encode
        line = line.decode('unicode-escape').encode('utf-8')

    print(line)
    print(b'\x00hello')

    print(line == b'\x00hello')

Adapted from advice from this answer.


[~] $ cat file.txt
\x00hello
[~] $ python script.py
b'\x00hello'
b'\x00hello'
True
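
One caveat, sketched under the assumption that the file may also contain escapes for byte values of 0x80 and above (the example value \xff is hypothetical): re-encoding as UTF-8 turns each such character into two bytes, while re-encoding as Latin-1 maps every code point below 256 back to a single byte.

```python
escaped = rb"\xffhello"   # 9 bytes, as they would sit in the file

# unicode_escape turns the 4-character escape into the single character U+00FF.
text = escaped.decode("unicode_escape")

via_utf8 = text.encode("utf-8")      # U+00FF becomes two bytes
via_latin1 = text.encode("latin-1")  # U+00FF becomes the single byte 0xFF

print(via_utf8)     # b'\xc3\xbfhello'
print(via_latin1)   # b'\xffhello'
```

For the \x00hello case in the question both encodings give the same result, so this only matters when the escaped data goes beyond ASCII.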
matthew-e-brown
  • Please vote or flag to close duplicates as duplicates. The extra context for the problem doesn't make it a different problem - it only defocuses the question. – Karl Knechtel Aug 05 '22 at 03:14