
I'm trying to read a file's contents and convert them into the bits that are actually stored in memory. If I write

file = open("filename","br")
binary = "0b"
for i in file.read():
    binary += bin(i)[2:]

will `binary` equal the actual value stored in memory? If so, how can I convert it back into a string?

EDIT: I tried

file = open("filename.txt","br")
binary = ""
for i in file.read():
    binary += bin(i)[2:]
stored = ""
for bit in binary:
    stored += bit
    if len(stored) == 7:
        print(chr(eval("0b"+stored)), end="")
        stored = ""

and it worked fine until it reached a space, after which the output became strange symbols and mixed-up letters.

forever
  • It's not really clear what you're trying to do. `file.read()` is literally the bytes that are in the file. Could you give an example of what you think is in the file and what you want the result to look like? – Frank Yellin Sep 12 '20 at 21:22
  • I'm trying to do this for any text file in general. Also, I want the result to match what's in the file, to prove to myself that I actually have the binary version, for various purposes. – forever Sep 12 '20 at 21:24
  • Also, you may not know that when you loop over a bytes object, each item is the integer value of that byte, like what `ord` returns. – forever Sep 12 '20 at 21:30

1 Answer


To get a (somewhat) accurate representation of the string as it is stored in memory, you need to convert each character into binary.

Assuming basic ASCII encoding (1 byte per character):

s = "python"
binlst = [bin(ord(c))[2:].rjust(8,'0') for c in s]  # remove '0b' from string, fill 8 bits
binstr = ''.join(binlst)

print(s)
print(binlst)
print(binstr)

Output

python
['01110000', '01111001', '01110100', '01101000', '01101111', '01101110']
011100000111100101110100011010000110111101101110
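To go back from the bit string to text, one possible approach (a sketch, assuming every character was padded to exactly 8 bits as above) is to cut the string into 8-character chunks and parse each chunk as a base-2 number:

```python
binstr = "011100000111100101110100011010000110111101101110"

# Split into 8-bit chunks; this only works because each character
# was padded to a fixed width of 8 bits.
chunks = [binstr[i:i + 8] for i in range(0, len(binstr), 8)]

# int(chunk, 2) parses the chunk as a binary number;
# chr() maps that number back to a character.
decoded = ''.join(chr(int(chunk, 2)) for chunk in chunks)

print(decoded)  # python
```

Using `int(chunk, 2)` instead of `eval("0b" + chunk)` avoids evaluating arbitrary strings as code.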

For Unicode text encoded as UTF-8, each character takes 1 to 4 bytes, so you can't assume 8 bits per character when working from the string. As @Frank Yellin mentioned, it may be easier to just convert the file's bytes to binary.
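A minimal sketch of that byte-based approach follows. The file name `sample.txt` and its contents are made up for illustration, and the sketch assumes the file is UTF-8 encoded text. Each byte is padded to 8 bits, so the round trip works no matter how many bytes a character takes:

```python
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("héllo wörld")

# Read the raw bytes exactly as they are stored in the file.
with open(path, "rb") as f:
    data = f.read()

# Iterating over a bytes object yields ints in 0..255;
# format(b, '08b') keeps the leading zeros that bin() drops.
bits = ''.join(format(b, '08b') for b in data)

# Reverse the process: 8 bits -> one byte, then decode the bytes as UTF-8.
restored = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
text = restored.decode("utf-8")

print(bits)
print(text)
```

Decoding happens only after all bytes are rebuilt, so multi-byte characters such as `é` are handled by `decode` rather than by the chunking logic.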

Mike67
  • I found an interesting article describing how to determine how many bytes need to be read for each UTF-8 encoded character: https://www.johndcook.com/blog/2019/09/09/how-utf-8-works/ – luthervespers Sep 13 '20 at 00:06
  • @Mike67 so the problem was that `bin` deletes trailing zeros so you need to add them back? – forever Sep 13 '20 at 18:07
  • It deletes leading zeros, so 00001101 becomes 1101. You need to add the zeros back to fill 8 bits. – Mike67 Sep 13 '20 at 18:30
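
The leading-zero behaviour described in the comments can be checked directly. This short sketch (variable names are only for illustration) shows that `bin` returns a different number of digits for different characters, which is why splitting the concatenated string into fixed 7-bit groups breaks at the first space:

```python
# bin() returns the shortest representation, so the width varies.
a_bits = bin(ord('a'))[2:]      # 'a' is 97
space_bits = bin(ord(' '))[2:]  # ' ' is 32
short = bin(13)[2:]

# format(..., '08b') pads with leading zeros to a fixed width of 8.
padded = format(13, '08b')

print(a_bits, len(a_bits))          # 1100001 7
print(space_bits, len(space_bits))  # 100000 6
print(short)                        # 1101
print(padded)                       # 00001101
```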