I am experimenting with how Python 3 handles bytes when reading and writing data, and I have come across a troubling problem whose source I can't find. I read bytes out of a JPEG file, convert them to integers using ord(), then turn each integer back into a character with the line chr(character).encode('utf-8') and write it into a new JPEG file. No issue, right? Well, when I try to open the new JPEG file, I get a Windows 8.1 notification saying it cannot open the photo. When I compare the two files, one is 5.04 MB and the other is 7.63 MB, which has me awfully confused.

def __main__():
    operating_file = open('photo.jpg', 'rb')

    while True:
        data_chunk = operating_file.read(64*1024)
        if len(data_chunk) == 0:
            print('COMPLETE')
            break
        else:
            new_operation = open('newFile.txt', 'ab')
            for character in list(data_chunk):
                new_operation.write(chr(character).encode('utf-8'))


if __name__ == '__main__':
    __main__()

This is the exact code I am using. Any ideas on what is happening and how I can fix it?

NOTE: I am assuming that the list of numbers that list(data_chunk) provides is the equivalent to ord().

Cœur
  • 1
    Why are you using `list`? So far as I can see `data_chunk` will be a bytes object that can be iterated on a byte at a time. I'm also puzzled why you are specifying `utf-8`. If you are reading bytes then you don't want them converted to characters. – cdarke Apr 16 '16 at 16:15
  • Try your code on a much smaller test file — it doesn't have to be a real JPEG — with a smaller "chunk size", and then compare the size and contents of the two files. You could also easily test whether your assumption is correct. BTW, the proper way to open a file for reading in binary mode in Python 3 is `open(filename, 'r', newline='')`. It's something similar for writing. – martineau Apr 16 '16 at 16:15
  • 2
    Encoding bytes as UTF-8 for writing a JPEG file is wrong. It'll take any byte with a value above 0x7F and encode it as multiple bytes, corrupting the data. – nobody Apr 16 '16 at 16:19
  • I am encoding 'utf-8' because without it I get a writing error, and @martineau I don't add the newline argument because when reading binary I get an error when I use the newline argument. Also, I am still getting higher byte count on the new file when I try using smaller chunksizes. – TheMountainFurnaceGabriel Apr 16 '16 at 16:22
  • Comparing the hexdump of a JPEG with the output of your scripts results in the exact behaviour as described by @AndrewMedico which is causing the file structure of the JPEG to be damaged. – letmutx Apr 16 '16 at 16:22
  • @TheMountainFurnaceGabriel The fact that a piece of code compiles and does not throw a runtime exception does NOT imply that it's actually the logically correct solution to your problem. – nobody Apr 16 '16 at 16:27
  • @AndrewMedico how do you suggest I write back into the file? I need to be able to handle the bytes in an integer form to perform arithmetic operations (Encryption/Decryption) before writing back into the file. – TheMountainFurnaceGabriel Apr 16 '16 at 16:27
  • @martineau: "*the proper way to open a file for reading in binary mode in Python 3 is open(filename, 'r', newline='')*" I can't find any documentation stating that. Can you please share your source? Of course that will read string objects, not bytes. I think we want byte objects in this (rare) case. – cdarke Apr 16 '16 at 16:32
  • @cdarke: It's in the documentation on the built-in [`open()` function](https://docs.python.org/3/library/functions.html#open) where it says that with a `newline=''` argument line endings are returned to the caller untranslated when reading and no newline translation takes place when writing — exactly what is needed to copy data from one file to another without changing it. On some operating systems (Linux), it's a moot point because they don't have a "text" mode, which means that file I/O is effectively always done in "binary" mode. – martineau Apr 16 '16 at 16:58
  • @martineau: Linux C does not have text or binary modes, but Python does. From the docs: "Python distinguishes between binary and text I/O". Your assertion that "the proper way to open a file for reading in binary mode in Python 3 is open(filename, 'r', newline='')" is **wrong**; that *only* affects newlines, nothing else. – cdarke Apr 16 '16 at 17:10
  • @martineau Byte objects are what I wanted in the case. Sorry for any confusion. – TheMountainFurnaceGabriel Apr 16 '16 at 17:14
  • 1
    @cdarke: You're right and I was wrong about having to use `newline=''` to get binary mode. I was confusing binary mode with universal newline translation which is the default for text mode (and which is automatically suppressed in binary mode). Sorry. – martineau Apr 16 '16 at 17:26
  • @martineau: I have to admit that you had me diving for the doc. I can see how setting the newline to an empty string can have a similar effect to binary, and for many applications it would have worked fine. Usually in Python 3 those byte objects are a pain, this question is very much an exception. We all get it wrong sometimes, don't worry about it (and please don't look at the number of down votes I've got in the past). – cdarke Apr 17 '16 at 07:51

2 Answers

2

Here is a simple example you might wish to play with:

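import sys

# Read the whole file in binary mode; stuff refers to a bytes object
with open('gash.txt', 'rb') as f:
    stuff = f.read()

print(stuff)

# Iterating over a bytes object yields ints, so convert each one
# back to a single byte before writing it out
with open('gash2.txt', 'wb') as f2:
    for i in stuff:
        f2.write(i.to_bytes(1, sys.byteorder))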

As you can see, the bytes object is iterable, but in the for loop we get back an int in i. To convert that back to a single byte I use the int.to_bytes() method (the byte order is irrelevant for a length of 1).
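If the integers need to be changed before they are written back (for instance the arithmetic mentioned in the comments), a minimal sketch could look like the one below. The function name `transform_bytes` and the XOR key are hypothetical and used only for illustration; they are not taken from the question. The sketch relies on the fact that `bytes()` accepts an iterable of ints in the range 0 to 255, so the output can be built in one call rather than byte by byte.

```python
def transform_bytes(data: bytes, key: int = 0x5A) -> bytes:
    """XOR every byte with key (a hypothetical stand-in for real arithmetic)."""
    # Iterating over bytes yields ints in range(256); bytes() packs them back.
    return bytes(b ^ key for b in data)


original = bytes(range(256))            # every possible byte value once
scrambled = transform_bytes(original)   # same length, different content
restored = transform_bytes(scrambled)   # XOR with the same key undoes it
```

Because no text encoding is involved, the output always has exactly as many bytes as the input.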

cdarke
  • The OP is using Python 3, so the way you're opening files for binary mode is incorrect (even though it may work on some operating systems). – martineau Apr 16 '16 at 16:27
  • @martineau: this code works fine on python 3.5. What is wrong with it? – cdarke Apr 16 '16 at 16:28
  • Then you are likely not running it on an OS where it matters, like Windows. Another possibility is that the test file only contains ASCII data. – martineau Apr 16 '16 at 16:33
  • @martineau I am running Windows 8.1, but I am more than sure that the test file has more than ASCII characters. – TheMountainFurnaceGabriel Apr 16 '16 at 16:39
  • @martineau: I'm running on OS X. But the point is that we don't want string objects, we want bytes. `b` is perfectly legitimate to use in Python 3 for a binary file. It is binary, not ASCII, UTF-8, or any encoding, it is just a series of bytes. All that setting `newline=""` will do is suppress the translation of newlines: **that is not binary mode**. – cdarke Apr 16 '16 at 16:39
  • Why are you using `to_bytes()` on a one byte value where byte order is meaningless? – martineau Apr 16 '16 at 17:11
  • @martineau: to convert an `int` object to a byte. There are other ways to do it, that's true. – cdarke Apr 16 '16 at 17:15
  • 1
    I asked because I was (mistakenly) thinking `i` would be of type `byte`, not integer, since `stuff` is a `bytes` object (an immutable array of 8-bit values). However, since Python doesn't have a `byte` type, iterating over the array produces values of type `int`, which are generally multibyte values. Thanks for bearing with me and allowing me to also learn from your answer. – martineau Apr 16 '16 at 17:54
  • @martineau: have a good weekend and I hope you didn't think I was rude to you, I didn't mean to. – cdarke Apr 17 '16 at 07:54
  • 1
    @cdarke: Hardly and same wishes back...your explanations helped clear up some confusions I had about the differences between binary mode and disabling universal newlines — which primarily stem from how the csv module wants files opened in binary mode in Python 2, and `newline=''` mode in Python 3. – martineau Apr 17 '16 at 08:47
0

When you have a code point and you encode it in UTF-8, it is possible for the result to contain more bytes than the original.

For a specific example, refer to the Wikipedia page on UTF-8 and consider the hexadecimal value 0xA2.

This is a single byte value, less than 255, but when encoded to UTF-8 it becomes the two bytes 0xC2, 0xA2.
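The expansion is easy to check in an interpreter. The following is a small sketch, with sample values picked purely for illustration, that applies the same `chr(...).encode('utf-8')` step as the question to a few byte values and then to all 256 of them:

```python
# Values up to 0x7F stay one byte; values from 0x80 up become two bytes.
for value in (0x41, 0x7F, 0x80, 0xA2, 0xFF):
    encoded = chr(value).encode('utf-8')
    print(hex(value), '->', encoded, len(encoded), 'byte(s)')

# Apply the same conversion to every possible byte value once.
all_bytes = bytes(range(256))
expanded = b''.join(chr(b).encode('utf-8') for b in all_bytes)
print(len(all_bytes), 'bytes in,', len(expanded), 'bytes out')
```

Half of the possible byte values double in size, so data whose bytes are spread fairly evenly would grow by about 1.5 times, which is consistent with the 5.04 MB and 7.63 MB sizes reported in the question.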

Given that you are pulling bytes out of your source file, my first recommendation would be to simply pass the bytes directly to the writer of your target file.
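As a rough sketch of that recommendation: the helper name `copy_binary` and the temporary file names below are hypothetical, and the throwaway file simply stands in for the photo. Each chunk returned by `read()` is already a bytes object, so it can be handed to `write()` unchanged.

```python
import os
import tempfile


def copy_binary(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path in chunks without any encoding step."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:          # empty bytes object means end of file
                break
            dst.write(chunk)       # chunk is already bytes; write it as-is


# Demonstration on a throwaway file containing every byte value.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'source.bin')
    dst_path = os.path.join(tmp, 'copy.bin')
    with open(src_path, 'wb') as f:
        f.write(bytes(range(256)) * 4)
    copy_binary(src_path, dst_path)
    with open(dst_path, 'rb') as f:
        copied = f.read()
    same_size = os.path.getsize(src_path) == os.path.getsize(dst_path)
```

Opening the destination once in 'wb' mode, rather than reopening it in append mode for every chunk, also means a leftover file from an earlier run cannot inflate the result.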

If you are trying to understand how file I/O works, be wary of encode() when using a binary file mode. Binary files don't need to be encoded or decoded - they are raw data.

aghast