File.read() is jumping to weird address in python

Question

The code below

fd = open(r"C:\folder1\file.acc", 'r')
fd.seek(12672)
print str(fd.read(1))
print "after", fd.tell()

is printing `after 16257` instead of the expected `after 12673`.

What is going on here? Is there a way the creator of the file can put some sort of protection on the file to mess with my reads? I am only having issues with a range of addresses. The rest of the file reads as expected.

*Why when I do this stuff with absolutely no error checking does it act strange?* — Ken White, Mar 27 '17 at 22:42
@KenWhite I'd certainly be open to you showing me how I am supposed to error check this. — blindguy, Mar 27 '17 at 22:44
Would you please provide a full working example with an accompanying file? I ran your code with no problem (with the positions altered for a small file). — Andre S., Mar 27 '17 at 23:17
@overflowed how do I upload files? above is the code. I was having issues with a larger script but that is the simplest code to replicate it. I would expect your test code to work, as this same code will work properly for different addresses. — blindguy, Mar 27 '17 at 23:21
I am, unfortunately, unable to reproduce your issue with the given code. It's working fine on my system. — Andre S., Mar 27 '17 at 23:31
With a pathname like `C:\...` you are clearly using Windows. You also opened the file with `'r'` not `'rb'`, so you have opened it in "text mode". I would therefore not be surprised at file offsets acting odd, but as I don't use Windows, I would not dare try to explain this *particular* set of values. — torek, Mar 27 '17 at 23:42
@torek that was it! I'm doing this relatively complicated backwards engineering with a hex editor and file monitoring etc., and I miss a simple 'b', ha. Add your comment as an answer if you want rep credit. — blindguy, Mar 28 '17 at 12:39

score 3 · Accepted Answer · edited May 23 '17 at 12:02

It looks as though you are trying to deal with a file with a simple "stream of bytes at linearly increasing offsets" model, but you are opening it with 'r' rather than 'rb'. Given that the path name starts with C:\ we can also assume that you are running on a Windows system. Text streams on Windows—whether opened in Python, or in various other languages including the C base for CPython—do funny translations where '\n' in Python becomes the two-byte sequence '\r', '\n' within the bytes-as-stored-in-the-file. This makes file offsets behave in a non-linear fashion (though as someone who avoids Windows I would not care to guess at the precise behaviors).
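Here is a minimal sketch of the mismatch, assuming Python 3. There the newline translation on read is done by Python's own text layer on every platform, not by the C library as in Python 2 on Windows. The file name `crlf.txt` and the sample bytes are made up for illustration.

```python
import os
import tempfile

raw = b"one\r\ntwo\r\n"  # hypothetical sample content with CRLF line endings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crlf.txt")
    with open(path, "wb") as out:
        out.write(raw)

    # Text mode: each "\r\n" pair in the file comes back as a single "\n".
    with open(path, "r", encoding="ascii") as fd:
        text = fd.read()

    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as fd:
        data = fd.read()

print(len(data), len(text))  # 10 8
```

The text view is two characters shorter than the byte view, so a count of characters read no longer matches the byte offset in the file.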

It's therefore important to open the file with 'rb' mode for reading. This becomes even more critical when you use Python 3, which uses Unicode for base strings: opening a stream with mode 'r' produces text, that is, strings of type 'str', which are Unicode; opening it with mode 'rb' produces bytes, that is, objects of <class 'bytes'>.
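As a sketch of the fixed read, written for Python 3: the helper `read_byte_at` and the sample file are hypothetical, standing in for the real `file.acc` and offset 12672 from the question.

```python
import os
import tempfile


def read_byte_at(path, offset):
    """Read one byte at `offset` in binary mode; return (byte, position after)."""
    with open(path, "rb") as fd:
        fd.seek(offset)
        data = fd.read(1)
        return data, fd.tell()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    # Include bytes that text mode may treat specially (CR, LF, Ctrl-Z).
    with open(path, "wb") as out:
        out.write(b"ab\r\ncd\x1aef")

    data, pos = read_byte_at(path, 2)

print(type(data).__name__, repr(data), "after", pos)  # bytes b'\r' after 3
```

In binary mode `tell()` advances by exactly the number of bytes read, which is the linear-offset model the question expects.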

Notes on things you did not ask about

You may use r+b for writing if you do not want to truncate an existing file, or wb to create a new file or truncate any existing file. Remember that + means "add the other mode", while w means "truncate existing or create anew for writing", so r+ is read-and-write without truncation, while w+ is write-and-read with truncation. In all cases, including the b means "... and treat as stream of bytes."
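The difference between r+b and wb can be sketched as follows (Python 3; the file name and contents are invented for the demonstration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    # r+b: existing contents survive; only the bytes written over change.
    with open(path, "r+b") as f:
        f.seek(4)
        f.write(b"XY")
    with open(path, "rb") as f:
        patched = f.read()

    # wb: the file is truncated the moment it is opened.
    with open(path, "wb") as f:
        pass
    size_after_wb = os.path.getsize(path)

print(patched, size_after_wb)  # b'0123XY6789' 0
```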

As you can see, there is a missing mode here: how do you open for writing (only) without truncation, yet creating the file if necessary? Python, like C, gives you a third letter option a (which you can also mix with + and b as usual). This opens for writing without truncation, creating a new file only if necessary—but it has the somewhat annoying side effect of forcing all writes to append, which is what the a stands for. This means you cannot open a file for writing without truncation, position into the middle of it, and overwrite just a bit of it. Instead, you must open for read-plus, position into the middle of it, and overwrite just the one bit. But the read-plus mode fails—raises an OSError exception—if the file does not currently exist.
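Both limitations can be seen in a short sketch, assuming Python 3 (where the missing-file error is `FileNotFoundError`, a subclass of `OSError`) and a system where append mode sends every write to the end of the file. File names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    # Append mode: the seek is accepted, but the write still lands at the end
    # on systems that enforce append semantics.
    with open(path, "ab") as f:
        f.seek(0)
        f.write(b"ZZ")
    with open(path, "rb") as f:
        after_append = f.read()

    # Read-plus mode: refuses to create a file that is not there.
    missing = os.path.join(tmp, "missing.bin")
    try:
        open(missing, "r+b")
        raised = False
    except OSError as exc:
        raised = True
        exc_name = type(exc).__name__

print(after_append)
print(raised, exc_name)
```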

You can open with r+ and if it fails, try again with w or w+, but the flaw here is that the operation is non-atomic: if two or more entities—let's call them Alice and Bob, though often they are just two competing programs—are trying to do this on a single file name, it's possible that Alice sees the file does not exist yet, then pauses a bit; then Bob sees that the file does not exist, creates-and-truncates it, writes contents, and closes it; then Alice resumes, and creates-and-truncates, losing Bob's data. (In practice, two competing entities like this need to cooperate anyway, but to do so reliably, they need some sort of atomic synchronization, and for that you must drop to OS-specific operations. Python 3.3 adds the x character for exclusive, which helps implement atomicity.)
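Two ways to avoid the race, sketched for Python 3.3 or later. The first uses the x mode mentioned above. The second is one possible OS-level route, not something the answer itself prescribes: `os.open` with `O_CREAT` but without `O_TRUNC` creates the file if needed and never truncates it. The helper names and payloads are hypothetical.

```python
import os
import tempfile


def create_exclusive(path, payload):
    """Create `path` and write `payload` only if `path` does not exist yet."""
    try:
        with open(path, "xb") as f:
            f.write(payload)
        return True
    except FileExistsError:
        return False


def open_rw_no_truncate(path):
    """Open for reading and writing, creating if missing, never truncating."""
    # O_BINARY exists only on Windows; fall back to 0 elsewhere.
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    return os.fdopen(fd, "r+b")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.bin")
    bob_created = create_exclusive(path, b"bob's data")
    alice_created = create_exclusive(path, b"alice")  # loses the race cleanly

    # Overwrite one byte at the start without disturbing the rest.
    with open_rw_no_truncate(path) as f:
        f.seek(0)
        f.write(b"B")
    with open(path, "rb") as f:
        final = f.read()

print(bob_created, alice_created, final)  # True False b"Bob's data"
```

Exclusive creation only settles who creates the file; two writers updating it afterwards would still need their own locking scheme.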

If you do open a stream for both reading and writing, there is another annoying caveat: any time you wish to "switch directions" you are required to introduce an apparently-pointless seek. ("Any time" is a bit too strong: e.g., after an attempt to read produces end-of-file, you may switch then as well. The set of conditions to remember, however, is somewhat difficult; it's easier to say "seek before changing directions.") This is inherited from the underlying C "standard I/O" implementation. Python could work around it—and I was just now searching to see if Python 3 does, and have not found an answer—but Python 2 did not. The underlying C implementation is also not required to have this flaw, and some, such as mine, do not, but it's safest to assume that it might, and do the apparently-pointless seek.
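A sketch of the "seek before changing directions" habit, written for Python 3. Whether the runtime strictly needs these seeks is left open here, as above; they are harmless either way. The file name and contents are invented.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "both.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    with open(path, "r+b") as f:
        head = f.read(4)    # reading
        f.seek(f.tell())    # apparently pointless seek before writing
        f.write(b"XY")      # writing at offset 4
        f.seek(0)           # seek again before switching back to reading
        result = f.read()

print(head, result)  # b'0123' b'0123XY6789'
```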
