
I'm trying to edit a text file in place in Python. It is very large, so loading it into memory is not an option. I intend to overwrite byte strings I find inside it with replacements of the same length.

with open("filename.txt", "r+b") as f:
    if f.read(8) == b"01234567":
        f.seek(-8, 1)  # step back over the 8 bytes just read
        f.write(b"87654321")

However, the write() operation adds onto the end of the file when I tried it:

>>> n.read()
'sdf'
>>> n.read(1)
''
>>> n.seek(0,0)
>>> n.read(1)
's'
>>> n.read(1)
'd'
>>> n.write("sdf")
>>> n.read(1)
''
>>> n.seek(0,0)
>>> n.read()
'sdfsdf'

I want the result of that to be sdsdf.

  • This should work with `r+b` mode. It may well not work with any `a` mode. Your code sample at the top uses `r+b` and stream bound to `f`, but your interactive example uses a stream bound to `n`, so I wonder if maybe `n` is opened differently. Or, if not, I note that your `n.read(1)` is not followed by a seek operation (the intermediate seek requirement is annoying, but is standard). – torek Nov 01 '15 at 00:10
  • sorry, the n is opened with: `n = open("test.text", "r+b")`. Intermediate seek requirement? – John F Andrews Nov 01 '15 at 00:13
  • Yes: any time you want to switch from reading to writing, or vice versa, you must invoke `seek` (even just a relative seek of 0 bytes for instance). There are a few exceptions, including "write allowed without seek if read just returned EOF", but it's easier just to always-seek. – torek Nov 01 '15 at 00:15
  • Is that documented somewhere? It works, but – John F Andrews Nov 01 '15 at 00:21
  • The original documentation is the C standard for stdio. Not sure where (if anywhere) Python docs refer back to this, nor why it wasn't fixed in the Python wrappers. For that matter, there's no fundamental reason it can't be corrected in the C library—my original BSD stdio avoided it! – torek Nov 01 '15 at 00:28
  • @torek *Your* original BSD stdio - your in the sense that you used it or wrote it? – user4815162342 Nov 01 '15 at 00:56
  • Wrote (most of it, all the float conversion code was other people's, for instance). – torek Nov 01 '15 at 01:01
  • @torek Impressive credentials. :) Maybe it would make sense to file a Python documentation bug for this - or even an enhancement request to fix it? – user4815162342 Nov 02 '15 at 16:18

2 Answers

The original ANSI / ISO C standards required a seek operation when switching a read-write mode stream from read mode to write mode, and vice versa. This restriction persists, e.g., n1570 includes this text:

When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file. Opening (or creating) a text file with update mode may instead open (or create) a binary stream in some implementations.

For whatever reason this restriction has been imported into Python [1], even though it would be possible for the Python wrappers to handle it automatically.
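In practice the rule reduces to "seek before every change of direction". Here is a minimal sketch of the question's replace-at-cursor step with both seeks made explicit. The helper name `replace_at_cursor` and the temporary demo file are assumptions for illustration, not part of the original question. As far as I know, Python 3's `io` layer does its own buffering and tolerates the switch without a seek, but the explicit seek is harmless there.

```python
import os
import tempfile


def replace_at_cursor(f, old, new):
    """Read len(old) bytes at the current position; if they equal `old`,
    step back and overwrite them with `new` (which must be the same length).

    Hypothetical helper, written for this example only.
    """
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    data = f.read(len(old))
    if data != old:
        return False
    # The relative seek moves back over the bytes just read and also serves
    # as the positioning call required between a read and a write.
    f.seek(-len(old), os.SEEK_CUR)
    f.write(new)
    # Zero-byte seek so the caller may safely read again afterwards.
    f.seek(0, os.SEEK_CUR)
    return True


# Demo on a throwaway file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"xx01234567yy")
with open(path, "r+b") as f:
    f.seek(2)
    replaced = replace_at_cursor(f, b"01234567", b"87654321")
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
```

The zero-byte `f.seek(0, os.SEEK_CUR)` is the "relative seek of 0 bytes" mentioned in the comments: it does not move the position, it only marks the switch.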

For what it's worth, the reason for the original ANSI C restriction was the low-budget implementation found on many Unix-based systems: they kept, for each stream, a single "current byte count" and a "current pointer". The count was 0 whenever the macro-ized getc and putc operations had to call into the underlying implementation, which could check whether the stream was opened in update mode and switch direction as needed. But once you had successfully read a character, the counter held the number of characters that could still be read from the buffer; and once you had successfully written a character, it held the number of buffer locations still available for writing.

This meant that if a successful getc filled the internal buffer and you followed it with a putc, the "written" character from putc would simply overwrite the buffered input data. If you had a successful putc but followed it with a poorly implemented getc, you would read unset values out of the buffer.

This problem was trivial to fix (just provide separate input and output counters, one of which is always zero, and have the functions that implement buffer-refill check for mode-switch as well).
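To make the parenthetical concrete, here is a toy model of the two-counter idea, written in Python on top of raw file descriptors. It is a hypothetical illustration only: the class `ToyStream`, its field names, and the tiny buffer size are all assumptions, and it is not the code of any real C library or of Python itself. The fast paths of `getc` and `putc` only look at their own counter; the slow paths (taken when that counter is zero) are where the direction switch is detected.

```python
import os
import tempfile


class ToyStream(object):
    """Toy update-mode stream with separate read and write counters.

    At most one of r_count / w_count is nonzero at any time.
    """

    BUFSIZE = 4  # tiny on purpose, so the slow paths get exercised

    def __init__(self, fd):
        self.fd = fd
        self.buf = bytearray()
        self.r_count = 0      # buffered bytes not yet handed to the reader
        self.w_count = 0      # free buffer slots left for the writer
        self.writing = False  # True while buf holds pending output

    def flush(self):
        """Write out pending output, if any, and leave write mode."""
        if self.writing:
            if self.buf:
                os.write(self.fd, bytes(self.buf))
            self.buf = bytearray()
            self.w_count = 0
            self.writing = False

    def getc(self):
        if self.r_count == 0:
            # Slow path: refill, flushing first if we were writing.
            self.flush()
            self.buf = bytearray(os.read(self.fd, self.BUFSIZE))
            self.r_count = len(self.buf)
            if self.r_count == 0:
                return None  # end of file
        byte = self.buf[len(self.buf) - self.r_count]
        self.r_count -= 1
        return byte

    def putc(self, byte):
        if self.w_count == 0:
            # Slow path: make room, switching direction if needed.
            if self.writing:
                self.flush()  # buffer was full
            elif self.r_count:
                # Undo the read-ahead so output lands at the logical position.
                os.lseek(self.fd, -self.r_count, os.SEEK_CUR)
                self.r_count = 0
            self.buf = bytearray()
            self.w_count = self.BUFSIZE
            self.writing = True
        self.buf.append(byte)
        self.w_count -= 1


# Demo: the question's scenario, with no explicit seek between read and write.
fd, path = tempfile.mkstemp()
os.write(fd, b"sdf")
os.lseek(fd, 0, os.SEEK_SET)
stream = ToyStream(fd)
first_two = [stream.getc(), stream.getc()]
for byte in bytearray(b"sdf"):
    stream.putc(byte)
stream.flush()
os.close(fd)
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
```

In this model the file ends up containing `sdsdf`, the result the question asked for, because `putc` notices the leftover read-ahead and rewinds over it before buffering output.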


[1] Citation needed :-)

torek

Compare the following two snippets:

>>> f = open("file.txt", "r+b")
>>> f.seek(2)
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdsdf'


>>> f = open("file.txt", "r+b")
>>> f.read(1)
's'
>>> f.read(1)
'd'
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdfsdf'

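After the two .read() calls the stream is still in reading mode, and a .write() issued directly afterwards does not land at the position you expect. Here it ended up at the end of the file, most likely because the buffered read had already advanced the underlying file position that far. Only .seek() resets the position and makes the switch from reading to writing valid; .read() does not. So you have to call .seek() before writing the bytes. The following code works as intended: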

>>> f = open("file.txt", "r+b")
>>> f.read(1)
's'
>>> f.read(1)
'd'
>>> f.seek(2)
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdsdf'
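
Since the file in the question is too large to load into memory, the same seek-before-write pattern can be applied chunk by chunk. The following is a minimal sketch, assuming the replacement has the same length as the search string. The function name `replace_in_place`, the `chunk_size` parameter, and the demo data are assumptions for illustration. Each window overlaps the next by `len(old) - 1` bytes so that a match straddling a chunk boundary is still found, and `done` keeps freshly written bytes from being rescanned.

```python
import os
import tempfile


def replace_in_place(path, old, new, chunk_size=1 << 20):
    """Overwrite every occurrence of `old` in the file with `new`.

    Hypothetical helper: `old` and `new` must be non-empty byte strings of
    equal length. Returns the number of replacements made.
    """
    if not old or len(old) != len(new):
        raise ValueError("old and new must be non-empty and equally long")
    overlap = len(old) - 1
    count = 0
    done = 0  # absolute offset up to which the file is already final
    with open(path, "r+b") as f:
        pos = 0  # absolute offset of the current window
        while True:
            f.seek(pos)  # explicit seek before every read
            window = f.read(chunk_size + overlap)
            i = window.find(old, max(0, done - pos))
            while i != -1:
                f.seek(pos + i)  # explicit seek before every write
                f.write(new)
                done = pos + i + len(old)
                count += 1
                i = window.find(old, i + len(old))
            if len(window) < chunk_size + overlap:
                break  # short read: end of file reached
            pos += chunk_size
    return count


# Demo with a deliberately tiny chunk size so matches cross chunk boundaries.
data = b"ab01234567cd0123456701234567ef"
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(data)
count = replace_in_place(path, b"01234567", b"87654321", chunk_size=5)
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
```

For a real file the default one-megabyte chunk is a reasonable starting point; the tiny value in the demo is only there to exercise the boundary handling.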
Hengfeng Li