
I'm writing a Python script that processes very large (> 10 GB) files. Since loading the whole file into memory is not an option, I'm currently reading and processing it line by line:

for line in f:
    ...  # process the line

Once the script is finished, it will run fairly often, so I'm starting to think about what impact that sort of reading will have on my disk's lifespan.

Will the script actually read line by line or is there some kind of OS-powered buffering happening? If not, should I implement some kind of intermediary buffer myself? Is hitting the disk that often actually harmful? I remember reading something about BitTorrent wearing out disks quickly exactly because of that kind of bitwise reading/writing rather than operating with larger chunks of data.

I'm using both an HDD and an SSD in my test environment, so answers would be interesting for both systems.

Karl Wolf
    Whatever your script does, it will cause the most simple, sequential read on disk. There's nothing more performant and less disk-wearing than that. Don't worry. Buffering, caching and prefetching by the OS will make sure you get decent performance without further action on your side. – JimmyB Feb 15 '16 at 10:30
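
If you want to make that sequential access pattern explicit to the kernel, here is a minimal, optional sketch (Unix-only, and the file name is just a placeholder) that hints the read-ahead with os.posix_fadvise; the script behaves exactly the same without it:

import os

# Unix-only optimization hint: tell the kernel we will read sequentially,
# so it can prefetch larger chunks. Everything works the same without it.
with open("big_file.txt") as f:  # placeholder file name
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    for line in f:
        ...  # process the line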

1 Answer


Both your OS and Python use buffers to read data in larger chunks, for performance reasons. Your disk will not be materially impacted by reading a file line by line from Python.
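
As an illustration (the file name below is just a placeholder), open() already returns a buffered file object, and you can even ask for a larger buffer explicitly via the buffering argument, although the default is normally fine:

import io

print(io.DEFAULT_BUFFER_SIZE)  # the block size Python reads at a time, typically 8 KiB

# open() buffers by default; a larger fixed-size buffer can be requested,
# but for plain sequential line-by-line reading it rarely matters.
with open("big_file.txt", buffering=1024 * 1024) as f:
    for line in f:
        ...  # process the line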

Specifically, Python cannot give you individual lines without scanning ahead to find the line separators, so it'll read chunks, parse out individual lines, and each iteration will take lines from the buffer until another chunk must be read to find the next set of lines. The OS uses a buffer cache to help speed up I/O in general.
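
You can observe that read-ahead from Python itself; a small sketch (placeholder file name again) comparing the single line you consumed with how far the underlying OS-level file has actually advanced:

with open("big_file.txt") as f:  # placeholder file name
    first_line = next(f)  # consume just one line
    # f.buffer.raw is the unbuffered, OS-level file object; its position is
    # typically a whole block ahead of the one line we actually consumed.
    print(len(first_line), f.buffer.raw.tell())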

Martijn Pieters