0

I have many files on disk that I need to read. My first option is to use multiple threads, which performs very well on an SSD (when a thread blocks on I/O, it releases the GIL).

But I want to achieve similar or faster speed without an SSD, so I pre-load the files into memory (e.g. store them in a dict) and have every thread read the file contents from memory. Unfortunately, perhaps because of the GIL, there is a lock around the dict, and it ends up even slower than loading the files from the SSD!
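For reference, a minimal sketch of that set-up (the directory name and worker count are just placeholders):

    # Pre-load every file into a dict, then let worker threads read from memory.
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    paths = list(Path("data").glob("*"))            # placeholder data directory
    cache = {p: p.read_bytes() for p in paths}      # pre-load all contents into memory

    def work(path):
        content = cache[path]                       # in-memory lookup instead of a disk read
        return len(content)                         # stand-in for the real processing

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(work, paths))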

So my question is: is there any way to create a read-only memory buffer without a lock/the GIL? Something like a ramdisk, or something else?

  • If you really want as much speed as possible, how about rewriting your program (or at least the speed-critical parts of it) in C or C++, or some other fully-compiled language? Then you'd have no GIL, and also no interpreter overhead at all, since you'd be running a native executable. – Jeremy Friesner Nov 01 '16 at 05:37

2 Answers

1

In short, no.

Even though Python (CPython in particular) supports multiple threads, at any instant the interpreter can run only one piece of Python code. Therefore, if your pure-Python program does not contain blocking I/O (e.g. it only accesses a lock-free memory buffer), it degrades to a single-threaded program no matter what you do. In fact the performance will be worse than an actual single-threaded program, because of the overhead of synchronizing with the other threads.
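For illustration, a rough sketch of that effect (the numbers are arbitrary): a CPU-bound pure-Python task gains nothing from threads.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def burn(n=2_000_000):
        # pure-Python CPU work; the GIL is never released here
        total = 0
        for i in range(n):
            total += i * i
        return total

    start = time.perf_counter()
    for _ in range(4):
        burn()
    print("sequential:", time.perf_counter() - start)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda _: burn(), range(4)))
    print("4 threads: ", time.perf_counter() - start)  # about the same, often slower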

(Special thanks to Graham Dumpleton!) One solution is to write a C extension for CPython and release the GIL when you enter the "realm of C". Just be careful: you can't touch Python objects without holding the GIL, otherwise you will get subtle bugs or an outright crash.
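You can see the same mechanism from pure Python with ctypes, which drops the GIL while a foreign C function runs. This is only an illustration, not a real extension; libc's usleep stands in for useful C-level work, and a POSIX libc is assumed:

    import ctypes, ctypes.util, time
    from concurrent.futures import ThreadPoolExecutor

    libc = ctypes.CDLL(ctypes.util.find_library("c"))

    def sleep_in_c(_):
        # 200 ms spent inside C; ctypes releases the GIL for the duration of the call
        libc.usleep(200_000)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(sleep_in_c, range(4)))
    print("elapsed:", time.perf_counter() - start)  # ~0.2 s, not ~0.8 s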

There are Python implementations that do not have a GIL, for example Jython and IronPython (unlike CPython); Cython also lets you release the GIL inside nogil blocks. You can try them, but keep in mind that writing a correct multithreaded program is hard, and writing a fast one is even harder. My suggestion is to write a multi-process program instead of a multithreaded one, and pass data via IPC (for example ZeroMQ; it's easy to use and lightweight).
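A minimal sketch of that suggestion, using the standard multiprocessing module rather than ZeroMQ to keep it self-contained (the directory name is a placeholder):

    from multiprocessing import Pool
    from pathlib import Path

    def process(path):
        # each worker process has its own interpreter and its own GIL
        data = Path(path).read_bytes()
        return path, len(data)          # stand-in for the real processing

    if __name__ == "__main__":
        paths = [str(p) for p in Path("data").glob("*")]   # placeholder directory
        with Pool(processes=4) as pool:
            for path, size in pool.map(process, paths):
                print(path, size)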

HKTonyLee
  • 3,111
  • 23
  • 34
  • 1
Not quite. Since C threads are used under the covers in CPython, technically multiple threads can still be running, but only one at a time is allowed to run Python code. So it's a subtle difference from what you describe. With C extensions for CPython, if the data they operate on doesn't require the Python global interpreter lock (i.e. no Python data objects), then multiple threads can happily run at the same time. – Graham Dumpleton Nov 01 '16 at 05:13
0

Let me add a few points to @HKTonyLee's answer.

So Python has this GIL, but it is released during blocking operations such as file I/O. This means that you can read files in parallel. And since, from the process's point of view, there is no such thing as a file, only file descriptors (assuming POSIX), whatever you read does not have to be stored on a disk.

All in all, if you move your files to (for example) tmpfs, a ramdisk, or any equivalent, you should get even better performance than with an SSD. Note the risk, however: if you need to modify the files, you may lose the updates.
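For example, a small sketch along those lines (the /dev/shm location is just one common tmpfs mount on Linux, and the source directory is a placeholder): copy the files into tmpfs once, then read them from threads; the GIL is released inside each blocking read, so the reads overlap.

    import shutil
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    src = Path("data")                      # placeholder source directory
    dst = Path("/dev/shm/mydata")           # tmpfs mount on most Linux systems
    dst.mkdir(parents=True, exist_ok=True)
    for p in src.glob("*"):
        shutil.copy(p, dst / p.name)        # one-time copy into RAM-backed storage

    def read_file(path):
        with open(path, "rb") as f:
            return f.read()                 # GIL released while read() blocks

    with ThreadPoolExecutor(max_workers=8) as pool:
        files = list(dst.glob("*"))
        contents = dict(zip(files, pool.map(read_file, files)))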

freakish
  • 54,167
  • 9
  • 132
  • 169
  • sadly the GIL is incredibly bad: the more threads you have doing CPU work, the slower I/O reads (on UDP sockets, for instance) become, and it will start to drop packets... – Enerccio Aug 12 '18 at 21:35