
I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then reacquire it on completion, and by doing so make use of multiple cores if they're available.

I'm using multiprocessing to parallelize some code and currently I've got three processes: the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.

Now if I'm understanding this right, it seems that creating a new process to read the file, as I'm currently doing, is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?

So given my function to read and pipe the data to be processed:

def read(file_path, pipe_out):
    # block_size is defined elsewhere in the real code
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.close()

I reckon that this will definitely make use of multiple cores, but also introduces some overhead:

multiprocessing.Process(target=read, args=args).start()

But now I'm wondering if just doing this will also use multiple cores, minus the overhead:

read(*args)

Any insights anybody has as to which one would be faster and for what reason would be much appreciated!

redrah
  • Why don't you just try it out and see which is faster, and whether multiple cores are used, when reading in the main process? – bpgergo Aug 31 '12 at 11:01
  • You will have to actually create a thread. Just calling ``read(*args)`` won't do any threading (sorry if you know this, it's a bit unclear in your question). – Jonas Schäfer Aug 31 '12 at 11:03
  • @Jonas that was my understanding; no explicit creation of threads means no threads get created... But the research I've done has led me to believe that CPython can in some cases create threads of its own accord to do low-level stuff, though they're not exposed through the API. I can't help feeling I've misunderstood something somewhere though, which is why I'm interested in getting a better understanding. – redrah Aug 31 '12 at 11:17
  • Is your goal to read the file as fast as possible? File reading is in any case I/O bound. You cannot increase the data rate by using more than one CPU core at the same time. Also, 'low level' CPython is not doing this. As long as you read the file in one dedicated process or thread (even in the case of CPython with its GIL, a thread is fine), you will get as much data per time as you can get from the storage device. – Dr. Jan-Philip Gehrcke Aug 31 '12 at 11:26
  • @redrah: [`subprocess.Popen.communicate`](http://docs.python.org/library/subprocess.html#subprocess.Popen.communicate) and maybe a few more functions in the standard library implicitly create threads, but `subprocess` does so to avoid a deadlock, not for performance reasons. Try grepping the files in `Lib/` in the Python source for `"[Tt]hread"`; you won't find much outside the actual threading libraries. – Fred Foo Aug 31 '12 at 11:26
  • @Jan-PhilipGehrcke No I appreciate that there's no way I can speed up I/O, in fact that's part of my rationale behind using separate threads/processes. The files in question are sometimes remote, and therefore are subject to fluctuations in transfer rate so I didn't want the processing to be waiting for the read to complete and vice versa. – redrah Aug 31 '12 at 11:39
  • This suggests that there are several files you are working on at the same time. For such a scenario, of course, concurrency is the way to go. – Dr. Jan-Philip Gehrcke Aug 31 '12 at 11:43
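
A minimal sketch of the per-file concurrency described in the comment above (the md5 checksum, block size and file paths are placeholder assumptions, not taken from the question): checksumming several files at once means a slow transfer on one file doesn't stall the others.

import hashlib
from concurrent.futures import ThreadPoolExecutor

def checksum_file(file_path, block_size=1024 * 1024):
    digest = hashlib.md5()
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            digest.update(block)
    return file_path, digest.hexdigest()

# Threads are enough here: file_.read() releases the GIL while waiting on I/O.
paths = ['/mnt/remote/a.bin', '/mnt/remote/b.bin']   # placeholder paths
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, digest in pool.map(checksum_file, paths):
        print(path, digest)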

2 Answers


Okay, as came out in the comments, the actual question is:

Does (C)Python create threads on its own, and if so, how can I make use of that?

Short answer: No.

But the reason why these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel; this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the above statement: no two pieces of Python code can run in parallel in the same interpreter.

Nevertheless, you can still make use of multithreading in Python, namely when you're doing a lot of I/O or making heavy use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the Python interpreter).
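
For illustration, a minimal thread-based sketch of the reader/checksum pipeline from the question (the md5 checksum, block size and file path are assumptions, not taken from the question): file.read() releases the GIL while it waits for data, so the reader thread and the checksumming in the main thread can overlap.

import hashlib
import queue
import threading

def read(file_path, out_queue, block_size=1024 * 1024):
    # file_.read() releases the GIL while blocked on I/O, so this thread
    # does not prevent the main thread from checksumming in the meantime.
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            out_queue.put(block)
    out_queue.put(None)                      # sentinel: no more data

def checksum_from_queue(in_queue):
    digest = hashlib.md5()
    while True:
        block = in_queue.get()
        if block is None:
            break
        digest.update(block)
    return digest.hexdigest()

blocks = queue.Queue(maxsize=16)             # bounded, so the reader can't run far ahead
reader = threading.Thread(target=read, args=('some_file.bin', blocks))
reader.start()
print(checksum_from_queue(blocks))
reader.join()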

I hope that cleared up the issue a bit.

Jonas Schäfer
  • So in my original example I'd be better off using `threading.Thread(target=read)` over `multiprocessing`, as the bulk of the work done in file.read() will be done with the GIL released and will therefore be able to leverage multiple cores? – redrah Aug 31 '12 at 12:39
  • It _could_ be better. Without knowing your whole program, it will be hard to tell. Multiprocessing often has a larger synchronization overhead if any synchronization is happening (because you don't share _all_ memory), but multithreading is more expensive with Python. You really should just do a test with it. Make some benchmarking setup, possibly with faked slow input via a FIFO, and test both models. Afaik the ``multiprocessing`` and ``threading`` modules can be used largely interchangeably. – Jonas Schäfer Aug 31 '12 at 12:51
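
A rough benchmarking harness along those lines might look like the sketch below (the file path, block size and md5 checksum are placeholders, and a None sentinel is used instead of relying on EOFError so the very same code works for both threads and processes):

import hashlib
import multiprocessing
import threading
import time

BLOCK_SIZE = 1024 * 1024                     # placeholder block size

def read(file_path, pipe_out):
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(BLOCK_SIZE)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.send(None)                      # sentinel: end of data

def checksum(pipe_in):
    digest = hashlib.md5()
    while True:
        block = pipe_in.recv()
        if block is None:
            break
        digest.update(block)
    return digest.hexdigest()

def benchmark(worker_cls, file_path):
    pipe_in, pipe_out = multiprocessing.Pipe(duplex=False)
    start = time.perf_counter()
    worker = worker_cls(target=read, args=(file_path, pipe_out))
    worker.start()
    digest = checksum(pipe_in)               # consume in the main thread
    worker.join()
    return time.perf_counter() - start, digest

if __name__ == '__main__':
    for worker_cls in (threading.Thread, multiprocessing.Process):
        elapsed, digest = benchmark(worker_cls, 'some_large_file.bin')
        print(worker_cls.__name__, round(elapsed, 2), digest)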

I think this is the main part of your question:

The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?

I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound, not CPU bound: you cannot process data faster than you are able to read it, so file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading, and 'low level' CPython is not doing this behind your back either. As long as you read the file in one process or thread (even in the case of CPython with its GIL, a thread is fine), you will get as much data per unit of time as the storage device can deliver. It is also fine to do the file reading in the main thread, as long as there are no other blocking calls there that would actually slow down the file reading.
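
One way to apply this (a sketch only; the md5 checksum, block size and file path are placeholders): keep the read loop in the main process and hand blocks to a single checksum worker over a pipe, so CPU work never stalls the read.

import hashlib
import multiprocessing

def checksum(pipe_in):
    digest = hashlib.md5()
    while True:
        block = pipe_in.recv()
        if block is None:                    # sentinel sent by the reader
            break
        digest.update(block)
    print(digest.hexdigest())

if __name__ == '__main__':
    block_size = 1024 * 1024
    pipe_in, pipe_out = multiprocessing.Pipe(duplex=False)
    worker = multiprocessing.Process(target=checksum, args=(pipe_in,))
    worker.start()
    with open('some_file.bin', 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.send(None)                      # tell the worker we are done
    worker.join()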

Dr. Jan-Philip Gehrcke
  • There are actually many ways to speed up I/O, for example with basic [Parallel HDF5](https://www.hdfgroup.org/2015/08/parallel-io-with-hdf5/) or [ADIOS](https://csmd.ornl.gov/adios). You said that concurrent threads/processes cannot increase read performance, but concurrent nodes certainly can. With Parallel HDF5 I regularly read 2TB files in seconds rather than dozens of hours. The correct answer is the one that was accepted by the OP, and written earlier than this one. – Nike Jan 17 '23 at 02:19
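
For reference, a minimal sketch of such a parallel read, assuming h5py built against a parallel HDF5 plus mpi4py, with a placeholder file and dataset name (run with e.g. mpiexec -n 4 python script.py):

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

with h5py.File('big_file.h5', 'r', driver='mpio', comm=comm) as f:
    dset = f['data']                         # hypothetical dataset name
    # Each rank reads a disjoint slice of the first axis.
    chunk = len(dset) // comm.size
    start = comm.rank * chunk
    stop = len(dset) if comm.rank == comm.size - 1 else start + chunk
    local_part = dset[start:stop]
    print(comm.rank, local_part.shape)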