I'm trying to thread my code for better performance, using the Process class from the multiprocessing library.

The skeleton of the code is to create a dictionary for each thread to work on; after it's all done, the dictionaries are summed and saved to a file. The resources are created like this:

histos = {}
for i in range(number_of_threads):
    histos[i] = {}
    histos[i]['all'] =      ROOT.TH1F objects
    histos[i]['kinds_of'] = ROOT.TH1F objects
    histos[i]['keys'] =     ROOT.TH1F objects

Then in the Processes, each thread works with its own histos[thread_number] object, working on the contained ROOT.TH1Fs. However, my problem is that apparently if I start the threads with Process like this:

proc = {}
for i in range(Nthreads):
    it0 = i * n_entries // Nthreads        # just dividing up the workload
    it1 = (i + 1) * n_entries // Nthreads  # (integer division, so the bounds are valid indices)
    proc[i] = Process(target=RecoAndRecoFix, args=(i, it0, it1, ch, histos))
    # args: i is the thread id (index), it0 and it1 are indices for the workload,
    # ch is a variable that is read-only, and histos is what we defined before;
    # the contained TH1Fs are what the threads put their output into.
    # The RecoAndRecoFix function works inside with histos[i], thus only accessing
    # the ROOT.TH1F objects that are unique to it. Each thread works with its own histos[i] object.
    proc[i].start()

then the threads do have access to their histos[i] objects, but cannot write to them. To be precise, when I call Fill() on the TH1F histograms, no data shows up, because they cannot write to the objects since these are not shared variables.
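To illustrate with something smaller than my actual code, here is a minimal standalone sketch of the same behaviour (a plain dict stands in for the histogram containers; the names are made up):

from multiprocessing import Process

def worker(d):
    d['filled'] = True  # the write seems to succeed inside the child...

if __name__ == '__main__':
    d = {'filled': False}
    p = Process(target=worker, args=(d,))
    p.start()
    p.join()
    print(d['filled'])  # ...but the parent still prints False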

In the docs (https://docs.python.org/3/library/multiprocessing.html) I've found that I should instead use multiprocessing.Array() to create an array that can be both read and written by the threads, like this:

typecoder = {}
histos = Array(typecoder,number_of_threads)
for i in range(number_of_threads):
    histos[i] = {}
    histos[i]['all'] =      ROOT.TH1F objects
    histos[i]['kinds_of'] = ROOT.TH1F objects
    histos[i]['keys'] =     ROOT.TH1F objects

However, it won't accept a dictionary as the type; it fails with TypeError: unhashable type: 'dict'.

So what would be the best approach to solve this issue? What I need is to pass each thread its own instance of all the "kinds of keys" stored in the dictionaries, so each works on its own, and each must be able to write to these received resources.

Thanks for your help, and sorry if I'm overlooking something trivial; I've written threaded code before, but not yet in Python.

WhiteWolf

2 Answers


The missing piece is the distinction between "process" and "thread"; you mix them in your post, but your approach will only work with threads, not with processes.

Threads all share memory; all of them will refer to the same dictionary, and can therefore use it to communicate with each other and with the parent.

Processes have separate memory; each will get its own copy of the dictionary. If they want to communicate, they have to do so by other means (for example, using a multiprocessing.Queue). On the other hand, this means they get the safety of separation.
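For instance, a minimal sketch of that pattern (the worker and its arguments are illustrative, not taken from the question):

from multiprocessing import Process, Queue

def worker(worker_id, it0, it1, queue):
    # ... process entries it0..it1 and build a partial result ...
    partial = {'count': it1 - it0}   # placeholder payload
    queue.put((worker_id, partial))  # hand the result back to the parent

if __name__ == '__main__':
    queue = Queue()
    procs = [Process(target=worker, args=(i, i * 100, (i + 1) * 100, queue))
             for i in range(4)]
    for p in procs:
        p.start()
    # drain the queue before joining, so large payloads cannot deadlock
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    print(results)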

An additional complication in Python is "the GIL" (Global Interpreter Lock): threads mostly share the same Python interpreter serially, only running in parallel when doing I/O, accessing the network, or using one of the few libraries that make special provision for it (numpy, image processing, a couple of others). Processes, meanwhile, get full parallelism.
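A rough way to see the GIL in action (pure-Python, CPU-bound work, so the threads serialise):

import threading
import time

def burn():
    total = 0
    for _ in range(10_000_000):  # pure-Python loop: holds the GIL
        total += 1

start = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# on CPython the elapsed time is roughly the sum of the two loops,
# not the maximum, because only one thread runs bytecode at a time
print(f"elapsed: {time.perf_counter() - start:.2f} s")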

Jiří Baum
  • Thanks, switching from Process to Thread indeed allows the threads to use the dictionary in the args. However, this seems much slower: with Process they popped up and started immediately, while Threads seem to start very slowly. Also, strange indexing errors popped up in the target function, rather randomly, and I don't understand why. With Process it ran reliably, it was just that there was no output; with Thread, there are indexing errors? All I did was change proc[i] = Process(...) to proc[i] = threading.Thread(...). Should I have changed something else as well to accommodate the Thread change? – WhiteWolf Apr 12 '21 at 15:59
  • Yeah, (a) threads will mostly share the same Python interpreter serially, and (b) processes get the safety of separation. If you have jobs that mostly use Python code (with little I/O, network or numpy operations), switch back to processes and pass the results back via a multiprocessing.Queue or similar. – Jiří Baum Apr 13 '21 at 02:56
  • Thanks for all the help! The end solution was indeed to revert to Process-es, but instead of trying to communicate their partial results back to the parent process, they now each save to their own ROOT output file (sketched below). Another script can then later just collect these saved files and stack them. It all works fine now, thanks again for the explanations :) – WhiteWolf Apr 15 '21 at 13:50
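A minimal sketch of that final pattern, assuming PyROOT is available (the file names, binning and merge step are illustrative):

from multiprocessing import Process
import ROOT

def worker(i, it0, it1):
    out = ROOT.TFile(f"partial_{i}.root", "RECREATE")  # one output file per process
    h = ROOT.TH1F("all", "all", 100, 0.0, 1.0)
    # ... fill h from entries it0..it1 ...
    h.Write()
    out.Close()

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i, i * 100, (i + 1) * 100))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # a separate step can then stack the partial files, e.g. with ROOT's
    # hadd utility: hadd merged.root partial_*.root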

The Python multiprocessing module has a Manager class that provides dictionaries which can be shared across threads and processes.

See the documentation for examples: https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes
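For example, a minimal sketch using a managed dictionary (the names are illustrative; see the caveat in the comment below about nested objects):

from multiprocessing import Manager, Process

def worker(i, shared):
    # assigning a value into the managed dict propagates back to the parent
    shared[i] = {'count': i * 10}

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.dict()
        procs = [Process(target=worker, args=(i, shared)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(shared))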

RaJa
  • It's more complicated than that. The dictionary contains objects whose methods are presumably called, and those calls conceivably update the objects. These updates need to be reflected back to the main process, so they need to be "managed" objects too. – Booboo Apr 12 '21 at 14:49