We tried to parallelize our Python program using threads, but the CPU is never fully utilized. All 8 cores are in use, but only at roughly 50-60% each, sometimes lower. Why doesn't the calculation drive the CPU to 100%?

We are programming in Python on Windows.

Here is our implementation for the multithreading:

from threading import Thread
import hashlib

class CalculationThread(Thread):
    def __init__(self, target: str):
        Thread.__init__(self)
        self.target = target  # path of the file to hash

    def run(self):
        # Hash the same file 1000 times to generate load.
        for i in range(1000):
            hash_md5 = hashlib.md5()
            with open(str(self.target), "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            digest = hash_md5.hexdigest()
        print(self.name + " finished")

threads = []
for i in range(20):
    t = CalculationThread(target="baden-wuerttemberg-latest.osm.pbf")
    print("Worker " + t.name + " started")
    t.start()
    threads.append(t)

# Wait for all workers to finish.
for t in threads:
    t.join()

[Image: CPU workload while running the calculation]

funie200
Arikuma
  • Do you use an SSD or an HDD? I mean, the bottleneck could be disk I/O. – viilpe Dec 03 '20 at 12:37
  • We use an SSD, and it shows a workload of only 1%, so the SSD shouldn't be the bottleneck. – Arikuma Dec 03 '20 at 14:26
  • Task Manager is not the best place to check workload. The theoretical SATA-3 bandwidth is only 600 MB/s, and 20 threads is a lot. I tested your code and got about 460-470 MB/s for I/O Delta Read Bytes in Process Explorer. Perhaps you'd better try running your code on a RAM disk or an NVMe disk, but I'm not sure. – viilpe Dec 03 '20 at 17:09

1 Answer

Because of the GIL (global interpreter lock), Python threads cannot run truly in parallel on multiple cores, especially for compute-intensive tasks.

You still get some improvement because your task is also partly I/O-bound (you read from the disk).

One way to find out what your program is doing in each thread is to use a tool that supports multithreading, such as VizTracer. It will show you how much time is spent in your MD5 calculation.
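If installing a tracer is not an option, a rough first estimate can be had with the standard library alone. The sketch below is not VizTracer; it simply times the read calls and the hash updates separately with time.perf_counter. The function name profile_md5 and its chunk_size parameter are made up for illustration, and with 4096-byte chunks the timer calls themselves add noticeable overhead, so treat the numbers as approximate.

```python
import hashlib
import time


def profile_md5(path, chunk_size=4096):
    """Hash `path` and return (hex digest, seconds reading, seconds hashing).

    Hypothetical helper: the name and signature are illustrative only.
    """
    digest = hashlib.md5()
    read_time = 0.0
    hash_time = 0.0
    with open(path, "rb") as f:
        while True:
            t0 = time.perf_counter()
            chunk = f.read(chunk_size)
            t1 = time.perf_counter()
            read_time += t1 - t0
            if not chunk:  # empty bytes object means end of file
                break
            digest.update(chunk)
            hash_time += time.perf_counter() - t1
    return digest.hexdigest(), read_time, hash_time
```

Calling profile_md5("baden-wuerttemberg-latest.osm.pbf") once and comparing the two timings should give a hint whether reading or hashing dominates in a single thread.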

However, the correct way to get real parallelism is to use the multiprocessing library, probably with a Pool, so the work runs in multiple processes instead of multiple threads.
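A minimal sketch of the Pool approach, assuming the goal is simply to hash the same file many times as in the question. The names md5_of_file and hash_in_parallel, the 1 MiB chunk size, and the default of 20 jobs are choices made for this example, not anything from the original code. A larger chunk size is used because it reduces the number of Python-level loop iterations per file.

```python
import hashlib
import sys
from multiprocessing import Pool


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_in_parallel(path, jobs=20, workers=None):
    """Hash `path` `jobs` times, spread over `workers` processes.

    With workers=None, Pool picks a default based on the CPU count.
    """
    with Pool(processes=workers) as pool:
        return pool.map(md5_of_file, [path] * jobs)


if __name__ == "__main__":
    # Usage (assumed): python hash_pool.py baden-wuerttemberg-latest.osm.pbf
    for path in sys.argv[1:]:
        digests = hash_in_parallel(path)
        print(path, digests[0])
```

The `if __name__ == "__main__":` guard matters on Windows, where worker processes are started by spawning a fresh interpreter that re-imports the main module. Without the guard, each worker would try to create its own pool.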

minker