
I am trying to reduce the time it takes to process a file in Python. My idea is to split the task across n threads.

For example, if I have a file that has 1300 items in it, I want each thread to process every nth item. Each item has no dependency on any other item, so order doesn't matter here.

So the workflow would be something like this for each thread:

1) open file
2) iterate through items
3) if nth item then process, otherwise continue

I am using the threading library to do this but I am not seeing any performance improvements.

Here is the pseudocode:

import threading
from elftools.elf.elffile import ELFFile

def driver(self):
    threads = []
    # Just picked 10 as a test, so trying to create 10 threads;
    # each worker gets its offset (i) and the stride (10)
    for i in range(10):
        threads.append(threading.Thread(target=self.workerFunc, args=(filepath, i, 10)))

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

def workerFunc(self, filepath, offset, stride):
    with open(filepath, 'rb') as file:
        obj = ELFFile(file)
        # iterate over the items in the file (pseudocode) and keep only
        # every nth item assigned to this worker
        for index, item in enumerate(obj.items):
            if index % stride != offset:
                continue
            # process this item

Since every thread is just reading the file, it should be able to scan through the file freely without caring about what other threads are doing or getting blocked by them, right?

What am I overlooking here?

The only thing I can think of is that the library I'm using to parse these files (pyelftools' ELFFile) has something internal that is blocking, but I can't find it. Or is there something fundamentally flawed with my plan?

EDIT: just to note, there are 32 CPUs on the system I am running this on.

TreeWater
  • It's unlikely that you will get any performance advantage [due to the GIL](https://realpython.com/python-gil/). If you are processing files and performance is important, it's worthwhile to switch to more performant tools or languages for processing. – Jan Christoph Terasa Dec 05 '19 at 06:25
  • +1 to @JanChristophTerasa's comment. Have you tried a similar approach with the `multiprocessing` module (which can achieve actual parallelism)? https://docs.python.org/2/library/multiprocessing.html – kingkupps Dec 05 '19 at 06:30
  • @kingkupps Good point about multiprocessing, but I have found that this is usually only useful when not dealing with shared state, i.e., the processes can run independently and you aggregate the results at the end. But instead of using `multiprocessing` I found it far easier to just write single-process Python code and use [GNU Parallel](https://www.gnu.org/software/parallel/) for multiprocessing. It also offers the ability to split an input file into chunks to be processed. – Jan Christoph Terasa Dec 05 '19 at 06:38
  • I had a similar kind of job to do, with 500-600 files. I chunked them into 8 groups (based on the CPUs available) and used multiprocessing. Multithreading is not truly parallel. For example, on an Arduino microcontroller I/O is slow, so with multithreading the CPU can do other work between an I/O request and its response. In your case, if you only had a single-core CPU, the job could not finish any faster, since the same CPU does all the work and the other threads just wait in a queue. My suggestion would be to use multiprocessing. – Epsi95 Dec 05 '19 at 06:57
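
To make the GIL point in the comments concrete, here is a small, self-contained benchmark (not from the question; `cpu_bound` is just a hypothetical stand-in for CPU-heavy per-item work). If the per-item processing is CPU-bound pure Python, four threads take roughly as long in total as running the work four times sequentially, because only one thread executes Python bytecode at a time:

import threading
import time

def cpu_bound(n=10_000_000):
    # pure-Python arithmetic; holds the GIL the whole time it runs
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_run(num_threads):
    threads = [threading.Thread(target=cpu_bound) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print('1 thread :', timed_run(1))
print('4 threads:', timed_run(4))   # roughly 4x the single-thread time, not ~1x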
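
A minimal sketch of the same every-nth-item layout using `multiprocessing.Process` instead of `threading.Thread`, as the comments suggest, so the workers run on separate CPUs. Since the question does not say what the "items" are, `iter_sections()` (a real pyelftools method) and the `print` call are only placeholders for the actual iteration and processing:

import multiprocessing
import sys

from elftools.elf.elffile import ELFFile

def worker(filepath, offset, stride):
    # each process opens its own file handle and parses the file independently
    with open(filepath, 'rb') as f:
        obj = ELFFile(f)
        # iter_sections() stands in for the question's "items"; swap in
        # whatever iteration the real code does
        for index, section in enumerate(obj.iter_sections()):
            if index % stride == offset:
                print(offset, section.name)   # placeholder for "process this item"

if __name__ == '__main__':
    filepath = sys.argv[1]
    num_workers = 10
    procs = [multiprocessing.Process(target=worker, args=(filepath, i, num_workers))
             for i in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()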
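
And a rough sketch of the chunking strategy from the last comment, sized to the available CPUs with a `multiprocessing.Pool`. `process_chunk` is a hypothetical helper standing in for the real ELF processing; it just receives a contiguous block of item indices:

import multiprocessing

def process_chunk(indices):
    # hypothetical worker: open the file here and handle only the items at
    # these positions; return whatever the real processing produces
    return len(indices)

if __name__ == '__main__':
    total_items = 1300
    num_workers = multiprocessing.cpu_count()      # 32 on the asker's system
    chunk_size = -(-total_items // num_workers)    # ceiling division
    chunks = [list(range(start, min(start + chunk_size, total_items)))
              for start in range(0, total_items, chunk_size)]
    with multiprocessing.Pool(num_workers) as pool:
        print(pool.map(process_chunk, chunks))     # list of per-chunk results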

0 Answers