
I have this code which I would like to use multi-processing to speed up:

import numpy as np

matrix = []

for i in range(len(datasplit)):
    matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))

The variable "datasplit" is a comma-separated list of strings. Each string has around 50 numbers which are separated by a space. For each string, this code adds commas between these numbers instead of spaces, turns the entire string into an array, and turns each individual number into a string. This would now look like a an array of comma-separated strings where each string is 1 of the 50 numbers. The code then turns these strings into floats, so now we have an array of 50 comma separated numbers. After the code has run, printing, "matrix" would give a list of arrays, where each array has 50 comma separated numbers.

Now my problem is that datasplit is huge: it has a length of ~10^7, and this code takes around 15 minutes to run. I need to run this for 124 other samples of similar size, so I would like to use multiprocessing to speed up the run time.

How exactly would I re-write my code using multiprocessing to get it to run faster?

I appreciate any help.

galaxygal

2 Answers


The Python standard library provides two options for multiprocessing: the modules multiprocessing and concurrent.futures. The second adds a layer of abstraction on top of the first. For simple map scenarios like yours, both are straightforward to use.

Here's something to experiment with:

import numpy as np
from time import time
from os import cpu_count
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def string_to_float(string):
    return np.array(np.asarray(string.split()), dtype=float)

if __name__ == '__main__':

    # Example datasplit
    rng = np.random.default_rng()
    num_strings = 100000
    datasplit = [' '.join(str(n) for n in rng.random(50))
                 for _ in range(num_strings)]

    # Looping (sequential processing)
    start = time()
    matrix = []
    for i in range(len(datasplit)):
        matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
    print(f'Duration of sequential processing: {time() - start:.2f} secs')

    # Setting up multiprocessing
    num_workers = int(0.8 * cpu_count())
    chunksize = max(1, int(len(datasplit) / num_workers))

    # Multiprocessing with Pool
    start = time()
    with Pool(num_workers) as p:
        matrix = p.map(string_to_float, datasplit, chunksize)
    print(f'Duration of parallel processing (Pool): {time() - start:.2f} secs')

    # Multiprocessing with ProcessPoolExecutor 
    start = time()
    with ProcessPoolExecutor(num_workers) as ppe:
        matrix = list(ppe.map(string_to_float, datasplit, chunksize=chunksize))
    print(f'Duration of parallel processing (PPE): {time() - start:.2f} secs')

You should play around with num_workers and, more importantly, with the chunksize variable. The values I've used here have worked well for me in quite a few situations. You can also let the system decide: those arguments are optional, but the results can be suboptimal, especially when the amount of data to be processed is large.
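
For comparison, a minimal sketch of the same calls with the optional chunksize left at its default (reusing string_to_float, num_workers and datasplit from above):

    # Pool.map computes a chunksize from the length of the iterable when none
    # is given; ProcessPoolExecutor.map defaults to chunksize=1, which sends
    # one string per task and adds a lot of inter-process overhead here.
    with Pool(num_workers) as p:
        matrix = p.map(string_to_float, datasplit)

    with ProcessPoolExecutor(num_workers) as ppe:
        matrix = list(ppe.map(string_to_float, datasplit))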

For 10 million strings (your range) and chunksize=10000 my machine produced the following results:

Duration of sequential processing: 393.78 secs
Duration of parallel processing (Pool): 73.76 secs
Duration of parallel processing (PPE): 85.82 secs

PS: Why do you use np.array(np.asarray(string.split()), dtype=float) instead of np.asarray(string.split(), dtype=float)?
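
If the double conversion is not intentional, the worker could be simplified to:

def string_to_float(string):
    # np.asarray converts the list of substrings to floats directly, so the
    # extra np.array call (and the intermediate string array) is unnecessary
    return np.asarray(string.split(), dtype=float)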

Timus

This will split your tasks across multiple cores and should speed up performance by roughly 4-8x, depending on your core count:

from multiprocessing import Pool
import os
import numpy as np

# The worker must be a module-level function: a lambda cannot be
# pickled and sent to the worker processes.
def to_float_array(x):
    return np.array(np.asarray(x.split()), dtype=float)

if __name__ == '__main__':
    # Add your data to the datasplit variable below:
    datasplit = []

    pool = Pool(os.cpu_count())
    results = pool.map(to_float_array, datasplit)

    pool.close()
    pool.join()

Serial Lazer
  • you don't need pool = Pool(os.cpu_count()); Pool() will always use all available CPU cores –  Jun 23 '22 at 08:19
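
That is, a minimal variant of the snippet above (reusing to_float_array and datasplit):

if __name__ == '__main__':
    with Pool() as pool:      # Pool() defaults to os.cpu_count() worker processes
        results = pool.map(to_float_array, datasplit)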