
I am trying to run a large dataset of about 150 million rows through Pool.starmap multiprocessing in Python with pandas. The .csv file containing the dataset is about 18 GB.

import multiprocessing

# Multiprocessing pool, leaving one core free
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)

# Pool.starmap applies my_function across the DataFrame.
# my_function returns two items per call, and args is a list of argument tuples,
# so the result is a list of tuples (my_function_return_1, my_function_return_2).
my_function_results = pool.starmap(my_function, args)
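
For context, this is a simplified sketch of how args is built before the call above (the column names and the body of my_function are placeholders, not my real ones):

import pandas as pd

def my_function(col_a_value, col_b_value):
    # Placeholder for the real computation; it returns two items
    return col_a_value * 2, str(col_b_value)

# Read the full 18 GB CSV into a single DataFrame
df = pd.read_csv("my_large_file.csv")

# Build one argument tuple per row
args = list(zip(df["col_a"], df["col_b"]))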

Below is the error that I see:

Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 114, in worker
    task = get()
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\queues.py", line 366, in get
    res = self._reader.recv_bytes()
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 323, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 342, in _get_more_data
    assert left > 0
AssertionError
Traceback (most recent call last):
  File "C:\Users\Dev\Desktop\My_Project\main.py", line 145, in <module>
Traceback (most recent call last):
  File "C:\Users\Dev\Desktop\My_Project\main.py", line 145, in <module>
    my_function_results = pool.starmap(my_function , args)
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 537, in _handle_tasks
    put(task)
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 285, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
OSError: [WinError 87] The parameter is incorrect

The machine has 128 GB of RAM.
The CPU has 8 cores.

I am sending the entire dataset to multiprocessing at once. Do I need to increase the RAM, or should I change the way I feed the dataset to the function?
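
For example, would something along these lines be the right direction? This is an untested sketch where I read and dispatch the CSV in chunks instead of sending everything at once (the chunk size, file name, column names, and my_function body are placeholders):

import multiprocessing

import pandas as pd

def my_function(col_a_value, col_b_value):
    # Placeholder for the real computation; it returns two items
    return col_a_value * 2, str(col_b_value)

if __name__ == "__main__":
    results = []
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        # Read the CSV in chunks of 1,000,000 rows instead of loading all 18 GB at once
        for chunk in pd.read_csv("my_large_file.csv", chunksize=1_000_000):
            chunk_args = list(zip(chunk["col_a"], chunk["col_b"]))
            results.extend(pool.starmap(my_function, chunk_args))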

I am also curious what these functions appearing in the error log do: _get_more_data and _send_bytes.
