I am trying to run a large dataset of about 150 million rows through multiprocessing `Pool.starmap` in Python with pandas. The .csv file containing the dataset is about 18 GB.
# Multiprocessing pool
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
# pool.starmap applies my_function across the DataFrame.
# my_function returns two items per call.
# args is a list of argument tuples for my_function.
# starmap therefore returns a list of (my_function_return_1, my_function_return_2) tuples.
my_function_results = pool.starmap(my_function, args)
Below is the error that I see:
Process SpawnPoolWorker-1:
Traceback (most recent call last):
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 114, in worker
task = get()
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\queues.py", line 366, in get
res = self._reader.recv_bytes()
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 221, in recv_bytes
buf = self._recv_bytes(maxlength)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 323, in _recv_bytes
return self._get_more_data(ov, maxsize)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 342, in _get_more_data
assert left > 0
AssertionError
Traceback (most recent call last):
File "C:\Users\Dev\Desktop\My_Project\main.py", line 145, in <module>
my_function_results = pool.starmap(my_function , args)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 771, in get
raise self._value
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 537, in _handle_tasks
put(task)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "C:\Users\Dev\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 285, in _send_bytes
ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
OSError: [WinError 87] The parameter is incorrect
The machine has 128 GB of RAM and 8 CPU cores.
I am sending the entire dataset to the pool at once. Do I need to increase the RAM, or change the way I feed the dataset to the function?
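One change I am considering (sketched below with a toy DataFrame; `my_function` here is a simplified stand-in for my real worker) is slicing the frame into chunks before building `args`, so that each pickled task stays small. Is this the right direction?

```python
import pandas as pd

def my_function(chunk, factor):
    # Stand-in for my real worker; returns two items like the real one
    return chunk["x"].sum() * factor, len(chunk)

# Toy stand-in for the real 150M-row DataFrame
df = pd.DataFrame({"x": range(10)})

# Slice the frame into small pieces so each pickled task stays far
# below whatever per-message size limit the pipe imposes
chunk_size = 3
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
args = [(chunk, 2) for chunk in chunks]
```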
I am also curious what `_get_more_data` and `_send_bytes` in the traceback refer to.