My Goal
I want to rewrite a synchronous loop using multiprocessing to reduce its run time (it loops over millions of rows):
import json

import pandas as pd
import requests

# CallURL and tokn are defined elsewhere in the notebook.
def get_info(uid):
    res = requests.get(CallURL+uid,
                       headers={'Content-Type':'application/json',
                                'Authorization': 'Bearer {}'.format(tokn)})
    foundation_date = json.loads(res.text).get('uniteLegale').get('dateCreationUniteLegale')
    employee_count = json.loads(res.text).get('uniteLegale').get('trancheEffectifsUniteLegale')
    current_data = json.loads(res.text).get('uniteLegale').get('periodesUniteLegale')[0]
    company_type = current_data.get('categorieJuridiqueUniteLegale')
    trade = current_data.get('activitePrincipaleUniteLegale')
    return [uid, foundation_date, company_type, trade, employee_count]

df = pd.DataFrame({'uid':[],
                   'foundation_date': [],
                   'type': [],
                   'trade': [],
                   'employee_count':[]})

for i in Company.index:
    print(i)
    new = get_info(Company.uid[i])
    df.loc[len(df)] = new
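For reference, the field extraction above can be separated from the HTTP call, which makes it possible to decode the JSON once per response and to collect rows in a plain list. This is only a minimal sketch: `parse_info`, `COLUMNS`, and the sample payload are hypothetical names and made-up values, and the payload is assumed to have the same shape as the API response used above.

```python
COLUMNS = ['uid', 'foundation_date', 'type', 'trade', 'employee_count']

def parse_info(uid, payload):
    """Extract the fields of interest from an already-decoded JSON payload."""
    unit = payload.get('uniteLegale') or {}
    # Fall back to an empty period so a missing list does not raise.
    periods = unit.get('periodesUniteLegale') or [{}]
    current_data = periods[0]
    return [uid,
            unit.get('dateCreationUniteLegale'),
            current_data.get('categorieJuridiqueUniteLegale'),
            current_data.get('activitePrincipaleUniteLegale'),
            unit.get('trancheEffectifsUniteLegale')]

# Made-up payload for illustration only; real values would come from the API.
sample_payload = {
    'uniteLegale': {
        'dateCreationUniteLegale': '2000-01-01',
        'trancheEffectifsUniteLegale': '11',
        'periodesUniteLegale': [
            {'categorieJuridiqueUniteLegale': '5710',
             'activitePrincipaleUniteLegale': '62.01Z'},
        ],
    }
}

# Collect rows in a list instead of appending to a DataFrame one row at a time.
rows = [parse_info('000000000', sample_payload)]
```

With pandas available, `pd.DataFrame(rows, columns=COLUMNS)` would then build the whole frame in a single step.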
What I have attempted so far
I used the concurrent.futures library to run the function on multiple rows at the same time.
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    rows = Company[:3].index  # to avoid looping over millions of rows for now
    results = [executor.map(get_info, Company.uid[row]) for row in rows]
    for result in results:
        print(result)
This prints the following:
<generator object _chain_from_iterable_of_lists at 0x030757D0>
<generator object _chain_from_iterable_of_lists at 0x03075920>
<generator object _chain_from_iterable_of_lists at 0x03075DF0>
I then call list(result) on each generator to retrieve its values.
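A small self-contained check of how `executor.map` treats its second argument may be relevant here. It expects an iterable and calls the function once per element, so a single string is consumed character by character, and the object it returns is a lazy iterator that can only be consumed once. The sketch below uses `ThreadPoolExecutor` and `str.upper` purely as stand-ins so that it runs anywhere; the uid values are invented.

```python
from concurrent.futures import ThreadPoolExecutor

uid = "123"  # invented stand-in for a single value of Company.uid

with ThreadPoolExecutor(max_workers=2) as executor:
    # A string is an iterable of characters, so the function runs once per character.
    per_character = executor.map(str.upper, uid)
    first_pass = list(per_character)
    second_pass = list(per_character)  # the iterator is already exhausted

    # Passing a list of uids calls the function once per uid instead.
    per_uid = list(executor.map(str.upper, [uid, "456"]))

print(first_pass, second_pass, per_uid)
```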
The issue
I expected each generator to yield the list returned by get_info(), but I get this error instead:
---------------------------------------------------------------------------
BrokenProcessPool Traceback (most recent call last)
<ipython-input-43-09db72840c98> in <module>
4 results = [executor.map(get_info, Lots.Siren[row]) for row in rows]
5 for result in results:
----> 6 for item in result:
7 print(item)
~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
482 careful not to keep references to yielded objects.
483 """
--> 484 for element in iterable:
485 element.reverse()
486 while element:
~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in result_iterator()
609 # Careful not to keep a reference to the popped future
610 if timeout is None:
--> 611 yield fs.pop().result()
612 else:
613 yield fs.pop().result(end_time - time.monotonic())
~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433
434 self._condition.wait(timeout)
~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
When I iterate over the generators a second time, they are empty.
I haven't been able to find a solution online.
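One direction that might be worth trying, sketched under assumptions rather than tested against the real API: the work is network-bound, so a thread pool may be sufficient. It also sidesteps the requirement in the `concurrent.futures` documentation that the `__main__` module be importable by `ProcessPoolExecutor` workers, which is why process pools do not work in the interactive interpreter; a notebook on Windows may be affected in the same way. `fake_get_info` is a hypothetical stand-in for `get_info()` so that the sketch runs without network access, and the uids and `max_workers` value are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_get_info(uid):
    """Hypothetical stand-in for get_info(): returns a row without any HTTP call."""
    return [uid, None, None, None, None]

uids = ["111", "222", "333"]  # invented values standing in for Company.uid[:3]

with ThreadPoolExecutor(max_workers=8) as executor:
    # map() receives the whole iterable of uids and yields results in input order.
    results = list(executor.map(fake_get_info, uids))

for row in results:
    print(row)
```

Note that `map` is called once with all uids, rather than once per row inside a list comprehension.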
Side notes
- I use Python 3.6.8 in a local Jupyter notebook on a Windows 10 device.
- This is my first post on Stack Overflow; I hope I did okay. Please suggest how I could improve my question.