My Goal

I want to rewrite a synchronous loop using multiprocessing to reduce computation time (it loops over millions of rows):

def get_info(uid):
    # CallURL and tokn are defined elsewhere in the notebook
    res = requests.get(CallURL + uid,
                       headers={'Content-Type': 'application/json',
                                'Authorization': 'Bearer {}'.format(tokn)})
    unit = res.json().get('uniteLegale')  # parse the JSON body once
    foundation_date = unit.get('dateCreationUniteLegale')
    employee_count = unit.get('trancheEffectifsUniteLegale')
    current_data = unit.get('periodesUniteLegale')[0]  # most recent period
    company_type = current_data.get('categorieJuridiqueUniteLegale')
    trade = current_data.get('activitePrincipaleUniteLegale')
    return [uid, foundation_date, company_type, trade, employee_count]

df = pd.DataFrame({'uid':[],
                   'foundation_date': [],
                   'type': [],
                   'trade': [],
                   'employee_count':[]})

for i in Company.index:
    print(i)
    new = get_info(Company.uid[i])
    df.loc[len(df)] = new

What I have attempted so far

I used the concurrent.futures library to run the function on multiple rows at the same time.

with concurrent.futures.ProcessPoolExecutor() as executor: 
    rows = Company[:3].index #to avoid looping over millions rows for now
    results = [executor.map(get_info, Company.uid[row]) for row in rows]
for result in results:
    print(result)

This returns the following:

<generator object _chain_from_iterable_of_lists at 0x030757D0>
<generator object _chain_from_iterable_of_lists at 0x03075920>
<generator object _chain_from_iterable_of_lists at 0x03075DF0>

I then apply list(result)

The issue

I was expecting to get the list get_info() returns in these generator objects but I am getting this error instead:

---------------------------------------------------------------------------
BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-43-09db72840c98> in <module>
      4         results = [executor.map(get_info, Lots.Siren[row]) for row in rows]
      5     for result in results:
----> 6         for item in result:
      7             print(item)

~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
    482     careful not to keep references to yielded objects.
    483     """
--> 484     for element in iterable:
    485         element.reverse()
    486         while element:

~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in result_iterator()
    609                     # Careful not to keep a reference to the popped future
    610                     if timeout is None:
--> 611                         yield fs.pop().result()
    612                     else:
    613                         yield fs.pop().result(end_time - time.monotonic())

~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)

~\AppData\Local\Programs\Python\Python38-32\lib\concurrent\futures\_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

When I try a second time, the objects are empty.

I haven't been able to find a solution online.

Side notes

  1. I use Python 3.6.8 in a local Jupyter notebook on a Windows 10 device.
  2. This is my first post on Stack Overflow; I hope I did okay. Please suggest how I could improve my question.
  • Where is `get_info`? The whole problem may be inside this function. It seems to return a generator. – furas May 26 '22 at 01:05
  • I can't run it, but maybe you should simply use `list(result)` to get all values from the `generator` – furas May 26 '22 at 01:06
  • Or maybe you have to use `result.get()` or `result.result()` to get the value. You should check the documentation. – furas May 26 '22 at 01:07
  • Thank you for helping me clarify. I added `get_info` above, and yes my error is the result of `list(result)` – yellow-raven May 26 '22 at 01:15
  • `.get()` and `.result()` return an AttributeError – yellow-raven May 26 '22 at 01:18
  • I tested it on some values and it needs `next(result)` to get a value from the generator. But the main problem is that you use `map()` the wrong way. `map()` runs the function with different values WITHOUT a `for`-loop (and without `[ ]`), like `results = executor.map(get_info, uids)`, and this directly gives an iterator over all the results. For your code, `results = executor.map(get_info, Company[:3].uid)` would probably work – furas May 26 '22 at 08:50

1 Answer


To get a value from a generator you can use next():

for result in results:
    print( next(result) )

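But the main problem is that you are using map() the wrong way.

map() takes a function and an iterable of values, runs the function on each value separately, and returns an iterator over all the results, so it replaces both the for-loop and the []. In your version each map() call receives a single uid; assuming uid is a string (as CallURL+uid suggests), map() iterates over its characters and calls get_info once per character.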

with concurrent.futures.ProcessPoolExecutor() as executor: 
    results = executor.map(get_info, Company[:3].uid)
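
Since get_info() spends most of its time waiting for an HTTP response, a thread pool may be worth trying as well. The Python documentation notes that ProcessPoolExecutor does not work in the interactive interpreter because worker processes must be able to import the `__main__` module; whether that is what breaks the pool in a Jupyter notebook on Windows is an assumption here, not something confirmed in this thread. The sketch below swaps in ThreadPoolExecutor, with a stand-in get_info() that sleeps instead of calling the API, and `uids` and `max_workers=8` as illustrative values:

```python
import concurrent.futures
import time


def get_info(uid):
    # Stand-in for the real HTTP call; the sleep mimics network latency.
    time.sleep(0.05)
    return [uid, 'a', 'b', 'c', 'd']


uids = [1, 2, 3]  # hypothetical sample; in the question this would be Company[:3].uid

# Threads run inside the current process, so the function does not have
# to be pickled or re-imported by a separate worker process.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # map() yields results in the same order as the inputs.
    results = list(executor.map(get_info, uids))

for row in results:
    print(row)
```

The call shape is the same as with ProcessPoolExecutor, so switching between the two only requires changing the executor class.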

You could also try .apply() first, but note that pandas runs it sequentially in a single process, so it is unlikely to be faster than the loop.

results = Company[:3].uid.apply(get_info)

EDIT:

Example code which I used for tests

import pandas as pd
import concurrent.futures

# --- functions ---

def get_info(uid):
    print(f'uid: {uid}')
    return [uid, 'a', 'b', 'c', 'd']

# --- main ---

Company = pd.DataFrame({'uid':[1,2,3,4,5,6,8,9]})

# --- version 1 ---

print('\n--- version 1 ---\n')

df = pd.DataFrame({
    'uid':[],
    'foundation_date': [],
    'type': [],
    'trade': [],
    'employee_count':[]
})
                      
for i in Company[:3].uid:
    result = get_info(i)
    print(result)
    df.loc[len(df)] = result
    
print('--- df ---')

print(df)    

# --- version 2 ---

print('\n--- version 2 ---\n')

df = pd.DataFrame({
    'uid':[],
    'foundation_date': [],
    'type': [],
    'trade': [],
    'employee_count':[]
})

with concurrent.futures.ProcessPoolExecutor() as executor: 
    results = executor.map(get_info, Company[:3].uid)

results = list(results)  # map() returns a generator; list() collects
                         # all values so they can be used many times

for result in results:
    print(result)

df = pd.DataFrame(results, columns=['uid', 'foundation_date', 'type', 'trade', 'employee_count'])

print('--- df ---')

print(df)

# --- version 3 ---

print('\n--- version 3 ---\n')

results = Company[:3].uid.apply(get_info)
print(results)

df = pd.DataFrame(results.to_list(), columns=['uid', 'foundation_date', 'type', 'trade', 'employee_count'])

print('--- df ---')

print(df)    
furas
  • Thank you for your time and for sharing your knowledge. Version 2 still gives me a 'BrokenProcessPool' error. Version 3 works and I will need to test if it is quicker than my loop. – yellow-raven May 26 '22 at 21:42
  • Version 1 and 3 have the same performance - 6.8 seconds for ten rows. – yellow-raven May 26 '22 at 21:51
  • See the documentation for [BrokenProcessPool](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.process.BrokenProcessPool). Maybe one of your `requests` calls failed and that causes the problem, so you may have to run the code in `try/except` – furas May 26 '22 at 21:54
  • I did the `try/except` yesterday as I decided to go forward with the loop for now (there were indeed a few 404 returned). But there should be something else because I still have this `BrokenProcessPool` issue that I need to investigate further. Could be Jupyter, my version of python, or windows maybe. I'll try a couple of things tomorrow. Thank you for the link to the documentation, I'll read that first. – yellow-raven May 27 '22 at 01:11
  • I have no idea what the problem can be. The error doesn't show any details, so it is hard to say what really causes it. – furas May 27 '22 at 01:59
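
The `try/except` idea from the comments can be sketched as a small wrapper around get_info(). An ordinary exception raised inside the function is re-raised when the result is read from map(), whereas BrokenProcessPool means a worker process itself died, so this wrapper addresses the 404 cases rather than the broken pool. Everything here is illustrative: `safe_get_info` is a hypothetical helper, get_info() is a stand-in that fails for one uid, and a thread pool is used to keep the example self-contained.

```python
import concurrent.futures


def get_info(uid):
    # Hypothetical stand-in: fails for one uid to mimic a 404 response.
    if uid == 2:
        raise ValueError(f'no record for uid {uid}')
    return [uid, 'a', 'b', 'c', 'd']


def safe_get_info(uid):
    # Hypothetical wrapper: report the failure and return a placeholder
    # row so that one bad uid does not stop the whole run.
    try:
        return get_info(uid)
    except Exception as exc:
        print(f'uid {uid} failed: {exc}')
        return [uid, None, None, None, None]


uids = [1, 2, 3]  # illustrative sample values

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    rows = list(executor.map(safe_get_info, uids))

for row in rows:
    print(row)
```

Rows with `None` fields can then be filtered out or retried after the DataFrame is built.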