
While I've found some hypothetical and theoretical posts relating to this question, the closest I've found is here, and the posted answer deals with the opposite of what I believe I'm looking for help on (just in case that link helps anyone else).

I obtained the following code from a wiki on GitHub, here. Its implementation seemed pretty straightforward; however, I haven't been able to use it in its original form.

Here's the 'Process' code I'm using:

import dask.dataframe as dd

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

gd = gdelt.gdelt(version=2)

e = ProcessPoolExecutor()

def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date),encoding='utf-8',index=False)
    except:
        pass

results = list(e.map(getter,pd.date_range('2015 Apr 21','2018 Apr 21')))

Here's the full error:

BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-1-874f937ce512> in <module>()
     21 
     22 # now pull the data; this will take a long time
---> 23 results = list(e.map(getter,pd.date_range('2015 Apr 21','2018 Apr 21')))
     24 
     25 

C:\Anaconda3\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
    364     careful not to keep references to yielded objects.
    365     """
--> 366     for element in iterable:
    367         element.reverse()
    368         while element:

C:\Anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

C:\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

C:\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

*BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.*

Any ideas as to how to resolve this error? If I change the ProcessPoolExecutor to a ThreadPoolExecutor, the problem seems to go away, though I haven't run the whole dataset through, so I can't be entirely sure. However, I believe the ProcessPoolExecutor would finish more quickly.

Ultimately, I'll be using dask to work with the data in Pandas. Thanks in advance.

alofgran
  • @Linwoodc3 - I pulled this code from your GitHub. Phenomenal wiki for it, by the way (super-thorough, clear, clean, etc.), but because I'm new to the Python world, I think I'm missing something here. Would you mind taking a look at the problem above? – alofgran Apr 29 '18 at 04:22

1 Answer


Examples in the documentation always show execution within an if __name__ == '__main__' clause. Hopefully this MCVE accurately mimics your use case:

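from concurrent.futures import ProcessPoolExecutor

# stand-in for the real gdelt search
def gd(s):
    return s*3

def getter(w):
    return gd(w)

data = list('abcdefg')

def main():
    # the pool is created and used only when main() is called
    with ProcessPoolExecutor(max_workers=4) as executor:
        for thing in executor.map(getter, data):
            print(thing)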

Executing it like this works:

#main()
if __name__ == '__main__':
    main()

But executing it like this does not; it throws the BrokenProcessPool error:

main()
if __name__ == '__main__':
    #main()
    pass

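Try making sure that the line results = list(e.map(getter,pd.date_range(...))) runs only under an if __name__ == '__main__' guard, so that it executes in the *__main__* process. On Windows each worker process imports the main module again, and any unguarded top-level code is re-run in every worker.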
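Applied to a script shaped like yours, the guard might look roughly like the sketch below. It is only a sketch under assumptions: fetch_day is a hypothetical stand-in for gd.Search(date, coverage=True) so that it runs without gdelt or a network connection, and the three-day range replaces your pd.date_range call. It also returns any exception from the worker instead of swallowing it with a bare except, which should make failures easier to see.

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import date, timedelta


def fetch_day(day_str):
    # Hypothetical stand-in for gd.Search(day_str, coverage=True);
    # it only returns the length of the date string.
    return len(day_str)


def getter(day):
    # Runs in a worker process. The error is returned to the parent
    # instead of being silently discarded.
    day_str = day.strftime('%Y%m%d')
    try:
        return day_str, fetch_day(day_str), None
    except Exception as exc:
        return day_str, None, repr(exc)


def main():
    # Short illustrative range; the original uses pd.date_range(...).
    days = [date(2015, 4, 21) + timedelta(days=i) for i in range(3)]
    # The pool is created inside main(), so importing this module
    # never starts worker processes.
    with ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(getter, days))


if __name__ == '__main__':
    # Only the parent process gets here; workers import the module
    # under a different __name__ and skip this block.
    for day_str, result, error in main():
        print(day_str, result, error)
```

One caveat, as far as I can tell from the traceback: it shows the code running from an IPython input cell. On Windows, worker processes generally cannot import functions that were defined interactively, so the guard alone may not be enough there. Saving the code to a .py file and running or importing it is the usual workaround.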

wwii