I've found some hypothetical and theoretical posts related to this question. The closest one is here, but its answer deals with the opposite of what I believe I need help with (I mention the link in case it helps anyone else).
I obtained the following code from a wiki on GitHub, here. Its implementation seemed pretty straightforward; however, I haven't been able to use it in its original form.
Here's the 'Process' code I'm using:
import dask.dataframe as dd
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt
gd = gdelt.gdelt(version=2)
e = ProcessPoolExecutor()
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except:
        pass
results = list(e.map(getter,pd.date_range('2015 Apr 21','2018 Apr 21')))
Here's the full error:
BrokenProcessPool Traceback (most recent call last)
<ipython-input-1-874f937ce512> in <module>()
21
22 # now pull the data; this will take a long time
---> 23 results = list(e.map(getter,pd.date_range('2015 Apr 21','2018 Apr 21')))
24
25
C:\Anaconda3\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
364 careful not to keep references to yielded objects.
365 """
--> 366 for element in iterable:
367 element.reverse()
368 while element:
C:\Anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
C:\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
C:\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
*BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.*
Any ideas on how to resolve this error? If I switch from ProcessPoolExecutor to ThreadPoolExecutor, the problem seems to go away (though I haven't run the full date range, so I can't be entirely sure). However, I believe ProcessPoolExecutor would give me a quicker result.
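For reference, here is a minimal sketch of the thread-based variant mentioned above, assuming the work is mostly network-bound. `fetch_day` and `run_all` are hypothetical names that are not part of the wiki code, and `fetch_day` is only a stand-in for the `gd.Search(...)` and `to_csv(...)` calls so that the example runs without the gdelt package. Unlike the bare `except: pass` in my code, it records which dates failed and why. The Python documentation also notes that ProcessPoolExecutor needs the `__main__` module to be importable by the worker processes and so does not work in an interactive interpreter, which may be relevant since the traceback comes from an IPython session.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import datetime as dt


def fetch_day(day):
    """Hypothetical stand-in for gd.Search(...) followed by to_csv(...)."""
    if day.weekday() == 6:  # pretend Sundays fail, to show error reporting
        raise ValueError("no data for {}".format(day))
    return day.strftime("%Y%m%d")


def run_all(days, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Run fetch_day for every day; return (succeeded, failed) instead of hiding errors."""
    succeeded, failed = [], {}
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to its date so failures can be attributed.
        futures = {pool.submit(fetch_day, d): d for d in days}
        for fut in as_completed(futures):
            day = futures[fut]
            try:
                succeeded.append(fut.result())
            except Exception as exc:
                failed[day] = repr(exc)
    return sorted(succeeded), failed


if __name__ == "__main__":
    days = [dt.date(2015, 4, 21) + dt.timedelta(days=i) for i in range(7)]
    ok, bad = run_all(days)
    print(len(ok), "succeeded;", len(bad), "failed")
```

The `if __name__ == "__main__":` guard is there so that the same file could, in principle, be saved as a script and tried with `executor_cls=ProcessPoolExecutor`; I have not verified that this resolves the BrokenProcessPool error with the real gdelt calls.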
Ultimately, I'll be using dask to work with the data in Pandas. Thanks in advance.