I am trying to process some very large time-series data using concurrent.futures.ProcessPoolExecutor(). The dataset contains multiple independent time series. The entire dataset is available in a list of tuples, data,
that I pass through a helper function as follows:
def help_func(daa):
    # Unpack one tuple and forward it to the worker function.
    return large_function(daa[0], daa[1], daa[2])

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(help_func, data, chunksize=1)
Now, although the different time series contained in data are independent of one another, due to the nature of time-series data, the values within a single series need to be handled one after the other. By ordering the data
variable chronologically within each series, I make sure that map always dispatches the calls in time order.
With executor.map,
I cannot figure out a way to always map a particular time series to the same worker process, or to share the state built up from previously processed values with a process running on a new core.
With the current setup, whenever the processing for a particular time series is picked up by a new core, it starts over from the initialization step.
Is there any elegant solution to this issue?