
I am trying to process some very large time-series data using concurrent.futures.ProcessPoolExecutor(). The dataset contains multiple independent time series. The entire dataset is available in a list of tuples, data, which I pass to a helper function as follows:

import concurrent.futures

def help_func(daa):
    # daa is one (value, timestamp index, column name) tuple
    return large_function(daa[0], daa[1], daa[2])

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(help_func, data, chunksize=1)
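
For concreteness, a simplified illustration of how data is built (two dummy columns here; the real DataFrame has one column per time series, and the ordering shown is only illustrative):

import pandas as pd

df = pd.DataFrame({"series_a": [0.1, 0.2, 0.3], "series_b": [1.1, 1.2, 1.3]})

# One tuple per (value, timestamp index, column name). Time-major ordering is
# shown here so that the calls for each series arrive in time order.
data = [
    (df[col][t], t, col)
    for t in range(len(df))
    for col in df.columns
]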

Now, although the different time series contained in data are independent of each other, due to the nature of time-series data, the values within a single time series need to be handled one after the other. By ordering the data variable in terms of the different time series, I make sure that map always makes the calls sequentially over time.

With executor.map I cannot figure out a way to always map a particular time series to the same core, or to somehow share the state from previous calls with a process running on a new core.

With the current setup, whenever the processing for a particular timestamp is called on a new core, it starts from the initialization step.

Is there any elegant solution to this issue?

jamie
  • So is `daa` a single time-series? `large_function` should run each time on a single core, so it should work fine as-is in that case – lxop Jul 04 '18 at 10:59
  • `daa` is a single timestamp out of a timeseries, which is one among several within `data` – jamie Jul 04 '18 at 11:02
  • I'm a little fuzzy on the layout of your data. What exactly does `data` look like? Can you show a small example? – lxop Jul 04 '18 at 11:05
  • The original timeseries is contained in a dataframe, from which I make `data` as follows: `data = [(df['columnname'][val], val, columnname) for val in range(len(df))]`. I have multiple classes/instances (one per time-series/columnname), which are then called with the specific timestamp. – jamie Jul 04 '18 at 11:11
  • So you have an instance of `data` for each timeseries? And within each timeseries you need to process each value sequentially (ordered by time)? – lxop Jul 04 '18 at 12:03
  • I call different instances of a class through large_function(). Yes, I have one instance of the class which does processing per time-series. When looking at one of these instances, the calls must arrive sequentially. I need to either map an instance to a single core, or somehow remember and pass the right values if the calls will be across cores. – jamie Jul 05 '18 at 06:46
  • It sounds like you are applying concurrency where it doesn't belong. Rather than mapping out individual timestamps from all of the timeseries, just map out the entire timeseries (i.e., if you have 5 timeseries, each with 1000 timestamps, then you will have 5 jobs, not 5000). Just call large_function in a loop - that way it will be on a single core and each timestamp will be processed sequentially (a rough sketch of this appears after the comments). – lxop Jul 05 '18 at 13:36
  • I agree with your comment. That would be the computationally easy approach too. However, I am building something that will run on live data. So at every new point in time, I am going to have one new datapoint in roughly 10000+ time series. So the group-based approach would be too expensive and repetitive. – jamie Jul 05 '18 at 13:45
  • 1
    Well that's quite a different situation to what you've described. It sounds more like you need to track some state per timeseries in some form of shared data and pull data points from a queue of some sort. The shared state would include both the data state (whatever that is in your case) and the most recently processed timestamp. Workers should pull readings from the queue, check if they are the next reading for the timeseries, put it back in the queue if not, otherwise process the reading – lxop Jul 06 '18 at 16:58
  • Thanks. I will start thinking in terms of this suggestion. With regard to the original question, are there no handles in python to send processes to a certain core? My searches so far have not found anything good here... – jamie Jul 09 '18 at 09:03
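
A rough sketch of the one-job-per-series approach suggested in the comments. All names here (process_series, SeriesState, the dummy data) are illustrative, not from the original post:

import concurrent.futures

# Illustrative stand-in for the per-series class mentioned in the comments.
class SeriesState:
    def __init__(self, name):
        self.name = name
        self.total = 0.0                 # whatever running state the real class keeps

    def update(self, value, t):
        self.total += value              # must see readings strictly in time order

def process_series(args):
    name, values = args                  # one whole time series per job
    state = SeriesState(name)            # initialization happens once per series
    for t, value in enumerate(values):
        state.update(value, t)           # sequential within the series, on one worker
    return name, state.total

if __name__ == "__main__":
    series = {
        "series_a": [0.1, 0.2, 0.3],
        "series_b": [1.1, 1.2, 1.3],
    }
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for name, total in executor.map(process_series, series.items()):
            print(name, total)

With 5 series of 1000 timestamps each, this submits 5 jobs rather than 5000, so each series is initialized once and processed in order on a single worker.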
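And a very rough sketch of the queue-plus-shared-state idea from the last suggestion, for the live-data case. Again, every name is illustrative; a real implementation would need backoff instead of busy requeueing, error handling, and so on:

import multiprocessing as mp

def process_reading(name, t, value):
    print(f"{name}[{t}] = {value}")      # stand-in for the real per-series work

def worker(queue, next_t, lock):
    # next_t maps series name -> the timestamp index allowed to run next,
    # i.e. one past the most recently completed reading of that series.
    while True:
        item = queue.get()
        if item is None:                 # sentinel: shut down
            queue.put(None)              # re-post so the other workers stop too
            break
        name, t, value = item
        with lock:
            ready = (t == next_t[name])
        if not ready:
            queue.put(item)              # not this series' turn yet; requeue it
            continue
        process_reading(name, t, value)  # at most one reading per series in flight
        with lock:
            next_t[name] = t + 1         # unblock the series' next reading

if __name__ == "__main__":
    manager = mp.Manager()
    queue = manager.Queue()              # a live feed would push readings here
    next_t = manager.dict({"series_a": 0, "series_b": 0})
    lock = manager.Lock()

    for t in range(3):
        queue.put(("series_a", t, 0.1 * t))
        queue.put(("series_b", t, 1.0 + t))
    queue.put(None)

    workers = [mp.Process(target=worker, args=(queue, next_t, lock)) for _ in range(2)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

Note that this enforces per-series ordering through the shared next_t state rather than by pinning any process to a particular core.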

0 Answers