I have two dataframes. The first, `links`, has two datetime columns, `start_time` and `stop_time`, and each row is an event. The second, `sensors`, is indexed with a DatetimeIndex of 1-minute frequency and has ~600 columns, one per sensor ID. For each `links` row, I want to use the `start_time` and `stop_time` values as a time range to slice the matching rows from `sensors`, aggregate them across rows by taking the mean, and then concatenate the resulting single row of mean sensor values to the `links` dataframe horizontally. I have managed to do that with the following Pandas code, and it works, but I have a lot of data and it is extremely slow:
    def search_sensors(sensors, start_time, stop_time):
        # slice the rows whose index falls within [start_time, stop_time]
        s = sensors[start_time:stop_time]
        # column-wise mean -> a single Series with one value per sensor
        s = s.mean()
        return s

    # add a column per sensor ID, initially empty
    links[sensors.columns] = None

    for index, row in links.iterrows():
        start_time = row['start_time']
        stop_time = row['stop_time']
        mean_sensors = search_sensors(sensors, start_time, stop_time)
        # .loc, not .iloc: sensors.columns are labels, not positions
        links.loc[index, sensors.columns] = mean_sensors.to_list()
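For reference, here is a minimal setup that reproduces the shapes described above (the sizes, dates, and column names are made-up placeholders, not my real data):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # one week of 1-minute readings for 600 sensors (placeholder sizes)
    idx = pd.date_range("2023-01-01", periods=7 * 24 * 60, freq="1min")
    sensors = pd.DataFrame(
        rng.normal(size=(len(idx), 600)),
        index=idx,
        columns=[f"sensor_{i}" for i in range(600)],
    )

    # 100 events, each a 30-minute window (placeholder sizes)
    starts = idx[rng.integers(0, len(idx) - 30, size=100)]
    links = pd.DataFrame({
        "start_time": starts,
        "stop_time": starts + pd.Timedelta("30min"),
    })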
I have already tried a few approaches with Dask, but with no luck so far:
- Using `dask.delayed()` with Pandas, I get a `UserWarning: Large object of size 35.62 MiB detected in task graph`:
    import dask

    mean_sensors_list = []
    for index, row in links.iterrows():
        start_time = row['start_time']
        stop_time = row['stop_time']
        # mean_sensors is a Delayed object wrapping a pandas.Series of shape (600,)
        mean_sensors = dask.delayed(search_sensors)(sensors, start_time, stop_time)
        mean_sensors_list.append(mean_sensors)
    results = dask.compute(*mean_sensors_list)
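The full warning suggests scattering large objects ahead of time with `client.scatter`. If I understand it correctly, that would look roughly like this (untested sketch; `Client()` with default settings stands in for whatever cluster setup is appropriate):

    import dask
    from dask.distributed import Client

    client = Client()  # placeholder: a default local cluster

    # ship the big sensors frame to the workers once, instead of
    # embedding a ~35 MiB object in every task of the graph
    sensors_future = client.scatter(sensors, broadcast=True)

    mean_sensors_list = []
    for index, row in links.iterrows():
        mean_sensors = dask.delayed(search_sensors)(
            sensors_future, row['start_time'], row['stop_time'])
        mean_sensors_list.append(mean_sensors)

    results = dask.compute(*mean_sensors_list)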
- Using `dask.dataframe` with the following code is as slow as plain Pandas, and I don't see any signs of parallelization in the Dask dashboard:
    import dask
    import dask.dataframe as dd

    sensors_dd = dd.from_pandas(sensors, npartitions=1)
    links_dd = dd.from_pandas(links, npartitions=1)

    mean_sensors_list = []
    for index, row in links_dd.iterrows():
        start_time = row['start_time']
        stop_time = row['stop_time']
        # mean_sensors is a lazy dask Series with one mean per sensor (~600 values)
        mean_sensors = search_sensors(sensors_dd, start_time, stop_time)
        mean_sensors_list.append(mean_sensors)
    results = dask.compute(*mean_sensors_list)
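My suspicion is that `npartitions=1` gives Dask a single partition and therefore nothing to split across workers. A sketch of what I believe the partitioned version should look like (untested; the partition count of 8 is arbitrary):

    import dask.dataframe as dd

    # split sensors across several partitions so the slice + mean work
    # can actually be scheduled on multiple workers
    sensors_dd = dd.from_pandas(sensors, npartitions=8)

    # label-based slicing on the sorted DatetimeIndex still works, since
    # from_pandas records the partition divisions
    start_time = links.loc[0, 'start_time']
    stop_time = links.loc[0, 'stop_time']
    mean_one_event = sensors_dd.loc[start_time:stop_time].mean().compute()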
- Combining the two approaches above, i.e. `mean_sensors = dask.delayed(search_sensors)(sensors_dd, start_time, stop_time)`: `mean_sensors` is then a Delayed object wrapping a lazy, 600-element dask Series, but execution is still very slow. The dashboard shows some parallelization across 3 kinds of tasks (search_sensors, finalize, from_pandas), yet the 4 workers sit at very low CPU usage. Additionally, Ubuntu shows a "Low Disk Space" warning while it runs.
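One more idea I have been considering but have not benchmarked: the per-row tasks may simply be too small, so scheduler overhead dominates. Batching many (start, stop) windows into each delayed task would shrink the graph (sketch; the batch size of 50 is arbitrary):

    import dask

    def search_sensors_batch(sensors, windows):
        # one task handles a whole batch of (start, stop) windows,
        # so the graph contains far fewer, larger tasks
        return [sensors[start:stop].mean() for start, stop in windows]

    windows = list(zip(links['start_time'], links['stop_time']))
    batch_size = 50

    tasks = [
        dask.delayed(search_sensors_batch)(sensors, windows[i:i + batch_size])
        for i in range(0, len(windows), batch_size)
    ]
    results = dask.compute(*tasks)  # tuple of lists of pandas Series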