
I have two dataframes. links has two datetime columns, start_time and stop_time, and each row is an event. The other dataframe, sensors, has a DatetimeIndex with 1-minute frequency and ~600 columns, one per sensor id. For each row of links, I want to use the start_time and stop_time values as a time range to slice the matching rows from sensors, aggregate them by taking the mean across rows, and then concatenate the resulting single row of mean sensor values to the links dataframe horizontally. I have managed to do that with the following Pandas code. It works, but I have a lot of data and it is extremely slow.

def search_sensors(sensors, start_time, stop_time):
    # slice the rows inside the event's time range and average each sensor column
    s = sensors.loc[start_time:stop_time]
    return s.mean()

# add column names of sensor-ids
links[sensors.columns] = None

for index, row in links.iterrows():
    start_time = row['start_time']
    stop_time = row['stop_time']
    mean_sensors = search_sensors(sensors, start_time, stop_time)
    # use .loc here: sensors.columns are labels, not integer positions
    links.loc[index, sensors.columns] = mean_sensors.to_list()

I have already tried a few things with Dask, but with no luck:

  1. Using dask.delayed() with Pandas, I get a UserWarning: Large object of size 35.62 MiB detected in task graph:
    mean_sensors_list = []

    for index, row in links.iterrows():
        start_time = row['start_time']
        stop_time = row['stop_time']
        mean_sensors = dask.delayed(search_sensors)(sensors, start_time, stop_time)
        mean_sensors_list.append(mean_sensors)   # mean_sensors is a delayed object wrapping a pandas.Series of length 600

    results = dask.compute(*mean_sensors_list)
  2. Using dask.dataframe with the following code is as slow as Pandas, and I don't see any indication of parallelization in the Dask dashboard.
    sensors_dd = dd.from_pandas(sensors, npartitions=1)
    links_dd = dd.from_pandas(links, npartitions=1)

    mean_sensors_list = []

    for index, row in links_dd.iterrows():
        start_time = row['start_time']
        stop_time = row['stop_time']
        mean_sensors = search_sensors(sensors_dd, start_time, stop_time)
        mean_sensors_list.append(mean_sensors)   # mean_sensors is a dask Series of length 600

    results = dask.compute(*mean_sensors_list)
  3. Using both 1 and 2 together, i.e., mean_sensors = dask.delayed(search_sensors)(sensors_dd, start_time, stop_time): mean_sensors is a delayed object containing a dask Series of length 600, but execution is very slow. The dashboard shows some parallelization across 3 tasks (search_sensors, finalize, from_pandas), and the 4 workers show very low CPU usage. Additionally, Ubuntu shows a Low Disk Space warning while it runs.

1 Answer


I'm new to Dask and was not familiar with map_partitions(). The following solved the problem:

res = links_dd.map_partitions(
    lambda df: df.apply(
        lambda row: search_sensors(sensors, row.start_time, row.stop_time),
        axis=1,
    )
).compute()

Extremely fast on my 4-core laptop.
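For completeness, here is a minimal sketch of how the pieces fit together, assuming links and sensors are the pandas DataFrames from the question; the partition count, the meta argument, and the final concat back onto links are illustrative assumptions, not part of the one-liner above:

import dask.dataframe as dd
import pandas as pd

# split links into several partitions so map_partitions has chunks to run in parallel
# (npartitions=8 is an arbitrary illustrative choice)
links_dd = dd.from_pandas(links, npartitions=8)

# meta describes the expected result (one float column per sensor id),
# so Dask does not have to guess the output schema
meta = pd.DataFrame(columns=sensors.columns, dtype="float64")

res = links_dd.map_partitions(
    lambda df: df.apply(
        lambda row: search_sensors(sensors, row.start_time, row.stop_time),
        axis=1,
    ),
    meta=meta,
).compute()

# attach the per-event sensor means back onto links, as in the original loop
links = pd.concat([links, res], axis=1)

Note that npartitions=1, as in the question's attempt 2, gives Dask nothing to parallelize; with several partitions the per-event slicing can run concurrently across cores.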
