
I have a Dask dataframe created with dd.read_csv("./*/file.csv"), where the * glob matches a folder for each date. In the concatenated dataframe I want to filter out subsets of time, as I would in pandas with, say, pd.between_time("09:30", "16:00").

Because Dask's internal representation of the index lacks the nice features of pandas's DatetimeIndex, I haven't had any success filtering the way I normally would in pandas. Short of resorting to a naive mapping function/loop, I am unable to get this to work in Dask.

Since the partitions are by date, perhaps that could be exploited by converting to a Pandas dataframe and then back to a Dask partition, but it seems like there should be a better way.


Updating with the example used in Angus' answer.


I guess I don't understand the logic of the queries in the answers/comments. Is pandas smart enough not to interpret the string in the boolean mask literally, and to do the correct datetime comparison instead?
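For reference, a quick pandas-only check (a minimal sketch, assuming current pandas parsing semantics) suggests the comparison is not a string comparison at all: pandas coerces the right-hand string to a Timestamp first, and a bare time string like "09:30" is parsed as 09:30 on the current date.

```python
import pandas as pd

# Comparing a DatetimeIndex against a string triggers datetime
# coercion, not lexical string comparison.
idx = pd.DatetimeIndex(["2020-06-18 08:00", "2020-06-18 10:15"])
print((idx > "2020-06-18 09:30").tolist())  # [False, True]

# A bare time string is parsed as that time on the *current* date,
# which is why a "09:30" comparison only matches today's rows.
print(pd.Timestamp("09:30").date() == pd.Timestamp.today().date())  # True
```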

Sargera

1 Answer


Filtering in Dask works just like pandas with a few convenience functions removed.

For example if you had the following data:

time,A,B
6/18/2020 09:00,29,0.330799201
6/18/2020 10:15,30,0.518081116
6/18/2020 18:25,31,0.790506469

The following code:

import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
df.loc[(df.index > "09:30") & (df.index < "16:00")].compute()

(if run on 18 June 2020) would return:

time,A,B
2020-06-18 10:15:00,30,0.518081

EDIT:

The above answer filters for the current date only, because pandas interprets a bare time string as a datetime on the current date. If you'd like to filter values between specific times across all days, one workaround is to strip the date from the datetime column:

import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time'])
df["time_of_day"] = dd.to_datetime(df["time"].dt.time.astype(str))
df.loc[(df.time_of_day > "09:30") & (df.time_of_day < "16:00")].compute()

Bear in mind there might be a speed penalty to this method, possibly a concern for larger datasets.

  • I used your csv file and exact load/query statements. My result is an empty data frame. I am using Pandas 1.0.1 if that makes a difference? – Sargera Jun 19 '20 at 11:53
  • See updated answer. Hadn't caught that you were looking to filter dataset for all days between two times. – quizzical_panini Jun 19 '20 at 22:24
  • Hm... That is really sad... Going from a well-prepared `timestamp-index` to a new column (based on `astype(str)`) and _then_ doing a `string compare`... That works... But it is not as you want to implement it. – gies0r Aug 18 '20 at 16:11
  • I suppose the other way to achieve the same end would be to use map_partitions and just use pandas' between_time directly on each partition. No reinvention of the wheel. – quizzical_panini Aug 21 '20 at 21:38