I'll make a few assumptions, but my guess is that the data is 'somewhat' sorted. So you might have file partitions that are specific to a day or a week or maybe an hour if you are working with high-frequency data. This means that you can do sorting within those partitions, which is often a more manageable task.
If this guess is wrong, then it might be a good idea to incur the fixed cost of sorting (and persisting) the data since it will speed up your downstream analysis.
Since you have only one large file and it's not very big (25GB should be manageable if you have access to a cluster), the best thing might be to load it into memory with regular pandas, sort it, and save the data partitioned on dates/expirations/tickers (if available) or some other column division that makes sense for your downstream analysis.
It might be possible to reduce the memory footprint by using appropriate dtypes; for example, the strike, type and expiration columns might take less space as categories (vs strings).
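For concreteness, here is a minimal sketch of that in-memory route; the file names and the column names ('date', 'type', 'strike', 'expiration') are just placeholders, and the parquet output assumes pyarrow is installed:

import pandas as pd

# categories for repetitive string columns keep the memory footprint down
df = pd.read_csv(
    'big_file.csv',
    dtype={'type': 'category', 'strike': 'category', 'expiration': 'category'},
    parse_dates=['date'],  # hypothetical trade/quote date column
)

# sort once up front, then persist partitioned by date so downstream tools
# (dask included) see one partition per date
df = df.sort_values('date')
df.to_parquet('options_partitioned', partition_cols=['date'])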
If there is no way at all to load it into memory at once, then it's possible to iterate over chunks of rows with pandas and save the relevant bits to smaller files; here's a rough sketch:
import os
import pandas as pd

# read the file in chunks instead of loading everything at once
chunks = pd.read_csv('big_file', iterator=True, chunksize=10**4)
for rows in chunks:
    # split each chunk into smaller sets based on some logic
    for group_label, group_df in rows.groupby(['type', 'strike']):
        # group_label is a tuple of the grouping values
        out_path = '_'.join(map(str, group_label)) + '.csv'
        # mode is append, so only write the header when the file is new
        group_df.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)
Now the above might sound weird, since the question is tagged with dask and I'm focusing on pandas, but the idea is to save time downstream by partitioning the data on the relevant variables. This is probably achievable with dask as well, but in my experience in situations like these I would run into memory problems due to data shuffling among workers. Of course, if in your situation there were many files rather than one, then some parallelisation with dask.delayed would be helpful.
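In that many-files case, the pattern could look roughly like this (the file names and the per-file conversion logic are just placeholders):

import dask
import pandas as pd

@dask.delayed
def convert_one(path):
    # placeholder logic: read one file and write it back out as parquet
    df = pd.read_csv(path)
    df.to_parquet(path.replace('.csv', '.parquet'))

# hypothetical list of per-day input files, processed in parallel
tasks = [convert_one(p) for p in ['day_01.csv', 'day_02.csv', 'day_03.csv']]
dask.compute(*tasks)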
Now, after you partition/index your data, dask will work great when operating on the many smaller chunks. For example, if you partitioned the data by date and your downstream analysis primarily uses dates, then operations like groupby and shift will be very fast, because the workers will not need to check with each other whether they have overlapping dates, so most processing will occur within partitions.
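As a rough sketch of that downstream step, assuming the data was written out partitioned by date as above and that there are hypothetical 'volume' and 'close' columns:

import dask.dataframe as dd

# read the partitioned dataset back in; each date directory becomes a partition
ddf = dd.read_parquet('options_partitioned')

# a per-date aggregation mostly stays inside partitions, so little data moves
daily_volume = ddf.groupby('date')['volume'].sum().compute()

# a shift applied per partition via map_partitions never crosses date boundaries
ddf = ddf.map_partitions(lambda pdf: pdf.assign(prev_close=pdf['close'].shift()))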