
I have a large dataset in the form of the following dataframe, which I previously loaded from Avro files:

timestamp                  id  category  value
2021-01-01 00:00:00+00:00  a   d         g
2021-01-01 00:10:00+00:00  a   d         h
2021-01-01 00:10:00+00:00  a   e         h
2021-01-01 00:00:00+00:00  b   e         h

I would like to pivot the category column (which contains on the order of 50 different categories) and collapse rows that share the same timestamp and id, so that the result looks like this:

id  timestamp                  d    e
a   2021-01-01 00:00:00+00:00  g    nan
a   2021-01-01 00:10:00+00:00  h    h
b   2021-01-01 00:00:00+00:00  nan  h

I know how I would achieve this in pandas using a MultiIndex together with the stack/unstack operations; however, my dataset is far too large to use pandas without manual batch processing, and Dask does not support MultiIndexes. Is there some way this can be done efficiently with Dask?

Edit:

As suggested by @Dahn, I've created a minimal synthetic example with pandas:


import pandas as pd

records = [
    {'idx': 0, 'id': 'a', 'category': 'd', 'value': 1},
    {'idx': 1, 'id': 'a', 'category': 'e', 'value': 2},
    {'idx': 2, 'id': 'a', 'category': 'f', 'value': 3},
    {'idx': 0, 'id': 'b', 'category': 'd', 'value': 4},
    {'idx': 1, 'id': 'c', 'category': 'e', 'value': 5},
    {'idx': 2, 'id': 'c', 'category': 'f', 'value': 6}
]

frame = pd.DataFrame(records)
#    idx id category  value
# 0    0  a        d      1
# 1    1  a        e      2
# 2    2  a        f      3
# 3    0  b        d      4
# 4    1  c        e      5
# 5    2  c        f      6

frame = frame.set_index(['id', 'idx', 'category'], drop=True).unstack().droplevel(0, axis=1).reset_index()
frame.columns.name = ''
#   id  idx    d    e    f
# 0  a    0  1.0  NaN  NaN
# 1  a    1  NaN  2.0  NaN
# 2  a    2  NaN  NaN  3.0
# 3  b    0  4.0  NaN  NaN
# 4  c    1  NaN  5.0  NaN
# 5  c    2  NaN  NaN  6.0
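
As an aside (my addition, not part of the original post): the same wide table can also be produced with pandas' pivot_table, which sidesteps building the MultiIndex explicitly:

import pandas as pd

frame = pd.DataFrame(records)
# pivot_table aggregates duplicate (id, idx) pairs; its default mean is fine here
wide = frame.pivot_table(index=['id', 'idx'], columns='category', values='value').reset_index()
wide.columns.name = None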


sobek
  • I don't really understand the question. I don't see where the `d` and `e` columns are coming from, but might be missing something obvious. – Powers Oct 08 '21 at 16:45
  • @Powers They are the values of the `category` column pivoted to be columns. Think `pandas'` `unstack` operation applied to a multiindex. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html – sobek Oct 10 '21 at 14:29
  • IMO it would help the question if you provide example data in machine-readable form and Pandas implementation. – Dahn Oct 11 '21 at 08:05
  • @Dahn I have added an example of how to perform such a pivot operation with `pandas`. – sobek Oct 11 '21 at 08:54

2 Answers


I don't believe Dask implements this as of October 2021. This is likely because there's no support for multi-index, which unstack requires. There has been some work on this recently though.

However, I think this should still be possible using the apply-concat-apply paradigm (and the apply_concat_apply function).

The solution below works for the example you've provided and, in principle, I think it should work generally, but I am not sure. Please proceed with caution and, if possible, check that the results agree with what Pandas gives you. I have also posted this as a feature request on Dask's GitHub.
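
For intuition, here is a rough, simplified sketch of what the apply-concat-apply pattern does (my paraphrase only; the real implementation can also run intermediate combine rounds, which is what the combine= argument further down is for):

import pandas as pd

def aca_sketch(partitions, chunk, aggregate):
    # 1. `chunk` runs independently on every partition
    pieces = [chunk(part) for part in partitions]
    # 2. the per-partition results are concatenated into one frame
    concatenated = pd.concat(pieces)
    # 3. `aggregate` runs once on the concatenation to produce the final result
    return aggregate(concatenated)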

import dask.dataframe as dd

# Create Dask DataFrame out of your `frame`
# npartitions is more than 1 to demonstrate this works on a partitioned dataset
df = dd.from_pandas(frame, npartitions=3)

# Dask needs to work out what the categories are
# Alternatively you can use df.categorize
# See https://docs.dask.org/en/latest/dataframe-categoricals.html
category = 'category'  # column name, which is conveniently also the name of the pandas dtype
df[category] = df[category].astype(category).cat.as_known()

# Dask needs to know what the resulting DataFrame looks like
new_columns = pd.CategoricalIndex(df[category].cat.categories, name=category)
meta = pd.DataFrame(columns=new_columns, 
                    index=df._meta.set_index(['idx', 'id']).index)

# Implement using apply_concat_apply ("aca")
# More details: https://blog.dask.org/2019/10/08/df-groupby
def identity(x): return x

def my_unstack(x):
    return x.set_index(['id', 'idx', 'category'], drop=True).unstack()
    
def combine(x):
    return x.groupby(level=[0, 1]).sum()

result = dd.core.apply_concat_apply([df], 
                   chunk=identity, 
                   aggregate=my_unstack, 
                   combine=combine,
                   meta=meta)

result.compute()
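
As a quick sanity check on the synthetic data (my addition; it assumes the computation above succeeds, and compares values only, since the column labels may differ between the Dask meta and pandas' ('value', category) MultiIndex):

import numpy as np

# Plain-pandas reference computed from the original long-format records
expected = pd.DataFrame(records).set_index(['id', 'idx', 'category'], drop=True).unstack()

actual = result.compute()

assert np.allclose(
    np.asarray(actual.sort_index(), dtype=float),
    np.asarray(expected.sort_index(), dtype=float),
    equal_nan=True,
)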

Option B: map_partitions

If you are already able to sort the data according to at least one of idx or id, then you could also simply use map_partitions and treat each partition as a Pandas dataframe.

This should lead to significant improvement in memory usage and performance overall.
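
To make the scenario concrete on the synthetic data from the question (this setup is my assumption, not part of the original answer), df here would be the long-format frame re-indexed and sorted by idx, so that each partition covers a contiguous range of idx:

# Hypothetical setup: `idx` as the (sorted, possibly non-unique) Dask index
frame_long = pd.DataFrame(records)
df = dd.from_pandas(frame_long.set_index('idx').sort_index(), npartitions=3)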

# df has sorted index `idx` in this scenario

category = 'category'
existing_categories = df[category].astype(category).cat.as_known().cat.categories
categories = [('value', cat) for cat in existing_categories]

new_columns = pd.MultiIndex.from_tuples(categories, names=(None, category))

meta = pd.DataFrame(columns=new_columns, 
                    index=df._meta.set_index(['idx', 'id']).index)

def unstack_add_columns(x):
    x = x.set_index(['id', 'category'], append=True, drop=True).unstack()
    # make sure that result contains all necessary columns
    return x.reindex(columns=new_columns) 

df.map_partitions(unstack_add_columns, meta=meta)

If you cannot guarantee idx will be sorted, you could possibly try something like

df_sorted = df.set_index('idx')
# I recommend saving to disk in between set_index and the rest
df_sorted.to_parquet('data-sorted.parq')

but that might itself cause memory problems.
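
From there (a sketch of my own wiring, with hypothetical file names), the sorted data could be read back and processed partition by partition, writing the result straight back to disk instead of calling compute():

# Read the pre-sorted data back; each partition now covers a contiguous `idx` range
df_sorted = dd.read_parquet('data-sorted.parq')

# Apply the per-partition unstack from Option B
result = df_sorted.map_partitions(unstack_add_columns, meta=meta)

# Flatten the MultiIndex columns before writing back out (Parquet cannot store
# them directly); see the second answer below for one way to do that.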

Dahn
  • I wasn't aware of this paradigm, thanks! I'll check this with a reasonable amount of data and accept it once I'm sure it works. – sobek Oct 11 '21 at 10:54
  • Thanks, that will certainly help give some credence to this answer. – Dahn Oct 11 '21 at 10:56
  • On a small subset of the data this seems to work perfectly, on larger subsets I always run into out-of-memory issues though. I've tried playing with the blocksize when I load the data from avro, but that hasn't really helped. Do you think that repartitioning could help before performing this pivot step? – sobek Oct 12 '21 at 11:52
  • @sobek `compute` will compute the entire dataframe and return the result of the computation to your local memory. For larger datasets, you probably want to again save to disk instead, or process the data further into some smaller form. – Dahn Oct 12 '21 at 12:04
  • Ah sorry that was unclear of me, I do indeed save the result to `parquet`. I still run into OOM issues. – sobek Oct 12 '21 at 12:06
  • That's always a bit of a difficult question. `npartitions`, `n_workers`, `threads_per_worker`, and `memory_limit` (the last controls when data gets spilled to disk, doesn't truly "limit" memory usage) are parameters I'd recommend to try first. It's very dependent on the data layout as well though. – Dahn Oct 12 '21 at 12:26
  • I've tried testing this on larger data and made some changes to `my_unstack` to account for the fact that it needs to be able to handle the resulting dataframes as well. Nothing regarding performance though. – Dahn Oct 12 '21 at 13:09
  • @sobek I have tested the solution and while it was faster than Pandas, it also will eventually need to load the whole dataset and thus is memory-heavy, since individual rows could be scattered across partitions. I have added another option into the answer, if you can guarantee `idx` to be sorted before, then this should hopefully do the trick. – Dahn Oct 12 '21 at 14:08
  • Option B works nicely, I just had to wrangle the frame some more to get back to a non-multiindexed representation, see my answer below. Thank you very much for your in-depth help! – sobek Oct 13 '21 at 10:00
  • Glad it helped! – Dahn Oct 13 '21 at 10:06
  • Dask says it doesn't support multiindex, yet a multiindex appears here, how does it make sense? :) – creanion Oct 31 '21 at 10:13
  • @creanion that's because the index is set on a Pandas dataframe in `map_partitions`. Dask doesn't know about it, it still thinks just the `idx` is the index. – Dahn Oct 31 '21 at 16:14
  • Is that also true in the `apply_concat_apply`? – creanion Oct 31 '21 at 16:36
  • Yes, each of the functions in `apply_concat_apply` operates on Pandas dataframes – Dahn Oct 31 '21 at 17:09

As an addendum to Dahn's answer, to get back to a frame without a multi-level index, I did the following:

meta = pd.DataFrame(
        columns=['level_0', 'idx', 'id'] + [x for x in existing_categories],
        index=df._meta.reset_index().index
    )

def reset_index(x):
    x = x.droplevel(0, axis=1)  # drop the top ('value') level from the columns
    x.columns.name = None       # clear the leftover 'category' columns name
    return x.reset_index()

df = df.map_partitions(reset_index, meta=meta).drop('level_0', axis=1)

There may be a more elegant solution to achieve this, but it works for me.

sobek