I have a large dataset in the form of the following dataframe, which I previously loaded from Avro files:
timestamp | id | category | value |
---|---|---|---|
2021-01-01 00:00:00+00:00 | a | d | g |
2021-01-01 00:10:00+00:00 | a | d | h |
2021-01-01 00:10:00+00:00 | a | e | h |
2021-01-01 00:00:00+00:00 | b | e | h |
I would like to pivot the `category` column (which contains on the order of 50 distinct categories) and merge rows that share the same `timestamp` and `id`, so that the result looks like this:
id | timestamp | d | e |
---|---|---|---|
a | 2021-01-01 00:00:00+00:00 | g | nan |
a | 2021-01-01 00:10:00+00:00 | h | h |
b | 2021-01-01 00:00:00+00:00 | nan | h |
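For reference, this is essentially what pandas' `pivot_table` produces with `aggfunc='first'` (needed here because the values are strings, so there is nothing numeric to aggregate). A minimal sketch on the toy rows above:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2021-01-01 00:00:00+00:00', '2021-01-01 00:10:00+00:00',
        '2021-01-01 00:10:00+00:00', '2021-01-01 00:00:00+00:00',
    ]),
    'id': ['a', 'a', 'a', 'b'],
    'category': ['d', 'd', 'e', 'e'],
    'value': ['g', 'h', 'h', 'h'],
})

# 'first' just keeps the single value of each (id, timestamp, category) group
wide = df.pivot_table(index=['id', 'timestamp'], columns='category',
                      values='value', aggfunc='first').reset_index()
```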
I know how I would achieve this in pandas, using a MultiIndex together with the `stack`/`unstack` operations. However, my dataset is far too large to process with pandas without manual batching, and dask does not support a MultiIndex. Is there some way this can be done efficiently with dask?
Edit: As noted by @Dahn, I've created a minimal synthetic example with pandas:
import pandas as pd

records = [
    {'idx': 0, 'id': 'a', 'category': 'd', 'value': 1},
    {'idx': 1, 'id': 'a', 'category': 'e', 'value': 2},
    {'idx': 2, 'id': 'a', 'category': 'f', 'value': 3},
    {'idx': 0, 'id': 'b', 'category': 'd', 'value': 4},
    {'idx': 1, 'id': 'c', 'category': 'e', 'value': 5},
    {'idx': 2, 'id': 'c', 'category': 'f', 'value': 6},
]
frame = pd.DataFrame(records)
idx id category value
0 0 a d 1
1 1 a e 2
2 2 a f 3
3 0 b d 4
4 1 c e 5
5 2 c f 6
# move the keys into a MultiIndex, unstack 'category' into columns,
# then drop the leftover 'value' level and flatten back to plain columns
frame = (
    frame.set_index(['id', 'idx', 'category'], drop=True)
         .unstack()
         .droplevel(0, axis=1)
         .reset_index()
)
frame.columns.name = ''
id idx d e f
0 a 0 1.0 NaN NaN
1 a 1 NaN 2.0 NaN
2 a 2 NaN NaN 3.0
3 b 0 4.0 NaN NaN
4 c 1 NaN 5.0 NaN
5 c 2 NaN NaN 6.0
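For completeness, the closest I've gotten with dask itself is `dask.dataframe`'s `pivot_table`, which (as far as I understand) only accepts a single index column and requires the pivot column to be a categorical with known categories. The `key` column below is my own workaround for the missing MultiIndex, not an established pattern, and I don't know whether it is efficient at scale:

```python
import pandas as pd
import dask.dataframe as dd

records = [  # the same synthetic records as above
    {'idx': 0, 'id': 'a', 'category': 'd', 'value': 1},
    {'idx': 1, 'id': 'a', 'category': 'e', 'value': 2},
    {'idx': 2, 'id': 'a', 'category': 'f', 'value': 3},
    {'idx': 0, 'id': 'b', 'category': 'd', 'value': 4},
    {'idx': 1, 'id': 'c', 'category': 'e', 'value': 5},
    {'idx': 2, 'id': 'c', 'category': 'f', 'value': 6},
]
ddf = dd.from_pandas(pd.DataFrame(records), npartitions=2)

# pivot_table only takes a single index column, so fuse the two key
# columns into one synthetic key ('key' is a made-up name/workaround)
ddf['key'] = ddf['id'].astype(str) + '|' + ddf['idx'].astype(str)

# the pivot column has to be a categorical with known categories
ddf['category'] = ddf['category'].astype('category').cat.as_known()

# 'sum' works because each group holds exactly one numeric value; for
# string values like in my real data I'd need aggfunc='first', which
# newer dask versions support as far as I can tell
pivoted = ddf.pivot_table(index='key', columns='category',
                          values='value', aggfunc='sum')

# split the synthetic key back into its original components
result = pivoted.reset_index()
parts = result['key'].str.split('|', n=1, expand=True)
result['id'] = parts[0]
result['idx'] = parts[1].astype(int)
print(result.drop(columns='key').compute())
```

Is a combined-key workaround like this the intended approach, or is there something more idiomatic (and more efficient) in dask?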