I have some ordered data with a hierarchy of events. Each column is an id of an event that is unique with respect to the event one level above it in the hierarchy, similar to how each day number is unique within a month and each month number is unique within a year. I want the lowest level to be unique within the highest level, like making every day unique within a year by numbering from 1 to 365. My use case is not specific to days, months, and years.
Before:
| ID | EVENT_1 | EVENT_2 | EVENT_3 |
| -- | ------- | ------- | ------- |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 1 | 3 | 1 |
| 1 | 1 | 3 | 2 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 2 |
After:
| ID | EVENT_1 | EVENT_2 | EVENT_3 | EVENT_3A |
| -- | ------- | ------- | ------- | -------- |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 | 2 |
| 1 | 1 | 1 | 3 | 3 |
| 1 | 1 | 2 | 1 | 4 |
| 1 | 1 | 2 | 2 | 5 |
| 1 | 1 | 3 | 1 | 6 |
| 1 | 1 | 3 | 2 | 7 |
| 1 | 2 | 1 | 1 | 1 |
| 1 | 2 | 1 | 2 | 2 |
The goal is a column EVENT_3A such that, within each ID, EVENT_3A gives the order in which EVENT_3 occurs with respect to EVENT_1 (as if there were no EVENT_2). Additionally, there are many IDs for which this must be calculated independently. Right now I am doing this on the CPU, but it takes a long time, so I would like to switch to the GPU.
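For concreteness, here is a minimal pandas sketch of the logic I want to reproduce (it assumes the rows are already sorted in event order; `cumcount` is the pandas API, and I do not know the cuDF equivalent):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":      [1, 1, 1, 1, 1, 1, 1, 1, 1],
    "EVENT_1": [1, 1, 1, 1, 1, 1, 1, 2, 2],
    "EVENT_2": [1, 1, 1, 2, 2, 3, 3, 1, 1],
    "EVENT_3": [1, 2, 3, 1, 2, 1, 2, 1, 2],
})

# Number rows sequentially within each (ID, EVENT_1) group;
# this relies on the rows already being in event order.
df["EVENT_3A"] = df.groupby(["ID", "EVENT_1"]).cumcount() + 1
```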
My main idea is to do a `groupby('ID').apply_grouped()` or a `groupby('ID').agg()`, but I do not know what to put in the `apply_grouped()` or `agg()` functions. I was doing this before with Dask on the CPU, and it was more intuitive because the grouped DataFrame was passed directly to the `apply()` function. It seems that in cuDF I have to pass the `incols`, and I am unable to figure out how to treat those as a DataFrame.
There are approximately 5,000 IDs, so ideally each grouped ID would be processed by its own GPU core, but I am not sure whether it can work like that, since I am new to GPU programming.
Any suggestions or solutions are helpful, thank you.