Note on the suggested duplicate for this question: categorical sorting does not work here, because only a subset of the strings in the column takes part in the sort. If you convert the column to a categorical with just those strings as categories, every unlisted string becomes a null.
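A minimal sketch of that problem, using a shortened, made-up sample of the events rather than the real data:

```python
import pandas as pd

# Shortened sample of events; only some of them appear in events_to_sort.
events = pd.Series(['wer', 'up', 'def', 'dog', 'fast'])
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']

# Values that are not listed in `categories` are converted to NaN.
as_categorical = pd.Categorical(events, categories=events_to_sort, ordered=True)
categorical_events = pd.Series(as_categorical)

print(categorical_events.isna().tolist())
# [True, False, True, False, False]  ('wer' and 'def' are lost)
```

So the unlisted events ('wer', 'def') would have to be restored afterwards, which is why the categorical approach from the duplicate does not apply directly.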
Original Question: I have a working example, but I feel like there must be a better / more efficient way to calculate these results.
I have a large data frame of machine data where the order of events is not maintained properly within each timestamp. The input event column below shows what this looks like. In the sorted_output column, the selected events have been reordered according to the events_to_sort list within each timestamp, while all other events stay in their original rows.
The input is the event column under "input". The desired output is the event column under "sorted_output" (last column). Horizontal lines have been added to show that sorting happens only within each timestamp block.
Timestamps have been simplified to integers. Event names have been simplified as well: in the real data they are full string names, and the sort order is not alphabetical.
Is there a more efficient way to do this?
                 input      sorted_output
     timestamp   event      event
0    0           wer        wer
_________________________________________
1    1           up         dog
2    1           def        def
3    1           abc        abc
4    1           dog        fast
5    1           prq        prq
6    1           cde        cde
7    1           fast       up
8    1           bnm        bnm
_________________________________________
9    2           ert        ert
10   2           and        and
11   2           ert        ert
12   2           ghj        ghj
13   2           streets    down
14   2           down       streets
_________________________________________
15   3           runs       dog
16   3           dog        runs
17   3           ert        ert
18   3           up         up
19   3           dfg        dfg
20   3           prq        prq
Working code
import pandas as pd
df = pd.DataFrame(
    [
        {'timestamp': 0, 'event': 'wer'},
        {'timestamp': 1, 'event': 'up'},
        {'timestamp': 1, 'event': 'def'},
        {'timestamp': 1, 'event': 'abc'},
        {'timestamp': 1, 'event': 'dog'},
        {'timestamp': 1, 'event': 'prq'},
        {'timestamp': 1, 'event': 'cde'},
        {'timestamp': 1, 'event': 'fast'},
        {'timestamp': 1, 'event': 'bnm'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'and'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'ghj'},
        {'timestamp': 2, 'event': 'streets'},
        {'timestamp': 2, 'event': 'down'},
        {'timestamp': 3, 'event': 'runs'},
        {'timestamp': 3, 'event': 'dog'},
        {'timestamp': 3, 'event': 'ert'},
        {'timestamp': 3, 'event': 'up'},
        {'timestamp': 3, 'event': 'dfg'},
        {'timestamp': 3, 'event': 'prq'},
    ]
)
df = df[['timestamp', 'event']]
# events to sort, listed in the desired order (not actually alphabetical; this is mock data)
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']
# this method gleaned from here https://stackoverflow.com/questions/23482668/sorting-by-a-custom-list-in-pandas
sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))
# create a temporary rank column for sorting
df['sort_col'] = df['event'].map(sorter_index)
ev_ind = df.event.isin(events_to_sort)
# loop through each timestamp block
for time in df.timestamp.unique():
    # limit to only sortable events within the timestamp
    section_index = df.timestamp.eq(time) & ev_ind
    df_temp = df.loc[section_index]
    if len(df_temp) > 1:
        # if there is more than one sortable event, sort them and write the values back to the original df
        df.loc[section_index, 'event'] = df_temp.sort_values(by='sort_col')['event'].values
# drop the temporary sorting column
df = df.drop('sort_col', axis=1)
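For comparison, here is one possible loop-free direction, offered only as a sketch: I have not benchmarked it against the loop above. The helper name `sort_events_within_timestamps` is hypothetical, and the small frame is a cut-down version of the mock data. The idea is to take the row positions of the sortable events grouped by timestamp, take the sortable events ordered by timestamp and then by rank, and write the second into the first in one assignment.

```python
import numpy as np
import pandas as pd


def sort_events_within_timestamps(df, events_to_sort):
    """Hypothetical helper: reorder only the listed events inside each timestamp."""
    # Rank of each event in the custom order; NaN for events that are not listed.
    rank = df['event'].map({event: i for i, event in enumerate(events_to_sort)})
    mask = rank.notna().to_numpy()

    positions = np.flatnonzero(mask)            # row positions of the sortable events
    timestamps = df['timestamp'].to_numpy()[mask]
    ranks = rank.to_numpy()[mask]
    events = df['event'].to_numpy()[mask]

    # np.lexsort treats the LAST key as the primary key.
    slot_order = np.lexsort((positions, timestamps))   # slots: by timestamp, then row position
    value_order = np.lexsort((ranks, timestamps))      # values: by timestamp, then custom rank

    out = df['event'].to_numpy().copy()
    out[positions[slot_order]] = events[value_order]
    return df.assign(event=out)


# Cut-down version of the mock data above.
example = pd.DataFrame({
    'timestamp': [0, 1, 1, 1, 1, 2, 2],
    'event': ['wer', 'up', 'def', 'dog', 'fast', 'streets', 'down'],
})
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']

result = sort_events_within_timestamps(example, events_to_sort)
print(result['event'].tolist())
# ['wer', 'dog', 'def', 'fast', 'up', 'down', 'streets']
```

Unlisted events ('wer', 'def') keep their rows, and the function returns a new frame instead of modifying the input. Duplicate sortable events within one timestamp tie on rank, but since tied values are identical strings the result is the same either way.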