Efficient sorting of select rows within same timestamps according to custom order

Question

Note for the suggested duplicate on this question: Categorical sorting does not work for this as it only uses a subset of the strings in the column for sorting. if you set this to a categorical index it will make all the non listed 'categories' /strings nulls.

Original Question: I have a working example but I feel like there must be a better / more efficient way to calculate these results.

I have a large data frame of machine data where the order of events is not maintained properly within each timestamp. This looks like the input event column below. You can see the selected events have been reordered according to the event_order list within each timestamp.

Input is event. Desired output is sorted_output event in last column. Horizontal lines added to show sorting is only within each timestamp block.

timestamps have been simplified to integers. event names have been simplified as well. These are not the alphabet but full string names in the non example data.

Is there a more efficient way to do this?

                  input      sorted_output
    timestamp     event      event
0           0      wer       wer   
_________________________________
1           1       up       dog
2           1      def       def
3           1      abc       abc
4           1      dog      fast
5           1      prq       prq
6           1      cde       cde
7           1     fast        up
8           1      bnm       bnm
_________________________________
9           2      ert       ert
10          2      and       and
11          2      ert       ert
12          2      ghj       ghj
13          2  streets      down
14          2     down   streets
_________________________________
15          3     runs       dog
16          3      dog      runs
17          3      ert       ert
18          3       up        up
19          3      dfg       dfg
20          3      prq       prq

Working code

import pandas as pd

df = pd.DataFrame(
    [
        {'timestamp': 0, 'event': 'wer'},
        {'timestamp': 1, 'event': 'up'},
        {'timestamp': 1, 'event': 'def'},
        {'timestamp': 1, 'event': 'abc'},
        {'timestamp': 1, 'event': 'dog'},
        {'timestamp': 1, 'event': 'prq'},
        {'timestamp': 1, 'event': 'cde'},
        {'timestamp': 1, 'event': 'fast'},
        {'timestamp': 1, 'event': 'bnm'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'and'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'ghj'},
        {'timestamp': 2, 'event': 'streets'},
        {'timestamp': 2, 'event': 'down'},
        {'timestamp': 3, 'event': 'runs'},
        {'timestamp': 3, 'event': 'dog'},
        {'timestamp': 3, 'event': 'ert'},
        {'timestamp': 3, 'event': 'up'},
        {'timestamp': 3, 'event': 'dfg'},
        {'timestamp': 3, 'event': 'prq'},
    ]
)
df = df[['timestamp', 'event']]

# events to sort in order (they aren't actually alphabetical this is mock data)
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']

# this method gleaned from here https://stackoverflow.com/questions/23482668/sorting-by-a-custom-list-in-pandas
sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))

# create a temporary rank column for sorting
df['sort_col'] = df['event'].map(sorter_index)

ev_ind = df.event.isin(events_to_sort)

# loop through each timestamp block
for time in df.timestamp.unique():
    # limit to only sortable events within the timestamp
    section_index = df.timestamp.eq(time) & ev_ind
    df_temp = df.loc[section_index]

    if len(df_temp) > 1:
        # if there is more than 1 sortable event tag sort and set the values back to the original df
        df.loc[section_index, 'event'] = df_temp.sort_values(by='sort_col')['event'].values

# drop temp sorting col
df = df.drop('sort_col', axis=1)

@KenHBS I do not think the link address this problem – BENY Jun 24 '19 at 16:38 — BENY, Jun 24 '19 at 16:38

BENY · Accepted Answer · 2019-06-24T17:29:30.047

3

In your case

s=df.loc[df.event.isin(events_to_sort)].copy()
s.event=pd.Categorical(s.event,categories=events_to_sort,ordered=True)
s=s.sort_values(['timestamp','event'])
s.index=sorted(s.index)
df=s.combine_first(df)

edited Jun 24 '19 at 17:29

answered Jun 24 '19 at 16:28

BENY

317,841
20
164
234

I think you missed the note about the tags not actually being alphabetical except in this example data. This method will not work in current state if the tags are not alphabetical. Maybe if we take this and combine with a set categorical on the series before sorting it will then work. – ak_slick Jun 24 '19 at 16:48
@ak_slick check the update – BENY Jun 24 '19 at 17:29

score 0 · Answer 2 · answered Jun 24 '19 at 17:31

WeNYoBen's answer got me thinking and filled in a missing piece of the puzzle for me. Here are two solutions that will work. One using categorical sorting and one using the mapped sorting.

Solution 1 map sorting (uses additional sort column)

sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))

# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()

# make sort column
s['sort_col'] = s['event'].map(sorter_index)

# do sorting by sort columns first then timestamp
s = s.sort_values(['timestamp', 'sort_col'])

# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)

# remove the temporary sort_col
s.drop('sort_col', axis=1, inplace=True)

# place sorted events back into original df in the correct location
df = s.combine_first(df)

Solution 2 categorical sorting

# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()

# convert event column to categorical type
s.event = s.event.astype('category')

# set category sort order
s['event'] = s['event'].cat.set_categories(events_to_sort)


# sort by event then timestamp
s = s.sort_values(['timestamp', 'event'])

# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)

# place sorted events back into original df in the correct location
df = s.combine_first(df)

Both output:

    timestamp    event
0         0.0      wer
1         1.0      dog
2         1.0      def
3         1.0      abc
4         1.0     fast
5         1.0      prq
6         1.0      cde
7         1.0       up
8         1.0      bnm
9         2.0      ert
10        2.0      and
11        2.0      ert
12        2.0      ghj
13        2.0     down
14        2.0  streets
15        3.0      dog
16        3.0     runs
17        3.0      ert
18        3.0       up
19        3.0      dfg
20        3.0      prq

Efficient sorting of select rows within same timestamps according to custom order

2 Answers2