How to match column values and extract indices in siuba?

Question

Objective and data

For each frame_id, I want to look up each value of preceding in vehicle_id and put the matching row's v_vel in a new column called preceding_vel. I want to use the siuba Python package for this. Here is my dataframe:

    import pandas as pd
    
    df_mini_dict = {'vehicle_id': {884: 2, 885: 2, 886: 2, 14148: 44, 14149: 44, 14150: 44},
                    'frame_id': {884: 338, 885: 339, 886: 340, 14148: 338, 14149: 339, 14150: 340},
                    'preceding': {884: 44, 885: 44, 886: 44, 14148: 3355, 14149: 3355, 14150: 3355},
                    'v_vel': {884: 6.299857770322456, 885: 6.427411525504063, 886: 6.590098168958994, 14148: 7.22883474245701, 14149: 6.973590500351793, 14150: 6.727721962795176}}
    
    df_mini = pd.DataFrame.from_dict(df_mini_dict)

Working R solution

I can achieve the objective by using the following code:

df_mini <- structure(list(vehicle_id = c(2L, 2L, 2L, 44L, 44L, 44L), 
                          frame_id = c(338L, 339L, 340L, 338L, 339L, 340L), 
                          preceding = c(44L, 44L, 44L, 3355L, 3355L, 3355L), 
                          v_vel = c(6.29985777032246, 6.42741152550406, 
                                    6.59009816895899, 7.22883474245701, 
                                    6.97359050035179, 6.72772196279518), 
                          preceding_vel = c(7.22883474245701, 6.97359050035179, 
                                            6.72772196279518, NA, NA, NA)), 
                     class = c("tbl_df", "tbl", "data.frame"), 
                     row.names = c(NA, -6L))

library(dplyr)

df_mini <- df_mini |> 
  dplyr::group_by(frame_id) |>  
  dplyr::mutate(preceding_vel = v_vel[match(preceding, vehicle_id)]) |> 
  dplyr::ungroup()

Python attempt

Essentially, I am trying to do in siuba what dplyr does, but it seems that I need to use index() to do what match does. I tried the following, unsuccessfully:

def match(x, table):
  indicez = []
  for i in x:
    indicez.append(table.index(i))
  return indicez

from siuba import *

df_mini = (
    df_mini
    >> group_by(_.frame_id)  # grouping by frame id
    >> mutate(preceding_vel = _.v_vel[match(_.preceding, _.vehicle_id)])
)

TypeError: 'Symbolic' object is not iterable

What is the best way to define the match function, or is there something else I can use to meet the objective? Thanks.

Note that to use `siuba` you would have to define a function that dispatches on the type in question. Also, since match can return `NA` and Python has no NA integers, you would have to define a `__getitem__` method that handles this. That might be overkill. Instead, just use pandas as shown below. — Onyambu, Jul 18 '23 at 21:56

Onyambu · Answer 1 · 2023-07-18T21:00:56.530

Use merge:

df_mini.merge(df_mini.drop(columns='preceding') , 
              left_on = ['frame_id', 'preceding'], 
              right_on = ['frame_id', 'vehicle_id'], how = 'left', 
              suffixes=['', '_y'])
Out[17]: 
   vehicle_id  frame_id  preceding     v_vel  vehicle_id_y   v_vel_y
0           2       338         44  6.299858          44.0  7.228835
1           2       339         44  6.427412          44.0  6.973591
2           2       340         44  6.590098          44.0  6.727722
3          44       338       3355  7.228835           NaN       NaN
4          44       339       3355  6.973591           NaN       NaN
5          44       340       3355  6.727722           NaN       NaN
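To get only the preceding_vel column the question asks for, the lookup side can be trimmed and renamed before merging. This is a minimal sketch, assuming the same df_mini as in the question and assuming each (frame_id, vehicle_id) pair occurs only once; the names lookup and result are made up for illustration.

```python
import pandas as pd

# Same data as df_mini in the question, including its original index labels.
df_mini = pd.DataFrame(
    {
        "vehicle_id": [2, 2, 2, 44, 44, 44],
        "frame_id": [338, 339, 340, 338, 339, 340],
        "preceding": [44, 44, 44, 3355, 3355, 3355],
        "v_vel": [6.299857770322456, 6.427411525504063, 6.590098168958994,
                  7.22883474245701, 6.973590500351793, 6.727721962795176],
    },
    index=[884, 885, 886, 14148, 14149, 14150],
)

# Lookup table (hypothetical name): rename the columns so the keys line up
# with the left-hand side and the value already has its final name.
lookup = df_mini[["frame_id", "vehicle_id", "v_vel"]].rename(
    columns={"vehicle_id": "preceding", "v_vel": "preceding_vel"}
)

# validate="many_to_one" raises if a (frame_id, preceding) key is duplicated
# in lookup, which would otherwise add rows to the result.
result = df_mini.merge(
    lookup, on=["frame_id", "preceding"], how="left", validate="many_to_one"
)

# A left join on unique right-hand keys keeps the row count and order of
# df_mini, so its index labels can be restored.
result.index = df_mini.index
print(result)
```

Unlike R's match, which silently takes the first match, this version fails loudly on duplicated keys; whether that is preferable depends on the data.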

Same logic in R:

df_mini %>%
   left_join(select(., !preceding),
             c(preceding = 'vehicle_id', frame_id = 'frame_id'), 
               suffix = c("", "_y"))

# A tibble: 6 × 5
  vehicle_id frame_id preceding v_vel v_vel_y
       <int>    <int>     <int> <dbl>   <dbl>
1          2      338        44  6.30    7.23
2          2      339        44  6.43    6.97
3          2      340        44  6.59    6.73
4         44      338      3355  7.23   NA   
5         44      339      3355  6.97   NA   
6         44      340      3355  6.73   NA

Thank you! This solution is user-friendly. — umair durrani, Jul 19 '23 at 14:56

score 1 · Accepted Answer · answered Jul 19 '23 at 01:50

Using siuba as requested:

We can rewrite the match function as shown below:

from siuba import _, mutate, group_by
from siuba.siu import symbolic_dispatch
import pandas as pd

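# Dispatch on pd.Series; when called with `_` expressions, the call is evaluated later inside mutate
@symbolic_dispatch(cls = pd.Series)
def match_get(a, b, c):
    # Map each value in b to its position
    b_dict = {x: int(i) for i, x in enumerate(b)}
    # Look up each value of a; return None where there is no match
    return [None if (d:=b_dict.get(x, None)) is None else c[d] for x in a]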

(df_mini >> 
 group_by(_.frame_id) >> 
 mutate(preceding_val = match_get(_.preceding, _.vehicle_id, _.v_vel.values)))

       vehicle_id  frame_id  preceding     v_vel  preceding_val
884             2       338         44  6.299858       7.228835
885             2       339         44  6.427412       6.973591
886             2       340         44  6.590098       6.727722
14148          44       338       3355  7.228835            NaN
14149          44       339       3355  6.973591            NaN
14150          44       340       3355  6.727722            NaN
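To see what the lookup does within a single frame_id group, the same idea can be run on plain lists without siuba. This is only a sketch: match_lookup is a hypothetical name, and the inputs are the two rows of frame 338 from the question's data.

```python
def match_lookup(a, b, c):
    """For each value in a, find its position in b and return c at that position.

    Values of a that do not occur in b give None.
    """
    positions = {}
    for i, x in enumerate(b):
        # setdefault keeps the first position of a repeated value,
        # which is what R's match() returns.
        positions.setdefault(x, i)
    return [c[positions[x]] if x in positions else None for x in a]


# Frame 338: vehicle 2 follows vehicle 44; vehicle 44 follows 3355 (not present).
preceding = [44, 3355]
vehicle_id = [2, 44]
v_vel = [6.299857770322456, 7.22883474245701]

print(match_lookup(preceding, vehicle_id, v_vel))
```

The dict comprehension in the answer keeps the last position of a repeated value instead of the first; the two agree whenever vehicle_id is unique within a frame.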

Thank you so much for providing this solution. I am new to Python data analysis, so I was wondering whether `@symbolic_dispatch` is something a siuba user would write, or whether it is generally used by developers? — umair durrani, Jul 19 '23 at 14:55
@umairdurrani Users write dispatch methods all the time, not only here. — Onyambu, Jul 19 '23 at 15:57
