2

I want to be able to get the manager_id of each manager recursively using a polars DF which has two columns:

"employee_id", "manager_1_id"

In pandas, this code was:

id_index = df.set_index("employee_id")["manager_1_id"]
for i in range(1, 12):
    df[f"manager_{str(i + 1)}_id"] = df[f"manager_{str(i)}_id"].map(id_index)

Each manager_id value is also an employee ID and ultimately I want a column per manager:

"employee_id, manager_1_id, manager_2_id, manager_3_id, ..."

Is there a good way to achieve this with polars without running the pandas snippet? I was attempting to loop through some left joins, but it did not seem like a great approach.

Edit: example is below. The raw data has two columns, the employee ID (all many thousands of employees) and the employee ID of their direct manager.

employee_id | manager_1_id
    1       |      3
    2       |      5
    3       |      4
    4       |      5
    5       |

Goal is to expand this out into columns (manager_1 to manager_12)

employee_id | manager_1_id | manager_2_id | manager_3_id | ...
     1              3             4              5
     2              5       
     3              4             5
     5              5
     5

Hopefully that is clear. Employee 1 reports to Employee 3, who reports to Employee 4, who reports to Employee 5. Employee 5 is the CEO who reports to no one.

Laurent
  • 12,287
  • 7
  • 21
  • 37
ldacey
  • 518
  • 8
  • 16

2 Answers2

3

Here is one way to do the equivalent of your Pandas snippet in Polars:

import polars as pl

df = pl.DataFrame({"employee_id": range(1, 6), "manager_1_id": [3, 5, 4, 5, None]})
id_index = dict(df.iter_rows())
for i in range(1, df.select("manager_1_id").max().item()):
    df = df.with_columns(
        pl.col(f"manager_{i}_id")
        .map_dict(id_index)
        .alias(f"manager_{i+1}_id")
    )
# Remove empty columns
df = df[[s.name for s in df if not (s.null_count() == df.height)]].fill_null(0)

Then:

print(df)
# Output

┌─────────────┬──────────────┬──────────────┬──────────────┐
│ employee_id ┆ manager_1_id ┆ manager_2_id ┆ manager_3_id │
│ ---         ┆ ---          ┆ ---          ┆ ---          │
│ i64         ┆ i64          ┆ i64          ┆ i64          │
╞═════════════╪══════════════╪══════════════╪══════════════╡
│ 1           ┆ 3            ┆ 4            ┆ 5            │
│ 2           ┆ 5            ┆ 0            ┆ 0            │
│ 3           ┆ 4            ┆ 5            ┆ 0            │
│ 4           ┆ 5            ┆ 0            ┆ 0            │
│ 5           ┆ 0            ┆ 0            ┆ 0            │
└─────────────┴──────────────┴──────────────┴──────────────┘
Laurent
  • 12,287
  • 7
  • 21
  • 37
1

Here's a double recursive function, rather than an explicit loop.

def hierarchy(df, joinlevel=1, stacklevel=1, sofar=None, sublevel={}):
    dfs=(df
        #make a column denoting the employee isn't anyone's manager
        .with_columns(emponly=pl.col('employee_id').is_in(pl.col('manager_1_id')))
        #sort by that bool column so that the 0 and 1 index are well defined
        .sort('emponly')
        .partition_by('emponly'))
    # when called from scratch sofar is blank otherwise it's carried forward
    if sofar is None:
        sofar=dfs[0].drop('emponly')
    # when the partition doesn't have 2 dataframes that means there are no managers
    if len(dfs)<2:
        if stacklevel+1 in sublevel.keys():
            return pl.concat([sofar,sublevel[stacklevel+1]], how='diagonal') #and sublevels
        else:
            return sofar
    #This is the second level of recursion treating each manager rung of management separately
    if stacklevel+1 not in sublevel.keys():
        sublevel[stacklevel+1]=hierarchy(dfs[1].drop('emponly'), stacklevel=stacklevel+1)
    #This is the first level of recursion pulling each next level of management into more columns
    sofar=sofar.join(
        dfs[1].drop('emponly').rename({'manager_1_id':f'manager_{joinlevel+1}_id'}),
        left_on=f'manager_{joinlevel}_id', right_on='employee_id', how='left')
    #This starts the recursion over in a quasi-loop
    return hierarchy(dfs[1].drop('emponly'),
                    stacklevel=stacklevel,
                    joinlevel=joinlevel+1,
                    sofar=sofar,
                    sublevel=sublevel)

This logic of this is to partition the df into one group of employees who aren't anyone's manager and another group who are managers. It then joins the first group with the second group in order to get the initial employee's manager's manager. It recalls itself in order to join another level of management. It keeps doing that until there are no more managers. There's a separate recursion fork where it treats each mid level manager as a low level employee and repeats the above process.

df=pl.DataFrame({'employee_id':[1,2,3,4,5], 'manager_1_id':[3,5,4,5,None]})
hierarchy(df)

    shape: (5, 5)
┌─────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ employee_id ┆ manager_1_id ┆ manager_2_id ┆ manager_3_id ┆ manager_4_id │
│ ---         ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
│ i64         ┆ i64          ┆ i64          ┆ i64          ┆ i64          │
╞═════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 1           ┆ 3            ┆ 4            ┆ 5            ┆ null         │
│ 2           ┆ 5            ┆ null         ┆ null         ┆ null         │
│ 3           ┆ 4            ┆ 5            ┆ null         ┆ null         │
│ 4           ┆ 5            ┆ null         ┆ null         ┆ null         │
│ 5           ┆ null         ┆ null         ┆ null         ┆ null         │
└─────────────┴──────────────┴──────────────┴──────────────┴──────────────┘

The joinlevel keeps track of the management chain recursion and the stacklevel keeps track of all the times it starts over with a new level of middle management. sofar keeps the joinlevel progress and the sublevel dict keeps the mid level management.

A non-recursive version

df_ans=df
i=1
while df_ans.height>df_ans.select(pl.col(f"manager_{i}_id").null_count()).item():
    df_ans=df_ans.join(df.rename({'manager_1_id':f'manager_{i+1}_id'}), left_on=f"manager_{i}_id", right_on="employee_id", how='left')
    i+=1
df_ans
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • This worked, thank you. It was around 150ms versus the other solution at 485ms, so it is faster as well. I do find it complicated to follow since it is recursive. I tried to break it down into separate functions and I think I will go with it – ldacey May 16 '23 at 17:36
  • @ldacey yes, it was complicated to put together too. You did say "I want to be able to get the manager_id of each manager **recursively**". As to this one's faster time, the other one converts the whole df into a python dict which is very inefficient and non-polarseque. This one never loops row-wise, instead it will loop based on how many levels of management there are. – Dean MacGregor May 16 '23 at 18:40
  • @ldacey I put a non-recursive version at the end, it's much shorter and performs comparably if not better. – Dean MacGregor May 16 '23 at 18:43
  • Nice, that is the fastest solution and is easier to follow – ldacey May 27 '23 at 00:08