3

I have this pandas dataframe with column "Code" that contains the sequential hierarchical code. My goal is to create new columns with each hierarchical level code and its name as followed:

Original data:

    Code    Name
0   A       USA
1   AM      Massachusetts
2   AMB     Boston
3   AMS     Springfield
4   D       Germany
5   DB      Brandenburg
6   DBB     Berlin
7   DBD     Dresden

My Goal:

Code    Name           Level1   Level1Name      Level2  Level2Name      Level3      Level3Name
0   A   USA             A           USA          AM     Massachusetts   AMB         Boston
1   AM  Massachusetts   A           USA          AM     Massachusetts   AMB         Boston
2   AMB Boston          A           USA          AM     Massachusetts   AMB         Boston
3   AMS Springfield     A           USA          AM     Massachusetts   AMS         Springfiled
4   D   Germany         D           Germany      DB     Brandenburg     DBB         Berlin
5   DB  Brandenburg     D           Germany      DB     Brandenburg     DBB         Berlin
6   DBB Berlin          D           Germany      DB     Brandenburg     DBB         Berlin
7   DBD Dresden         D           Germany      DB     Brandenburg     DBD         Dresden

My Code:

import pandas as pd
df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')
df['Length']=test.Code.str.len() ## create a column with length of each cell in Code
df['Level1']=test.Code.str[:1]   ## create the first level using string indexing
df['Level1Name'] = df[df['Length']==1]['Name']
df.head() ## This yields:



Code    Name          Length    Level1  Level1Name
0   A       USA             1         A     USA
1   AM      Massachusetts   2         A     NaN
2   AMB     Boston          3         A     NaN
3   AMS     Springfield     3         A     NaN
4   D       Germany         1         D     Germany
5   DB      Brandenburg     2         D     NaN
6   DBB     Berlin          3         D     NaN
7   DBD     Dresden         3         D     NaN

For my current approach, how do I turn those NaN into USA and Germany respectively in Level1Name column?

Generally, is there a better approach to reach my goal of creating columns for each hierarchical layer and match them with their respective name in another column?

Dharman
  • 30,962
  • 25
  • 85
  • 135
BoBoMann
  • 101
  • 1
  • 7
  • To answer your question about turning those NaN to USA and Germany, you can use `ffill` like this `df['Level1Name'].ffill()`. However, I provided a different approach below. – Scott Boston Dec 26 '19 at 06:08
  • i am confused why there is no level 4 with all `Springfield` and `Dresden` , can you please explain why – anky Dec 26 '19 at 07:36
  • 1
    @anky_91: There are no level 4 for Springfield and Dresden because their code length is still 3. This is modeled after an accounting structure, where each additional letter represents a hierarchical layer. And the same layer may have different branches as the case with Dresden and Springfield -- Hope that answers your question – BoBoMann Dec 26 '19 at 13:17

1 Answers1

2

IIUC, let's use this code:

df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').bfill().cumsum(axis=1)
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
df_level_names =  pd.concat([df_level[i].map(s_map) for i in df_level.columns], 
                            axis=1, 
                            keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out

Output:

  Code           Name Level1 Level2 Level3 Level1Name     Level2Name   Level3Name
0    A            USA      A     AM    AMB        USA  Massachusetts       Boston
1   AM  Massachusetts      A     AM    AMB        USA  Massachusetts       Boston
2  AMB         Boston      A     AM    AMB        USA  Massachusetts       Boston
3  AMS    Springfield      A     AM    AMS        USA  Massachusetts  Springfield
4    D        Germany      D     DB    DBB    Germany    Brandenburg       Berlin
5   DB    Brandenburg      D     DB    DBB    Germany    Brandenburg       Berlin
6  DBB         Berlin      D     DB    DBB    Germany    Brandenburg       Berlin
7  DBD        Dresden      D     DB    DBD    Germany    Brandenburg      Dresden

Explained:

  • Unpack string into a list of characters creating 'Codes' column
  • Create 'LevelX' columns using extractall and regex . to get a single character, then bfill NaN above and cumsum along rows to create 'LevelX' columns
  • Create a pd.Series to use with map by calling explode on 'Codes' column create above and drop_duplicates keep the last value of 'Code' and then set_index on 'Codes' and keep 'Name' column to create 's_map'.
  • Rename name df_level columns to get Level1 instead of Level0.
  • Use pd.concat with list comprehension to map df_level columns to df_level_names using s_map. Also, using keys parameter to rename new columns and appending 'Name'
  • Use join to join df with df_levels and df_level_names, then drop the 'Codes' column, creating the desired output.
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • @BoBoMann did this answer help you? – Scott Boston Dec 26 '19 at 16:30
  • 1
    @anky_91 Yeah, so much depends on the time of day you answer questions. I just enjoy working on interesting problems. – Scott Boston Dec 26 '19 at 21:39
  • 1
    @Scott Boston: Sorry it took me a bit to respond. At first, I didn't know why the explode function didn't work and took me quite some time to realize that I was using pandas version 0.22 instead of 0.25. Once I updated my jupyter and pandas, your code work perfectly. Thank you so much! – BoBoMann Dec 26 '19 at 23:20
  • 1
    @Scott Boston: By the way, is there a way to arrange the column names in df_out as stated in "My Goal" section, in the order of Level1,Level1Name, Level2, Level2Name and so on. This data is modeled after an accounting structure , which could have up to 20 levels. Thank you! – BoBoMann Dec 26 '19 at 23:36
  • 1
    Sure. use sort_index with axis=1 it will sort them alpabetically. You move all the columns you don't wanted sorted into the index, then sort_index with axis=1 basically sorting the columns, then use reset_index to bring those other columns back into the dataframe from the index. – Scott Boston Dec 26 '19 at 23:42
  • @Scott Boston: Thank you so much! and Sorry I am new here, will definitely accept the answer! – BoBoMann Dec 26 '19 at 23:48
  • @BoBoMann Thank you. Happy Coding. – Scott Boston Dec 26 '19 at 23:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204911/discussion-between-bobomann-and-scott-boston). – BoBoMann Dec 26 '19 at 23:50
  • I looked at the code again and want to ask at s_map step, why do we need to explode? How is that different than s_map = df.set_index('Code')['Name']? --Thank you! – BoBoMann Dec 27 '19 at 20:02