Creating new columns based on value from another column in pandas

Question

I have this pandas dataframe with column "Code" that contains the sequential hierarchical code. My goal is to create new columns with each hierarchical level code and its name as followed:

Original data:

    Code    Name
0   A       USA
1   AM      Massachusetts
2   AMB     Boston
3   AMS     Springfield
4   D       Germany
5   DB      Brandenburg
6   DBB     Berlin
7   DBD     Dresden

My Goal:

Code    Name           Level1   Level1Name      Level2  Level2Name      Level3      Level3Name
0   A   USA             A           USA          AM     Massachusetts   AMB         Boston
1   AM  Massachusetts   A           USA          AM     Massachusetts   AMB         Boston
2   AMB Boston          A           USA          AM     Massachusetts   AMB         Boston
3   AMS Springfield     A           USA          AM     Massachusetts   AMS         Springfiled
4   D   Germany         D           Germany      DB     Brandenburg     DBB         Berlin
5   DB  Brandenburg     D           Germany      DB     Brandenburg     DBB         Berlin
6   DBB Berlin          D           Germany      DB     Brandenburg     DBB         Berlin
7   DBD Dresden         D           Germany      DB     Brandenburg     DBD         Dresden

My Code:

import pandas as pd
df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')
df['Length']=test.Code.str.len() ## create a column with length of each cell in Code
df['Level1']=test.Code.str[:1]   ## create the first level using string indexing
df['Level1Name'] = df[df['Length']==1]['Name']
df.head() ## This yields:



Code    Name          Length    Level1  Level1Name
0   A       USA             1         A     USA
1   AM      Massachusetts   2         A     NaN
2   AMB     Boston          3         A     NaN
3   AMS     Springfield     3         A     NaN
4   D       Germany         1         D     Germany
5   DB      Brandenburg     2         D     NaN
6   DBB     Berlin          3         D     NaN
7   DBD     Dresden         3         D     NaN

For my current approach, how do I turn those NaN into USA and Germany respectively in Level1Name column?

Generally, is there a better approach to reach my goal of creating columns for each hierarchical layer and match them with their respective name in another column?

To answer your question about turning those NaN to USA and Germany, you can use `ffill` like this `df['Level1Name'].ffill()`. However, I provided a different approach below. — Scott Boston, Dec 26 '19 at 06:08
i am confused why there is no level 4 with all `Springfield` and `Dresden` , can you please explain why — anky, Dec 26 '19 at 07:36
@anky_91: There are no level 4 for Springfield and Dresden because their code length is still 3. This is modeled after an accounting structure, where each additional letter represents a hierarchical layer. And the same layer may have different branches as the case with Dresden and Springfield -- Hope that answers your question — BoBoMann, Dec 26 '19 at 13:17

Scott Boston · Accepted Answer · 2019-12-26T06:04:07.190

2

IIUC, let's use this code:

df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').bfill().cumsum(axis=1)
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
df_level_names =  pd.concat([df_level[i].map(s_map) for i in df_level.columns], 
                            axis=1, 
                            keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out

Output:

  Code           Name Level1 Level2 Level3 Level1Name     Level2Name   Level3Name
0    A            USA      A     AM    AMB        USA  Massachusetts       Boston
1   AM  Massachusetts      A     AM    AMB        USA  Massachusetts       Boston
2  AMB         Boston      A     AM    AMB        USA  Massachusetts       Boston
3  AMS    Springfield      A     AM    AMS        USA  Massachusetts  Springfield
4    D        Germany      D     DB    DBB    Germany    Brandenburg       Berlin
5   DB    Brandenburg      D     DB    DBB    Germany    Brandenburg       Berlin
6  DBB         Berlin      D     DB    DBB    Germany    Brandenburg       Berlin
7  DBD        Dresden      D     DB    DBD    Germany    Brandenburg      Dresden

Explained:

Unpack string into a list of characters creating 'Codes' column
Create 'LevelX' columns using extractall and regex . to get a single character, then bfill NaN above and cumsum along rows to create 'LevelX' columns
Create a pd.Series to use with map by calling explode on 'Codes' column create above and drop_duplicates keep the last value of 'Code' and then set_index on 'Codes' and keep 'Name' column to create 's_map'.
Rename name df_level columns to get Level1 instead of Level0.
Use pd.concat with list comprehension to map df_level columns to df_level_names using s_map. Also, using keys parameter to rename new columns and appending 'Name'
Use join to join df with df_levels and df_level_names, then drop the 'Codes' column, creating the desired output.

edited Dec 26 '19 at 06:04

answered Dec 26 '19 at 05:37

Scott Boston

147,308
15
139
187

@BoBoMann did this answer help you? – Scott Boston Dec 26 '19 at 16:30
1

@anky_91 Yeah, so much depends on the time of day you answer questions. I just enjoy working on interesting problems. – Scott Boston Dec 26 '19 at 21:39
1

@Scott Boston: Sorry it took me a bit to respond. At first, I didn't know why the explode function didn't work and took me quite some time to realize that I was using pandas version 0.22 instead of 0.25. Once I updated my jupyter and pandas, your code work perfectly. Thank you so much! – BoBoMann Dec 26 '19 at 23:20
1

@Scott Boston: By the way, is there a way to arrange the column names in df_out as stated in "My Goal" section, in the order of Level1,Level1Name, Level2, Level2Name and so on. This data is modeled after an accounting structure , which could have up to 20 levels. Thank you! – BoBoMann Dec 26 '19 at 23:36
1

Sure. use sort_index with axis=1 it will sort them alpabetically. You move all the columns you don't wanted sorted into the index, then sort_index with axis=1 basically sorting the columns, then use reset_index to bring those other columns back into the dataframe from the index. – Scott Boston Dec 26 '19 at 23:42
@Scott Boston: Thank you so much! and Sorry I am new here, will definitely accept the answer! – BoBoMann Dec 26 '19 at 23:48
@BoBoMann Thank you. Happy Coding. – Scott Boston Dec 26 '19 at 23:49
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204911/discussion-between-bobomann-and-scott-boston). – BoBoMann Dec 26 '19 at 23:50
I looked at the code again and want to ask at s_map step, why do we need to explode? How is that different than s_map = df.set_index('Code')['Name']? --Thank you! – BoBoMann Dec 27 '19 at 20:02

Creating new columns based on value from another column in pandas

1 Answers1

Explained:

Linked