2

I have a DataFrame with 1mln of rows and two columns Type and Name whose values are a lists with non-unique values. Both Type and Name columns have the same number of elements because they form a pair (Type, Name). I would like to add to my DataFrame columns whose names are the unique types from Type column with the values being a list of corresponding values from Name columns. Below is a short example of the current code. It works but very slow when the number of rows is 1mln so I'm looking for a faster solution.

import pandas as pd
df = pd.DataFrame({"Type": [["1", "1", "2", "3"], ["2","3"]], "Name": [["A", "B", "C", "D"], ["E", "F"]]})

unique = list(set(df["Type"].explode()))
for t in unique:
    df[t] = None
    df[t] = df[t].astype('object')

for idx, row in df.iterrows():
    for t in unique:
        df.at[idx, t] = [row["Name"][i] for i in range(len(row["Name"])) if row["Type"][i] == t]

My desired result is: enter image description here

Jason
  • 313
  • 2
  • 8

2 Answers2

3

You can explode the whole dataframe and then use your exploded dataframe and pivot it with a list as the aggfunc (Resetting the index to use the index as the grouper for the pivot)

df.explode(column=['Type','Name']).reset_index().pivot_table(index='index',columns='Type', values='Name',aggfunc=list)

Type    1       2   3
index           
0       [A, B]  [C] [D]
1       NaN     [E] [F]

And then concat it back onto the original

pd.concat([df,df.explode(column=['Type','Name']).reset_index().pivot_table(index='index',columns='Type', values='Name',aggfunc=list)],axis=1)

    Type            Name            1       2   3
0   [1, 1, 2, 3]    [A, B, C, D]    [A, B]  [C] [D]
1   [2, 3]          [E, F]          NaN     [E] [F]

As requested, here is the code broken out by step for debugging purposes

df1=df.explode(column=['Type','Name'])
df1=df1.reset_index()
pvt=df1.pivot_table(index='index',columns='Type', values='Name',aggfunc=list)
pd.concat([df,pvt],axis=1)
G. Anderson
  • 5,815
  • 2
  • 14
  • 21
  • I'm getting "ValueError: column must be a scalar" – Jason Jun 30 '22 at 17:52
  • I'm not able to reproduce. The data you provided in the question produces the desired result. At what step do you get that error? You can break the first step in my code into multiple steps, first explode, then reset index, then pivot it and see where the issue occurs – G. Anderson Jun 30 '22 at 17:57
  • Can you add the entire code? The explode() function returns this error – Jason Jun 30 '22 at 18:12
  • 1
    According to [this question](https://stackoverflow.com/questions/61313268/pandas-explode-error-column-must-be-scalar) you may need to update your pandas version to allow you to pass a list of columns to explode – G. Anderson Jun 30 '22 at 18:18
2

Alternative solution:

df = pd.DataFrame({"Type": [["1", "1", "2", "3"], ["2","3"]], "Name": [["A", "B", "C", "D"], ["E", "F"]]})
df_conc = pd.concat([df, df.apply(pd.Series.explode).reset_index().groupby(['index', 'Type']).agg(list).unstack().droplevel(level=0, axis=1).fillna("").apply(list)], axis=1)
df_conc

----------------------------------------------
    Type           Name         1       2   3
0   [1, 1, 2, 3]   [A, B, C, D] [A, B]  [C] [D]
1   [2, 3]         [E, F]               [E] [F]
----------------------------------------------

If nan values are accepted, just remove .fillna("").apply(list):

df_conc = pd.concat([df, df.apply(pd.Series.explode).reset_index().groupby(['index', 'Type']).agg(list).unstack().droplevel(level=0, axis=1)], axis=1)
df_conc

----------------------------------------------
    Type           Name         1       2   3
0   [1, 1, 2, 3]   [A, B, C, D] [A, B]  [C] [D]
1   [2, 3]         [E, F]       NaN     [E] [F]
----------------------------------------------
ko3
  • 1,757
  • 5
  • 13