is there a way to "unstack" a dataframe and return as a list value

Question

I have a dataframe looks like this:

import pandas as pd

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

I wanna create a new column based on those 4 columns, it will return the column names whenever the value in those 4 columns equals to 1, if there are multiple columns equal to 1 at the same time then it will return the list of those columns names, otherwise it will be nan.

The output dataframe will look like this:

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                   'type':[['type_a','type_d'], 'type_b', 'type_c', 'type_c','type_c', ['type_a','type_c','type_d'], 'type_d', 'nan', ['type_b','type_d'],['type_a','type_b']]
                  })

Any help will be really appreciated. Thanks!

@rpanai please don't spam every answer with your comment. once is fine. — Michael Delgado, Sep 29 '22 at 00:55
@MichaelDelgado you are right. But what is the best way here to let other people who answered here about the timing I added? I thought it could be interesting as comparative. — rpanai, Sep 29 '22 at 01:00

Anoushiravan R · Accepted Answer · 2022-09-29T08:36:03.187

This is also another way:

import pandas as pd

df['type'] = (pd.melt(df.reset_index(), id_vars='index')
 .query('value == 1')
 .groupby('index')['variable']
 .apply(list))


   type_a  type_b  type_c  type_d                      type
0       1       0       0       1          [type_a, type_d]
1       0       1       0       0                  [type_b]
2       0       0       1       0                  [type_c]
3       0       0       1       0                  [type_c]
4       0       0       1       0                  [type_c]
5       1       0       1       1  [type_a, type_c, type_d]
6       0       0       0       1                  [type_d]
7       0       0       0       0                       NaN
8       0       1       0       1          [type_b, type_d]
9       1       1       0       0          [type_a, type_b]

rpanai · Answer 2 · 2022-09-29T00:47:43.587

You can use this answer and adapt to your case.

import pandas as pd
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

df['type'] = df.dot(df.columns + ',')\
    .str.rstrip(',')\
    .apply(lambda x: x.split(','))

Where the output is

   type_a  type_b  type_c  type_d                      type
0       1       0       0       1          [type_a, type_d]
1       0       1       0       0                  [type_b]
2       0       0       1       0                  [type_c]
3       0       0       1       0                  [type_c]
4       0       0       1       0                  [type_c]
5       1       0       1       1  [type_a, type_c, type_d]
6       0       0       0       1                  [type_d]
7       0       0       0       0                        []
8       0       1       0       1          [type_b, type_d]
9       1       1       0       0          [type_a, type_b]

Edit 1

The general case will be

df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .apply(lambda x: x.split(','))

Edit 2

Eventually you can avoid lambda (in case your dataframe is big)

df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .str.split(',')

Edit 3: TIMING

Here I want to compare few solutions proposed here.

Generate Data

import pandas as pd
import numpy as np

n = 10_000
columns = ['type_a', 'type_b', 'type_c', 'type_d']
# set seed for reproducibility
np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(2, size=(n, 4)),
                     columns=columns)
# save copy of original data
df_bk = df.copy()

Test load the data

As we are going to load the data using timeit we want to know how long it takes.

%%timeit -n 10 -r 10
df = df_bk.copy()

142 µs ± 40.4 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

@bitflip's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.apply(lambda x: 
                df.columns[x.eq(1)].tolist(), axis=1)

782 ms ± 33.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@Naveed's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.mul(df.columns)\
    .apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)

619 ms ± 22.1 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@Anoushiravan R's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = (pd.melt(df.reset_index(), id_vars='index')
 .query('value == 1')
 .groupby('index')['variable']
 .apply(lambda x:[str for str in x]))

148 ms ± 12.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@rpanai's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .str.split(',')

13 ms ± 2.61 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Conclusion

As you can see from the following image (please click it to expand) the accepted solution is ways faster than others. Yet the vectorial solution suggested here manages to be 11x faster than the accepted solution.

Glad to be the runner-up lol. It is not too bad. It would've been even better if we could check this on a larger data set. Because some of the solutions I come across on stack may be quite inefficient for huge data sets as the goal is to meet the output of a sample data set. Thank you very much for the benchmark. — Anoushiravan R, Sep 29 '22 at 08:40
@AnoushiravanR I tested with 1M and it looks like the difference remains the same. 10.2 s vs 825 ms. — rpanai, Sep 29 '22 at 12:25

score 0 · Answer 3 · answered Sep 28 '22 at 21:25

here is one more way

df.mul(df.columns).apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)

0            [type_a, type_d]
1                    [type_b]
2                    [type_c]
3                    [type_c]
4                    [type_c]
5    [type_a, type_c, type_d]
6                    [type_d]
7                          []
8            [type_b, type_d]
9            [type_a, type_b]

score 0 · Answer 4 · answered Sep 28 '22 at 22:07

And another one:

import pandas as pd
import numpy as np

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

df['type']=''

for i,r in df.iterrows():
    t=[k for k in r.keys() if r[k]==1]
    if t:
        if len(t)==1:
            df.at[i,'type']=t[0]
        else:
            df.at[i,'type']=t
    else:
        df.at[i,'type']=np.nan