3

I have a dataframe looks like this:

import pandas as pd

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

I wanna create a new column based on those 4 columns, it will return the column names whenever the value in those 4 columns equals to 1, if there are multiple columns equal to 1 at the same time then it will return the list of those columns names, otherwise it will be nan.

The output dataframe will look like this:

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                   'type':[['type_a','type_d'], 'type_b', 'type_c', 'type_c','type_c', ['type_a','type_c','type_d'], 'type_d', 'nan', ['type_b','type_d'],['type_a','type_b']]
                  })

Any help will be really appreciated. Thanks!

rpanai
  • 12,515
  • 2
  • 42
  • 64
Ali
  • 113
  • 4

4 Answers4

4

This is also another way:

import pandas as pd

df['type'] = (pd.melt(df.reset_index(), id_vars='index')
 .query('value == 1')
 .groupby('index')['variable']
 .apply(list))


   type_a  type_b  type_c  type_d                      type
0       1       0       0       1          [type_a, type_d]
1       0       1       0       0                  [type_b]
2       0       0       1       0                  [type_c]
3       0       0       1       0                  [type_c]
4       0       0       1       0                  [type_c]
5       1       0       1       1  [type_a, type_c, type_d]
6       0       0       0       1                  [type_d]
7       0       0       0       0                       NaN
8       0       1       0       1          [type_b, type_d]
9       1       1       0       0          [type_a, type_b]
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
2

You can use this answer and adapt to your case.

import pandas as pd
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

df['type'] = df.dot(df.columns + ',')\
    .str.rstrip(',')\
    .apply(lambda x: x.split(','))

Where the output is

   type_a  type_b  type_c  type_d                      type
0       1       0       0       1          [type_a, type_d]
1       0       1       0       0                  [type_b]
2       0       0       1       0                  [type_c]
3       0       0       1       0                  [type_c]
4       0       0       1       0                  [type_c]
5       1       0       1       1  [type_a, type_c, type_d]
6       0       0       0       1                  [type_d]
7       0       0       0       0                        []
8       0       1       0       1          [type_b, type_d]
9       1       1       0       0          [type_a, type_b]

Edit 1

The general case will be

df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .apply(lambda x: x.split(','))

Edit 2

Eventually you can avoid lambda (in case your dataframe is big)

df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .str.split(',')

Edit 3: TIMING

Here I want to compare few solutions proposed here.

Generate Data

import pandas as pd
import numpy as np

n = 10_000
columns = ['type_a', 'type_b', 'type_c', 'type_d']
# set seed for reproducibility
np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(2, size=(n, 4)),
                     columns=columns)
# save copy of original data
df_bk = df.copy()

Test load the data

As we are going to load the data using timeit we want to know how long it takes.

%%timeit -n 10 -r 10
df = df_bk.copy()
142 µs ± 40.4 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

@bitflip's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.apply(lambda x: 
                df.columns[x.eq(1)].tolist(), axis=1)
782 ms ± 33.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@Naveed's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.mul(df.columns)\
    .apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)
619 ms ± 22.1 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@Anoushiravan R's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = (pd.melt(df.reset_index(), id_vars='index')
 .query('value == 1')
 .groupby('index')['variable']
 .apply(lambda x:[str for str in x]))
148 ms ± 12.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

@rpanai's solution

%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.eq(1).dot(df.columns + ',')\
    .str.rstrip(',')\
    .str.split(',')
13 ms ± 2.61 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Conclusion

As you can see from the following image (please click it to expand) the accepted solution is ways faster than others. Yet the vectorial solution suggested here manages to be 11x faster than the accepted solution.

enter image description here

rpanai
  • 12,515
  • 2
  • 42
  • 64
  • 1
    Glad to be the runner-up lol. It is not too bad. It would've been even better if we could check this on a larger data set. Because some of the solutions I come across on stack may be quite inefficient for huge data sets as the goal is to meet the output of a sample data set. Thank you very much for the benchmark. – Anoushiravan R Sep 29 '22 at 08:40
  • 1
    @AnoushiravanR I tested with 1M and it looks like the difference remains the same. 10.2 s vs 825 ms. – rpanai Sep 29 '22 at 12:25
0

here is one more way

df.mul(df.columns).apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)
0            [type_a, type_d]
1                    [type_b]
2                    [type_c]
3                    [type_c]
4                    [type_c]
5    [type_a, type_c, type_d]
6                    [type_d]
7                          []
8            [type_b, type_d]
9            [type_a, type_b]
Naveed
  • 11,495
  • 2
  • 14
  • 21
0

And another one:

import pandas as pd
import numpy as np

df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
                   'type_b': [0,1,0,0,0,0,0,0,1,1],
                   'type_c': [0,0,1,1,1,1,0,0,0,0],
                   'type_d': [1,0,0,0,0,1,1,0,1,0],
                  })

df['type']=''

for i,r in df.iterrows():
    t=[k for k in r.keys() if r[k]==1]
    if t:
        if len(t)==1:
            df.at[i,'type']=t[0]
        else:
            df.at[i,'type']=t
    else:
        df.at[i,'type']=np.nan