
I have a dataframe with event data. It has two columns, primary and secondary, and each one contains a list of tags per event (e.g., ['Fun event', 'Dance party']). A third column, combined, holds both lists concatenated.

primary               secondary               combined
['booze', 'party']    ['singing', 'dance']    ['booze', 'party', 'singing', 'dance']
['concert']           ['booze', 'vocals']     ['concert', 'booze', 'vocals']

I want to dummy code the data so that each tag gets its own column: a tag found in the row's primary list is coded 1, a tag found in the row's secondary list is coded .5, and a tag that does not appear in the row is coded 0. Like so:

combined                                  booze  party  singing  dance  concert  vocals
['booze', 'party', 'singing', 'dance']    1      1      .5       .5     0        0
['concert', 'booze', 'vocals']            .5     0      0        0      1        .5
Daniel

2 Answers


Here's one approach that works by transforming the primary and secondary columns' values into columns on the dataframe:

import pandas as pd

df = pd.DataFrame({
    'primary': [['booze', 'party'], ['concert']],
    'secondary': [['singing', 'dance'], ['booze', 'vocals']],
})

# create primary and secondary indicator columns (NaN where a tag is absent)
iprim = df.primary.apply(lambda v: pd.Series([1] * len(v), index=v))
isec = df.secondary.apply(lambda v: pd.Series([.5] * len(v), index=v))

# join with primary, then add secondary; overlapping names get a '_' suffix
df = df.join(iprim).join(isec, rsuffix='_')
# drop the suffixed duplicates, then overwrite from the secondary values
df.drop([c for c in df.columns if c.endswith('_')], axis=1, inplace=True)
df.update(isec)
# fillna returns a new frame, so assign the result
df = df.fillna(0)
df

=>

          primary         secondary  booze  concert  party  dance  singing  vocals
0  [booze, party]  [singing, dance]    1.0      0.0    1.0    0.5      0.5     0.0
1       [concert]   [booze, vocals]    0.5      1.0    0.0    0.0      0.0     0.5

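Note that the second .join() uses rsuffix so that secondary tags that already exist as primary columns (here booze) are added under a suffixed name instead of failing on the overlapping column names. .drop() then removes those suffixed columns, and .update() overwrites the primary columns with the non-missing secondary values. As written, a tag that is both primary and secondary in the same row therefore ends up as .5. To prefer primary over secondary, rearrange the steps: join isec first, join iprim with the suffix, and update from iprim.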
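The rearrangement mentioned above might look like the following. This is only a sketch, not the answer's original code: the sample data is hypothetical (it puts 'vocals' in both lists of the second row to make the difference visible), and the `indicators` helper is an illustrative name that builds the indicator frames from plain dicts rather than from `apply`.

```python
import pandas as pd

# Hypothetical data: 'vocals' is both primary and secondary in row 1.
df = pd.DataFrame({
    'primary': [['booze', 'party'], ['concert', 'vocals']],
    'secondary': [['singing', 'dance'], ['booze', 'vocals']],
})

def indicators(tag_lists, value):
    """One row per event, one column per tag; NaN where the tag is absent."""
    rows = [dict.fromkeys(tags, value) for tags in tag_lists]
    return pd.DataFrame(rows, index=tag_lists.index)

iprim = indicators(df['primary'], 1.0)
isec = indicators(df['secondary'], 0.5)

# Start from the secondary columns, add primary under a '_' suffix where
# names overlap, drop the suffixed copies, then overwrite with primary values.
out = df.join(isec).join(iprim, rsuffix='_')
out = out.drop(columns=[c for c in out.columns if c.endswith('_')])
out.update(iprim)
out = out.fillna(0)
```

With this order, 'vocals' in the second row is coded 1 rather than .5, while 'booze' in that row (secondary only) stays at .5.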

miraculixx
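# one indicator column per tag in combined, summed back to one row per event
df1 = pd.get_dummies(df.combined.explode(), dtype=float).groupby(level=0).sum()
# subtract 0.5 wherever the column's tag appears in that row's secondary list
in_secondary = df1.apply(lambda col: df.secondary.apply(lambda tags: col.name in tags))
df1[in_secondary] -= 0.5

df1
Out[173]: 
   booze  concert  dance  party  singing  vocals
0    1.0      0.0    0.5    1.0      0.5     0.0
1    0.5      1.0    0.0    0.0      0.0     0.5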

Data input:

import pandas as pd

df = pd.DataFrame({'primary': [['booze', 'party'], ['concert']],
                   'secondary': [['singing', 'dance'], ['booze', 'vocals']],
                   'combined': [['booze', 'party', 'singing', 'dance'], ['concert', 'booze', 'vocals']]})
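For readers who find the compact version hard to follow, the same encoding can be spelled out in long form. This is a rough sketch of a different route, not the code above: it assumes a pandas version that has `Series.explode`, the names `long`, `wide`, `event`, `tag`, and `weight` are made up for illustration, and taking the maximum weight is an assumed rule for a tag that appears in both lists.

```python
import pandas as pd

df = pd.DataFrame({
    'primary': [['booze', 'party'], ['concert']],
    'secondary': [['singing', 'dance'], ['booze', 'vocals']],
})

# One row per (event, tag) pair, weighted by the column the tag came from.
long = pd.concat([
    df['primary'].explode().to_frame('tag').assign(weight=1.0),
    df['secondary'].explode().to_frame('tag').assign(weight=0.5),
])
long = long.rename_axis('event').reset_index()

# Back to wide form: events as rows, tags as columns, 0 where a tag is absent.
# If a tag is both primary and secondary for an event, max() keeps the 1.
wide = long.groupby(['event', 'tag'])['weight'].max().unstack(fill_value=0)
```

The result has one row per event and one alphabetically ordered column per tag, and it does not need the combined column at all.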
BENY