0

So I have a pandas dataframe where certain columns have values of type list and a mix of columns of non-numeric and numeric data.

Example data

   dst_address   dst_enforcement   fwd_count ...
1  1.2.3.4       [Any,core]        8
2  3.4.5.6       []                9
3  6.7.8.9       [Any]             10
4  8.10.3.2      [core]            0

So far I've been able to find out which columns are non-numeric by these 2 lines of code

col_groups = df.columns.to_series().groupby(df.dtypes).groups
non_numeric_cols = col_groups[np.dtype('O')]

Of all these non-numeric columns, I need to figure out which ones have list as data type and I want to perform one-hot encoding on all non-numeric columns (including those list type)

EDIT: my expected output for above example would be something like

   1.2.3.4 | 3.4.5.6 | 6.7.8.9 | 8.10.3.2 | empty | Any | core | fwd_count ...
1  1         0         0         0          0       1     1      8
2  0         1         0         0          1       0     0      9
3  0         0         1         0          0       1     0      10
4  0         0         0         1          0       0     1      0    
tlaminator
  • 946
  • 2
  • 9
  • 23

3 Answers3

2

I use 3 steps as follows:

df['dst_enforcement'] = df.dst_enforcement.apply(lambda x: x if x else ['empty'])
dm1 = pd.get_dummies(df[df.columns.difference(['dst_enforcement'])], prefix='', prefix_sep='')
dm2 = df.dst_enforcement.str.join('-').str.get_dummies('-')
pd.concat([dm1, dm2], axis=1)

Out[1221]:
   fwd_count  1.2.3.4  3.4.5.6  6.7.8.9  8.10.3.2  Any  core  empty
1          8        1        0        0         0    1     1      0
2          9        0        1        0         0    0     0      1
3         10        0        0        1         0    1     0      0
4          0        0        0        0         1    0     1      0
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • 1
    This is better than mine. +1 – anky May 25 '19 at 05:46
  • @Andy other than dst_enforcement, there may be other columns with list as value type. Is there a way to check for this condition or do I have to manually repeat this code block for every list-type column? – tlaminator May 28 '19 at 15:38
  • @tlaminator: to find columns having list as value type, just use `applymap` as follows `df.applymap(lambda x: isinstance(x, list)).any()`. Any column contains list will return `True`, others return `False`. Then, just apply logic for columns having list all at once and columns having no list all at others. For detail, I think it deserves a new question. Please ask a new question if you want an expansion of this question – Andy L. May 28 '19 at 17:13
  • @Andy: Thanks, this is what I need. – tlaminator May 28 '19 at 18:37
1

Use unnesting to unnest the lists to seperate roes and call pd.get_dummies():

df_new=unnesting(df,['dst_enforcement']).combine_first(df)
df_new.dst_enforcement=df_new.dst_enforcement.apply(lambda y: 'empty' if len(y)==0 else y)
m=pd.get_dummies(df_new,prefix='',prefix_sep='').groupby('fwd_count').first().reset_index()
print(m)

   fwd_count  1.2.3.4  3.4.5.6  6.7.8.9  8.10.3.2  Any  core  empty
0        0.0        0        0        0         1    0     1      0
1        8.0        1        0        0         0    1     0      0
2        9.0        0        1        0         0    0     0      1
3       10.0        0        0        1         0    1     0      0

Adding the function used for convenience:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
             pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')
anky
  • 74,114
  • 11
  • 41
  • 70
0

Give a go to:

non_numeric_cols = col_groups[np.dtype('O')]

for non in non_numeric_cols:
    print(pd.get_dummies(df[non].apply(pd.Series)))

Output:

   0_1.2.3.4  0_3.4.5.6  0_6.7.8.9  0_8.10.3.2
   0          1          0          0           0
   1          0          1          0           0
   2          0          0          1           0
   3          0          0          0           1

   0_Any  0_core  1_core
   0      1       0       1
   1      0       0       0
   2      1       0       0
   3      0       1       0

When you don't have neither "Any" or "core" the whole row is zeros.

Good luck.

Konstantin Grigorov
  • 1,356
  • 12
  • 20