I have created a multi-hierarchical index from frames that have been indexed by time:

original_thing
 time                day_1  day_2  day_3 day_4
 2018-05-24 20:00:00  0     0      1     0
 2018-05-25 00:00:00  0     0      0     1
 2018-05-25 04:00:00  0     0      0     1
 2018-05-25 08:00:00  0     0      0     1

resampled and aggregated the info as different objects and packed them into a list

 DF_list = [original_thing, resampled_1, resampled_2]

using pandas concat with code that looks mostly like this:

thisthing = pandas.concat(DF_list, keys=range(len(DF_list)), names=['one','time'], sort=True)

to get a Dataframe that looks like:

one  time                   day_1    day_2    day_3    day_4
 2    2018-05-24 00:00:00    0        0        1        0
 1    2018-05-24 12:00:00    0        0        1        0
 0    2018-05-24 20:00:00    0        0        1        0
 0    2018-05-25 00:00:00    0        0        0        1
 1    2018-05-25 00:00:00    0        0        0        1
 2    2018-05-25 00:00:00    0        0        0        1
 0    2018-05-25 04:00:00    0        0        0        1
 0    2018-05-25 08:00:00    0        0        0        1

I would like to take the index 'one' and get:

one  time                   id_1  id_2  id_3 day_...    
 2    2018-05-24 00:00:00    0     0     1    0
 1    2018-05-24 12:00:00    0     1     0    0
 0    2018-05-24 20:00:00    1     0     0    0
 0    2018-05-25 00:00:00    1     0     0    1
 1    2018-05-25 00:00:00    0     1     0    1
 2    2018-05-25 00:00:00    0     0     1    1
 0    2018-05-25 04:00:00    1     0     0    1
 0    2018-05-25 08:00:00    1     0     0    1

where id_'#' are the encoded indexes from 'one'

I've tried to encode it with:

conc_ohlc_dummies= pandas.get_dummies(conc_ohlc['one'], prefix= 'hours')

but am getting this error:

    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'one'

I have also tried to reindex it to eliminate the index values. Is there any way other than writing to csv and reopening to do this?

thanks all

2 Answers

You can use OneHotEncoder from sklearn.

Let's start with some boilerplate code:

 import pandas as pd
 import numpy as np
 from sklearn.preprocessing import OneHotEncoder
 df = pd.DataFrame({"one":[2,1,0,0,1,2], "abcd":[4,6,3,6,7,1]})
 print(df)

   one  abcd
0    2     4
1    1     6
2    0     3
3    0     6
4    1     7
5    2     1

Now you can fit the one hot encoder object with these values ...

ohe = OneHotEncoder()
ohe.fit( df.one.values.reshape(-1, 1) )
vals = ohe.transform( df.one.values.reshape(-1, 1) ).toarray()
print(vals)

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Now just insert them into the data frame:

for i in range(vals.shape[1]):
    df['id_{}'.format(i)] = vals[:, i]

The final data frame should look like:

   one  abcd  id_0  id_1  id_2
0    2     4   0.0   0.0   1.0
1    1     6   0.0   1.0   0.0
2    0     3   1.0   0.0   0.0
3    0     6   1.0   0.0   0.0
4    1     7   0.0   1.0   0.0
5    2     1   0.0   0.0   1.0
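
If integer columns are preferred over the floats above, roughly the same result can be sketched with pandas alone. This is only an illustration on the same toy frame; the `id` prefix and the `encoded` name are my own choices, not something from the question.

```python
import pandas as pd

# Same toy frame as in the boilerplate above.
df = pd.DataFrame({"one": [2, 1, 0, 0, 1, 2], "abcd": [4, 6, 3, 6, 7, 1]})

# get_dummies builds one indicator column per distinct value of 'one';
# dtype=int gives 0/1 integers instead of floats or booleans.
dummies = pd.get_dummies(df["one"], prefix="id", dtype=int)

# The dummies share df's row index, so join lines them up row by row.
encoded = df.join(dummies)
print(encoded)
```

This only works as written when 'one' is an ordinary column, which is not the case in the question until the index has been reset.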

Initially I tried to use .reindex() to strip the indexing from the dataframes, but it was .reset_index() that worked. The KeyError came from 'one' being an index level rather than a column; once the index levels were ordinary columns again, pandas.get_dummies() encoded 'one' and .merge() added the result back to the frame. I then had to set the index again, and sorted for good measure:

    thisthing= thisthing.reset_index()
    thisthing_dummies= pandas.get_dummies(thisthing['one'], prefix='hours', drop_first=True)
    thisthing= thisthing.merge(thisthing_dummies, left_index=True, right_index=True)
    thisthing= thisthing.set_index(['time','one'])
    thisthing.sort_values(by=['time', 'one'],inplace=True)
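
For what it's worth, the reset/merge/set_index round trip can probably be avoided by reading the values straight off the index level. The sketch below uses two small made-up frames (`frame_a`, `frame_b`) in place of the real `DF_list`, and an `id` prefix chosen to match the desired output in the question; treat it as an assumption-laden illustration rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in DF_list.
idx_a = pd.to_datetime(["2018-05-24 20:00:00", "2018-05-25 00:00:00"])
idx_b = pd.to_datetime(["2018-05-24 12:00:00", "2018-05-25 00:00:00"])
frame_a = pd.DataFrame({"day_3": [1, 0], "day_4": [0, 1]}, index=idx_a)
frame_b = pd.DataFrame({"day_3": [1, 0], "day_4": [0, 1]}, index=idx_b)

combined = pd.concat([frame_a, frame_b], keys=range(2), names=["one", "time"])

# 'one' is an index level, so combined['one'] raises KeyError.
# get_level_values returns that level's label for every row instead.
level_values = combined.index.get_level_values("one")

# Encode the labels, then give the dummies the same MultiIndex as the data.
dummies = pd.get_dummies(level_values, prefix="id", dtype=int)
dummies.index = combined.index

result = pd.concat([dummies, combined], axis=1)
print(result)
```

Note that drop_first=True is left out here, so every value of 'one' gets its own column, as in the table the question asks for.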