- I have a dictionary of large MultiIndex series in which both index levels hold datetime values. A short abstract example of one of them:
import pandas as pd

# Two month-end date ranges serve as the two index levels
idx_level_0 = pd.date_range('2020-01-01', '2020-04-01', freq='M')
idx_level_1 = pd.date_range('2020-04-01', '2020-07-01', freq='M')
idx_dates = pd.MultiIndex.from_product([idx_level_0, idx_level_1], names=['Event_Date', 'Observation_Date'])
ser_info_dated = pd.Series(range(len(idx_level_0) * len(idx_level_1)), index=idx_dates, name='Some_Values') / 33
- I need to save all the data, so I write each series separately to a common HDF5 file, using its dictionary key as the HDF key. Saved as is, the file is about 4 GB, so I am trying to make it smaller. Later I also need to process all the series across their indexes, so I need a global way to identify dates. My idea was to build a common collection of the dates from both levels of all the series (about 11,000 unique dates) and replace each date with a unique integer identifier, so that the original index of every series can be recovered. But this only makes sense if I can store the identifiers as int16. So I tried the following sequence (simplified here to a single series):
# Map every unique date from both levels to an integer identifier
list_levels_dates = sorted(set(idx_level_0) | set(idx_level_1))
dict_to_numbers = dict(zip(list_levels_dates, range(len(list_levels_dates))))
# Replace the dates with their identifiers in both columns
df_info_numbered = ser_info_dated.reset_index().replace({'Event_Date': dict_to_numbers, 'Observation_Date': dict_to_numbers})
# Downcast the identifier columns to int16
df_info_downcasted = df_info_numbered.copy()
df_info_downcasted[['Event_Date', 'Observation_Date']] = df_info_downcasted[['Event_Date', 'Observation_Date']].astype('int16')
It seems to have worked:
print('df_info_downcasted column types:\n', df_info_downcasted.dtypes)
gives this result:
df_info_downcasted column types:
Event_Date int16
Observation_Date int16
Some_Values float64
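For reference, the round trip behind this idea (dates to identifiers and back) might look like the minimal sketch below. It assumes month-end dates like those in the example, and it uses `get_indexer` for the encoding and positional indexing for the decoding instead of `replace`. The names `all_dates`, `df_codes` and `ser_restored` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two index levels (month-end dates, as in the example)
event_dates = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])
obs_dates = pd.to_datetime(['2020-04-30', '2020-05-31', '2020-06-30'])
idx = pd.MultiIndex.from_product([event_dates, obs_dates],
                                 names=['Event_Date', 'Observation_Date'])
ser = pd.Series(np.arange(len(idx)) / 33, index=idx, name='Some_Values')

# Global sorted lookup table: a date's position in it is its identifier
all_dates = pd.DatetimeIndex(sorted(set(event_dates) | set(obs_dates)))

# Encode: replace each date with its position in the lookup table, stored as int16
df_codes = ser.reset_index()
for col in ['Event_Date', 'Observation_Date']:
    positions = all_dates.get_indexer(pd.DatetimeIndex(df_codes[col]))
    df_codes[col] = positions.astype('int16')

# Decode: look the dates up again by position and rebuild the MultiIndex
df_restored = df_codes.copy()
for col in ['Event_Date', 'Observation_Date']:
    df_restored[col] = all_dates[df_codes[col].to_numpy()]
ser_restored = df_restored.set_index(['Event_Date', 'Observation_Date'])['Some_Values']

print(df_codes.dtypes)
print(ser_restored.head())
```

The lookup table `all_dates` has to be stored once alongside the series, since the identifiers are meaningless without it.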
- But when I move the columns back into index levels, they become int64 again:
ser_info_downcasted = df_info_downcasted.set_index(['Event_Date', 'Observation_Date']).squeeze()
print('ser_info_downcasted index level 0 type: ', ser_info_downcasted.index.levels[0].dtype)
print('ser_info_downcasted index level 1 type: ', ser_info_downcasted.index.levels[1].dtype)
ser_info_downcasted index level 0 type: int64
ser_info_downcasted index level 1 type: int64
- I tried converting the levels explicitly, but that failed as well:
ser_info_astyped = ser_info_downcasted.copy()
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[0].astype('int16'), level = 0)
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[1].astype('int16'), level = 1)
print('ser_info_astyped index level 0 type: ', ser_info_astyped.index.levels[0].dtype)
print('ser_info_astyped index level 1 type: ', ser_info_astyped.index.levels[1].dtype)
ser_info_astyped index level 0 type: int64
ser_info_astyped index level 1 type: int64
- So I badly need suggestions on how to explicitly convert the integer index levels to a shorter type, or alternative proposals for reducing the size of the series. I also tried appending all the series into one huge series, but that raises a memory error.
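To get a rough sense of how much the narrower type could save, the sketch below compares the in-memory size of two identifier columns stored as int64 and as int16. The row count and the random identifiers are made up for illustration. Only the figure of about 11,000 unique dates comes from the data described above, and it fits in int16 (maximum 32,767).

```python
import numpy as np
import pandas as pd

n_rows = 100_000        # hypothetical row count, not taken from the real data
n_unique_dates = 11_000  # approximate number of unique dates mentioned above

# Random identifiers standing in for the encoded date columns
rng = np.random.default_rng(0)
codes = rng.integers(0, n_unique_dates, size=(n_rows, 2))

df_int64 = pd.DataFrame(codes.astype('int64'), columns=['Event_Date', 'Observation_Date'])
df_int16 = df_int64.astype('int16')

# Bytes used by the two identifier columns only (the default RangeIndex is excluded)
bytes_int64 = int(df_int64.memory_usage(index=False).sum())
bytes_int16 = int(df_int16.memory_usage(index=False).sum())

print('int64 columns:', bytes_int64, 'bytes')
print('int16 columns:', bytes_int16, 'bytes')
```

In this setup the int16 columns take a quarter of the space of the int64 ones. The size on disk in HDF5 may differ, depending on the storage format and compression.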