  1. I have a dictionary of large-scale MultiIndex series in which both index levels are datetime values. A short abstract example of one of them:
import pandas as pd

idx_level_0 = pd.date_range('2020-01-01', '2020-04-01', freq='M')
idx_level_1 = pd.date_range('2020-04-01', '2020-07-01', freq='M')
idx_dates = pd.MultiIndex.from_product([idx_level_0, idx_level_1], names=['Event_Date', 'Observation_Date'])
ser_info_dated = pd.Series(range(len(idx_level_0) * len(idx_level_1)), index=idx_dates, name='Some_Values') / 33
  2. I need to save all the data, so I chose to write each series separately to a common HDF5 file, using its dictionary key as the HDF key. Saved as is, the file is about 4 GB, so I am trying to make it smaller. Later I also need to process all the series across their indexes, so I need some global identification scheme. My idea was to build a common collection of the dates appearing in both levels of all the series (there are about 11,000 unique dates) and replace each date with a unique integer identifier, so that the original index can be recovered for every series (the inverse mapping is sketched after the code below). But this only makes sense if I can convert the identifiers to the int16 type. So I tried the following sequence (simplified here to a single series):
# Map every unique date from both index levels to a small integer id
list_levels_dates = sorted(set(idx_level_0) | set(idx_level_1))
dict_to_numbers = dict(zip(list_levels_dates, range(len(list_levels_dates))))
df_info_numbered = ser_info_dated.reset_index().replace({'Event_Date': dict_to_numbers, 'Observation_Date': dict_to_numbers})
df_info_downcasted = df_info_numbered.copy()
df_info_downcasted[['Event_Date', 'Observation_Date']] = df_info_downcasted[['Event_Date', 'Observation_Date']].astype('int16')
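
The inverse mapping that makes the original index recoverable is simply the reversed dictionary; here is a short sketch (the dict_to_dates and df_info_recovered names are illustrative only):

# Reverse the id mapping: integer id -> original date
dict_to_dates = {number: date for date, number in dict_to_numbers.items()}
# Recover the original dates from the downcasted integer columns
df_info_recovered = df_info_downcasted.replace({'Event_Date': dict_to_dates, 'Observation_Date': dict_to_dates})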

The downcast itself seems to be a success:

print('df_info_downcasted column types:\n', df_info_downcasted.dtypes)

shows the following result:

df_info_downcasted column types:
Event_Date            int16
Observation_Date      int16
Some_Values         float64
  3. But when I move the columns back into index levels, they become int64 again:
ser_info_downcasted = df_info_downcasted.set_index(['Event_Date', 'Observation_Date']).squeeze()
print('ser_info_downcasted index level 0 type: ', ser_info_downcasted.index.levels[0].dtype)
print('ser_info_downcasted index level 1 type: ', ser_info_downcasted.index.levels[1].dtype)
ser_info_downcasted index level 0 type:  int64
ser_info_downcasted index level 1 type:  int64
  4. I tried additional manipulations, but they also failed:
ser_info_astyped = ser_info_downcasted.copy()
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[0].astype('int16'), level=0)
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[1].astype('int16'), level=1)
print('ser_info_astyped index level 0 type: ', ser_info_astyped.index.levels[0].dtype)
print('ser_info_astyped index level 1 type: ', ser_info_astyped.index.levels[1].dtype)
ser_info_astyped index level 0 type:  int64
ser_info_astyped index level 1 type:  int64
  5. So I badly need suggestions on how to explicitly convert the index integers to shorter types, or alternative proposals for reducing the series' volume. I also tried appending all the series into one huge series, but that raises a memory error.
igharok
  • 95
  • 3
  • I want to replicate your example on my machine, so I am asking: what is the size of each series, and how many series do you have? If you can add the full data to your question, that would be better. – Hamzah Oct 17 '22 at 06:39

1 Answer


TL;DR: pandas converts integer indexes to 64-bit values; the best chance to minimize the file size is compression during HDF5 serialization.

Pandas does not seem to support the int16 dtype for an index.

Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set.

source

This is further reinforced in pandas.Index.astype.

Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.

source

So essentially, our int16 values get cast to int64 when set as an index. I have not worked with HDF5 before, but I looked into what can be done to minimize the file size.
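
A minimal demonstration of the upcast (assuming pandas 1.x, where Int64Index was still the default integer index; pandas 2.x removed Int64Index and may preserve smaller integer dtypes, so the output there can differ):

>>> import numpy as np
>>> # An int16 array is silently upcast once it becomes an index
>>> pd.Index(np.arange(3, dtype='int16')).dtype
... dtype('int64')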

Looking at memory allocation

>>> print(ser_info_dated)
... Event_Date  Observation_Date
... 2020-01-31  2020-04-30          0.000000
...             2020-05-31          0.030303
...             2020-06-30          0.060606
... 2020-02-29  2020-04-30          0.090909
...             2020-05-31          0.121212
...             2020-06-30          0.151515
... 2020-03-31  2020-04-30          0.181818
...             2020-05-31          0.212121
...             2020-06-30          0.242424
>>> print(ser_info_dated.memory_usage(index=True, deep=True))
... 478 # memory usage in bytes

vs

>>> print(df_info_downcasted)
...    Event_Date  Observation_Date  Some_Values
... 0           0                 3     0.000000
... 1           0                 4     0.030303
... 2           0                 5     0.060606
... 3           1                 3     0.090909
... 4           1                 4     0.121212
... 5           1                 5     0.151515
... 6           2                 3     0.181818
... 7           2                 4     0.212121
... 8           2                 5     0.242424
>>> print(df_info_downcasted.memory_usage(index=True, deep=True))
... Index               128
... Event_Date           18
... Observation_Date     18
... Some_Values          72
... dtype: int64
>>> df_info_downcasted.info()
... <class 'pandas.core.frame.DataFrame'>
... RangeIndex: 9 entries, 0 to 8
... Data columns (total 3 columns):
...  #   Column            Non-Null Count  Dtype  
... ---  ------            --------------  -----  
...  0   Event_Date        9 non-null      int16  
...  1   Observation_Date  9 non-null      int16  
...  2   Some_Values       9 non-null      float64
... dtypes: float64(1), int16(2)
... memory usage: 236.0 bytes

We can see that most of the memory is used by the index. When saving to HDF5, though, it does not seem to matter. (I increased the date ranges to a 12-hour frequency ('12H') for testing purposes.)
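
For reference, here is a plausible reconstruction of the enlarged test data (an assumption on my part; the answer only states that the frequency was raised to '12H'):

>>> # Same ranges as the question, at 12-hour frequency: 183 x 183 points
>>> idx_level_0 = pd.date_range('2020-01-01', '2020-04-01', freq='12H')
>>> idx_level_1 = pd.date_range('2020-04-01', '2020-07-01', freq='12H')
>>> idx_dates = pd.MultiIndex.from_product([idx_level_0, idx_level_1], names=['Event_Date', 'Observation_Date'])
>>> ser_info_dated = pd.Series(range(len(idx_dates)), index=idx_dates, name='Some_Values') / 33
>>> len(ser_info_dated)
... 33489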

>>> import os
>>> ser_info_dated.to_hdf("ser.h5", "ser")
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 412.628 kb

vs

>>> df_info_downcasted.to_hdf("down.h5", "down")
>>> print(f"{os.path.getsize('down.h5')/1000} kb")
... 413.204 kb

I do not know how you save your HDF5 files, but pandas' to_hdf has a compression argument, complevel, which can be useful.

>>> ser_info_dated.to_hdf("ser.h5", "ser", complevel=9)
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 155.452 kb
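
complevel can also be combined with to_hdf's complib argument to choose the compression library (e.g. "zlib", the default, or "blosc"); which one compresses your data best is something to measure, so treat this as a sketch:

>>> # Same data, compressed with blosc instead of the default zlib
>>> ser_info_dated.to_hdf("ser_blosc.h5", "ser", complevel=9, complib="blosc")
>>> print(f"{os.path.getsize('ser_blosc.h5')/1000} kb")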
Oddaspa