
I am in the process of reducing the memory usage of my code. The code handles some big datasets, which are stored in Pandas DataFrames, if that is relevant.

Among many other data there are some small integers. Because they contain missing values (NA), Pandas stores them as float64 by default. I tried downcasting them to a smaller int format (int8 or int16, for example), but got an error because of the NAs.

It seems there is a newer integer type (Int64) that can handle missing values, but it wouldn't help with memory usage. I gave some thought to using a category, but I am not sure it wouldn't create a bottleneck further down the pipeline. Downcasting float64 to float32 seems to be my main option for reducing memory usage (rounding errors do not really matter for my use case).

Is there a better option for reducing the memory consumption of small integers with missing values?

Lucas Morin
  • Any datatype that can represent missing values is likely to be expensive, since it can't just be a byte or two. It needs to be big enough to hold all your real values, and then have extra memory for the "NA" flag. – Barmar Oct 22 '20 at 08:01
  • Could you implement NA yourself, for instance by using a signed type, restricting yourself to the non-negative range, and using -1 to hold NA? It would mean doing the cast yourself as well. – dspr Oct 22 '20 at 08:05
  • @Barmar: I had the intuition that a category with integers and NA would yield very low memory usage. Unfortunately the pipeline might need a significant rework to use category. – Lucas Morin Oct 22 '20 at 08:16
  • @dspr : interesting idea, however it would require me to handle my signed int another way and probably to also rework the pipeline significantly. – Lucas Morin Oct 22 '20 at 08:16
  • Another way would be to give a special meaning to the max value of an unsigned range (e.g. for an 8-bit unsigned integer, 255 could be reserved for NA). For instance, Apple uses NSNotFound, the max of a 64-bit range, to represent a nonexistent value. – dspr Oct 22 '20 at 09:15
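The sentinel approach suggested in the comments can be sketched as follows; NA_SENTINEL and the round-trip step are illustrative assumptions, and this only works if 255 never occurs as a real value:

```python
import numpy as np
import pandas as pd

# Assumption: real values never reach 255, so it can be reserved as the NA marker.
NA_SENTINEL = np.uint8(255)

s = pd.Series([4.0, np.nan, 3.0, 1.0])            # float64 because of the NaN
packed = s.fillna(NA_SENTINEL).astype(np.uint8)   # 1 byte per value instead of 8
mask = packed == NA_SENTINEL                      # recover missingness when needed

# Round-trip back to float64 with NaN restored
restored = packed.astype('float64').mask(mask)
```

The downside, as noted in the comments, is that every consumer of the column has to know about the sentinel.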

1 Answer


The new (Pandas v1.0+) "Integer Array" data types do allow significant memory savings. Missing values are recognized by Pandas .isnull(), and the types are also compatible with the PyArrow Feather format, which is disk-efficient for writing data. Feather requires a consistent data type per column. See the Pandas documentation here. Here is an example; note the capital 'I' in the Pandas-specific Int16 data type.

import pandas as pd
import numpy as np

dftemp = pd.DataFrame({'dt_col': ['1/1/2020',np.nan,'1/3/2020','1/4/2020'], 'int_col':[4,np.nan,3,1],
                      'float_col':[0.0,1.0,np.nan,4.5],'bool_col':[True, False, False, True],'text_col':['a','b',None,'d']})

#Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

lst_cols = ['int_col','float_col','bool_col','text_col']
lst_dtypes = ['Int16','float','bool','object']
dict_types = dict(zip(lst_cols,lst_dtypes))

#Unoptimized DataFrame    
df = pd.read_csv('MixedTypes.csv')
df

Result:

     dt_col  int_col  float_col  bool_col text_col
0  1/1/2020      4.0        0.0      True        a
1       NaN      NaN        1.0     False        b
2  1/3/2020      3.0        NaN     False      NaN
3  1/4/2020      1.0        4.5      True        d
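The float64 upcast the question describes can be reproduced directly (rebuilding the frame in memory rather than reading the CSV):

```python
import numpy as np
import pandas as pd

# NaN cannot be stored in a plain NumPy int column, so pandas upcasts to float64
df = pd.DataFrame({'int_col': [4, np.nan, 3, 1]})
print(df.dtypes['int_col'])  # float64
```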

Check memory usage (with special focus on int_col):

df.memory_usage()

Result:

Index        128
dt_col        32
int_col       32
float_col     32
bool_col       4
text_col      32
dtype: int64

Repeat with explicit assignment of column types, including Int16 for int_col:

df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types, parse_dates=['dt_col'])
print(df2)

      dt_col  int_col  float_col  bool_col text_col
0 2020-01-01        4        0.0      True        a
1        NaT     <NA>        1.0     False        b
2 2020-01-03        3        NaN     False      NaN
3 2020-01-04        1        4.5      True        d

df2.memory_usage()

Result: int_col now takes 12 bytes (4 values × 2 bytes of data plus a 4-byte validity mask). On larger-scale data, this translates into significant memory and disk-space savings in my experience:

Index        128
dt_col        32
int_col       12
float_col     32
bool_col       4
text_col      32
dtype: int64
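For columns that are already sitting in memory as float64, the same savings do not require a round trip through CSV; a minimal sketch (the column name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': [4.0, np.nan, 3.0, 1.0]})   # float64 because of the NaN

before = df['int_col'].memory_usage(index=False)          # 4 values * 8 bytes = 32
df['int_col'] = df['int_col'].astype('Int8')              # nullable 1-byte integers
after = df['int_col'].memory_usage(index=False)           # 4 data bytes + 4 mask bytes = 8
```

The nullable dtypes store a separate boolean mask, so each value costs its integer width plus one mask byte, which is still far smaller than 8 bytes of float64.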
jdland