I have a DataFrame with several string columns that I want to convert to categorical data so that I can run some models on it and extract important features.
However, due to the number of unique values, the one-hot encoded data expands into a very large number of columns, which is causing performance issues.
To combat this, I'm experimenting with the sparse=True parameter of get_dummies:
import pandas as pd

test1 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000))
test2 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000), sparse=True)
However, when I check info() on the two comparison objects, they report essentially the same memory usage. It doesn't look like sparse=True is saving any space. Why is that?
test1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries,...
dtypes: uint8(2253)
memory usage: 21.6 MB
test2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries, ...
dtypes: uint8(2253)
memory usage: 21.9 MB
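For reference, here is a minimal reproducible sketch of the comparison I'm trying to make, using a synthetic column (the real data isn't shareable, so the category names and sizes below are made up). Rather than relying on the summary line from info(), it compares byte counts from memory_usage(deep=True), which I understand is the more precise check:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for my data: one string column with many unique values.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"col1": rng.choice([f"cat_{i}" for i in range(500)], size=10_000)}
)

# One-hot encode the same column densely and sparsely.
dense = pd.get_dummies(df)
sparse = pd.get_dummies(df, sparse=True)

# Compare actual byte counts instead of info()'s summary line.
dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)
```

On my toy data the sparse result comes out much smaller by this measurement, which is why the identical numbers from info() on the real data confused me.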