
I one-hot encoded my data with the sparse=True parameter, but it doesn't seem to be doing anything.

I've tried:

# One Hot Encoding
df_ohe = pd.get_dummies(df, columns=cats, drop_first=True, sparse=True)
df_ohe = df_ohe.sparse.to_coo().tocsr()  # Explicitly convert
df_ohe.memory_usage().sum()

...which returns

AttributeError: Can only use the '.sparse' accessor with Sparse data.

Help would be appreciated. Thanks!

Stanislav Jirak

1 Answer

You will want to import the csr_matrix class (which can build a sparse matrix from a NumPy array) using

from scipy.sparse import csr_matrix

You can then just write

df_ohe = pd.get_dummies(df, columns=cats, drop_first=True)
df_ohe = csr_matrix(df_ohe.values)

Note that I removed sparse=True from the get_dummies call and changed how the result is converted to a sparse matrix.
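
To make the steps above concrete, here is a minimal sketch on toy data. The DataFrame contents, the column names, and the dtype=np.uint8 argument are my own assumptions for illustration, not part of the question. It also shows one way to estimate the size of the result, since a SciPy matrix has no memory_usage() method.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data; `df` and `cats` stand in for the question's objects.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size": ["S", "M", "S", "L"],
})
cats = ["color", "size"]

# Dense one-hot encoding; dtype=np.uint8 keeps the array numeric and compact.
df_ohe = pd.get_dummies(df, columns=cats, drop_first=True, dtype=np.uint8)

# Build a CSR matrix from the underlying NumPy array.
X = csr_matrix(df_ohe.to_numpy())

# A csr_matrix stores three arrays; summing their sizes approximates its footprint.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = df_ohe.to_numpy().nbytes
print(X.shape, X.nnz, sparse_bytes, dense_bytes)
```

On a table this small the CSR form can easily be larger than the dense array; the saving only shows up when most entries are zero, as is typical after encoding high-cardinality columns.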

Tom C
  • I marked that as the correct answer, but I'm running out of memory with that sparse matrix on a dataframe that has a lot of columns after OHE. – Stanislav Jirak Oct 16 '19 at 11:20
  • Thanks for that. After a bit of research it looks like there is a bug in get_dummies that is causing this issue. See https://stackoverflow.com/questions/51709377/pd-get-dummies-dataframe-same-size-when-sparse-true-as-when-sparse-false and https://github.com/pandas-dev/pandas/issues/18686. To solve the memory issues might require some hacking. – Tom C Oct 16 '19 at 11:24
  • Yeah, I found that too. It came with a quite recent version update; it was working previously. Hope it gets solved soon. Thanks for the comment! – Stanislav Jirak Oct 16 '19 at 11:28
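
On the memory problem raised in the comments: csr_matrix(df_ohe.values) first materialises the full dense array, which may be what exhausts memory. A possible workaround is sketched below. It assumes a pandas version in which sparse=True really yields sparse columns and the DataFrame-level .sparse accessor is available. The AttributeError in the question would be consistent with df containing columns outside cats: those stay dense, and the accessor appears to require every column to be sparse. The toy df, the price column, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data with one numeric column that is not one-hot encoded.
df = pd.DataFrame({
    "price": [1.0, 0.0, 2.5, 0.0],
    "color": ["red", "green", "blue", "red"],
    "size": ["S", "M", "S", "L"],
})
cats = ["color", "size"]

df_ohe = pd.get_dummies(df, columns=cats, drop_first=True,
                        sparse=True, dtype=np.uint8)

# Split columns by whether pandas stored them with a sparse dtype.
is_sparse = np.array([isinstance(t, pd.SparseDtype) for t in df_ohe.dtypes])
sparse_part = df_ohe.loc[:, is_sparse]
dense_part = df_ohe.loc[:, ~is_sparse]

# Only the (usually few) dense columns pass through a dense NumPy array.
blocks = [
    sparse.csr_matrix(dense_part.to_numpy(dtype=float)),
    sparse_part.sparse.to_coo(),
]

# Stack the two blocks side by side into a single CSR matrix.
X = sparse.hstack(blocks, format="csr")
print(X.shape, X.nnz)
```

The dense columns end up first in X, so keep track of the column order if you need to map features back to names.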