
I have used sklearn.preprocessing.OneHotEncoder to transform some data. The output is a scipy.sparse.csr.csr_matrix. How can I merge it back into my original dataframe along with the other columns?

I tried to use pd.concat but I get

TypeError: cannot concatenate a non-NDFrame object

Thanks

– KillerSnail

4 Answers


If A is a csr_matrix, you can use .toarray() (there is also .todense(), which produces a numpy matrix and also works with the DataFrame constructor):

df = pd.DataFrame(A.toarray())

You can then use this with pd.concat().

import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])

print(A)
  (0, 0)    1
  (0, 2)    2
  (1, 1)    3

print(type(A))
<class 'scipy.sparse.csr.csr_matrix'>

pd.DataFrame(A.todense())

   0  1  2
0  1  0  2
1  0  3  0

pd.DataFrame(A.todense()).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null int64
1    2 non-null int64
2    2 non-null int64
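Putting it together for the question, here is a minimal sketch of the merge itself; df_original and A are hypothetical stand-ins for your original dataframe and the encoder output:

import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-ins: the original two-row frame and the encoder's sparse output.
df_original = pd.DataFrame({'other_col': [10, 20]})
A = csr_matrix([[1, 0, 2], [0, 3, 0]])

# Densify the encoded matrix, give it the original index, and concatenate column-wise.
encoded = pd.DataFrame(A.toarray(), index=df_original.index)
merged = pd.concat([df_original, encoded], axis=1)
print(merged)  # 'other_col' plus the three encoded columns 0, 1, 2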

In version 0.20, pandas introduced sparse data structures, including the SparseDataFrame.

In pandas 1.0, SparseDataFrame was removed:

In older versions of pandas, the SparseSeries and SparseDataFrame classes were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.

The migration guide shows how to use these new data structures.

For instance, to create a DataFrame from a sparse matrix:

import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])

df = pd.DataFrame.sparse.from_spmatrix(A, columns=['A', 'B', 'C'])

df

   A  B  C
0  1  0  2
1  0  3  0

df.dtypes
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object

Alternatively, you can pass sparse matrices to sklearn to avoid running out of memory when converting back to pandas: just convert your other data to sparse format by passing a numpy array to the scipy.sparse.csr_matrix constructor, and use scipy.sparse.hstack to combine the two (see docs).
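A rough sketch of that fully sparse route, assuming a hypothetical frame df with a categorical column 'letter' and a numeric column 'value':

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data: one categorical and one numeric column.
df = pd.DataFrame({'letter': ['a', 'b', 'c'], 'value': [1.0, 2.0, 3.0]})

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(df[['letter']])

# Convert the remaining numeric columns to sparse and stack horizontally,
# so nothing is ever materialized as a dense array.
numeric = csr_matrix(df[['value']].to_numpy())
combined = hstack([encoded, numeric], format='csr')

print(combined.shape)  # (3, 4): three one-hot columns plus 'value'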

– Stefan

  • What can I do if my A.toarray() leads to a MemoryError? Is there any way to create the DataFrame without converting it back to an ndarray? – user77005 Dec 28 '17 at 13:52
  • You may want to take a look at pandas [sparse data structures](https://pandas.pydata.org/pandas-docs/stable/sparse.html) – Stefan Jan 13 '18 at 17:18
  • Is there any chance I can pass my values as labels for the new dataframe? E.g., if the hot encoder was given values from column 'letter' with 'a a b b c a', could my new dataframe be headed by letter_a, letter_b, etc., much like with the dummy encoder? – Anne Apr 10 '18 at 16:43
  • Solved this by passing different arguments to the dummy encoder – Anne Apr 13 '18 at 13:18
  • Caveat: if the sparse matrix is too big, it will throw a memory error, since `.toarray()` creates a dense matrix. – CKM Jul 15 '19 at 15:03
  • It's more than a caveat. This is completely useless with any sparse matrix that has any business being such. – cangrejo Jan 22 '20 at 16:32
  • @cangrejo how is `.todense()` useless if OP wants to incorporate sparse matrix into DataFrame and how is `SparseDataFrame()` useless as an alternative given memory constraints? – Stefan Jan 22 '20 at 20:12
  • @Stefan Sparse formats are often employed to store matrices too large to fit in memory. – cangrejo Jan 22 '20 at 20:59
  • @cangrejo OP explicitly intended to combine a sparse matrix with a `DataFrame`. The answer demonstrates how to do this. How is this completely useless, rather than not applicable in case the data does not fit into memory, however frequent these cases may be? – Stefan Jan 23 '20 at 18:23
  • **Outdated** since pandas 1.0 – JulesDoe Jul 02 '21 at 12:47

UPDATE for Pandas 1.0+

Per the Pandas Sparse data structures documentation, SparseDataFrame and SparseSeries have been removed.

Sparse Pandas Dataframes

Previous Way

pd.SparseDataFrame({"A": [0, 1]})

New Way

pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})

Working with SciPy sparse csr_matrix

Previous Way

from scipy.sparse import csr_matrix
matrix = csr_matrix((3, 4), dtype=np.int8)
df = pd.SparseDataFrame(matrix, columns=['A', 'B', 'C', 'D'])

New Way

from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd

matrix = csr_matrix((3, 4), dtype=np.int8)
df = pd.DataFrame.sparse.from_spmatrix(matrix, columns=['A', 'B', 'C', 'D'])
df.dtypes

Output:

A    Sparse[int8, 0]
B    Sparse[int8, 0]
C    Sparse[int8, 0]
D    Sparse[int8, 0]
dtype: object

Conversion from Sparse to Dense

df.sparse.to_dense()

Output:

   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0

Sparse Properties

df.sparse.density

Output:

0.0
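For the original question of merging back with other columns, here is a minimal sketch (df_original is a hypothetical stand-in for your existing dataframe) that keeps the sparse dtypes intact:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical original frame alongside a sparse encoder output of matching length.
df_original = pd.DataFrame({'other_col': ['x', 'y', 'z']})
matrix = csr_matrix((3, 4), dtype=np.int8)

# Wrap the sparse matrix without densifying it, then concatenate column-wise.
sparse_df = pd.DataFrame.sparse.from_spmatrix(matrix, columns=['A', 'B', 'C', 'D'])
sparse_df.index = df_original.index  # align indices before concatenating
merged = pd.concat([df_original, sparse_df], axis=1)

print(merged.dtypes)  # 'other_col' stays dense; A, B, C, D keep their Sparse[int8, 0] dtype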
– Christopher Peisert

You could also avoid getting a sparse matrix back in the first place by setting the parameter sparse to False when creating the encoder.

The documentation of the OneHotEncoder states:

sparse : boolean, default=True

Will return sparse matrix if set True else will return an array.

You can then call the DataFrame constructor to turn the numpy array into a DataFrame, for example:
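Here is a minimal sketch of that approach, assuming a hypothetical frame df with a categorical column 'letter' (get_feature_names_out is available in scikit-learn 1.0+; older versions expose get_feature_names instead):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one categorical column and one numeric column.
df = pd.DataFrame({'letter': ['a', 'a', 'b', 'c'], 'value': [1, 2, 3, 4]})

# sparse=False makes fit_transform return a dense numpy array directly
# (newer scikit-learn versions renamed this parameter to sparse_output).
encoder = OneHotEncoder(sparse=False)
encoded = encoder.fit_transform(df[['letter']])

# Wrap the array in a DataFrame, naming the columns after the categories,
# then concatenate with the remaining original columns.
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['letter']),
                          index=df.index)
result = pd.concat([df.drop(columns='letter'), encoded_df], axis=1)
print(result)  # 'value' plus letter_a, letter_b, letter_c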

– scriptator

Just like scriptator's answer, but here is the code for it:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode the categorical column at index 0 and pass the rest through unchanged.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse=False), [0])],
    remainder="passthrough")
g = ct.fit_transform(X)

pd.DataFrame(g)

Pass your dataset as X to ct.fit_transform(). The [0] in the transformer tuple is the index of the categorical independent variable in your dataset.
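A rough usage sketch with a hypothetical two-column dataset X; in scikit-learn 1.0+ the transformed column names can also be recovered via ct.get_feature_names_out():

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical dataset: the categorical column sits at index 0.
X = pd.DataFrame({'letter': ['a', 'b', 'a'], 'value': [1, 2, 3]})

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse=False), [0])],
    remainder="passthrough")
g = ct.fit_transform(X)

# Recover readable column names where supported (scikit-learn 1.0+).
result = pd.DataFrame(g, columns=ct.get_feature_names_out())
print(result)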

– Mohamed Fathallah