
Current pandas version: 0.22


I have a SparseDataFrame.

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])

A

   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a

Right now, the fill values are 0. However, I'd like to change the fill_values to np.nan. My first instinct was to call replace:

A.replace(0, np.nan)

But this gives

TypeError: cannot convert int to an sparseblock

Which doesn't really help me understand what I'm doing wrong.

I know I can do

A.to_dense().replace(0, np.nan).to_sparse()

But is there a better way? Or is my fundamental understanding of sparse DataFrames flawed?
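For what it's worth, in modern pandas (1.0+) `SparseDataFrame` has been removed entirely and sparsity lives in the column dtype, so the dense round trip above would be sketched like this (a made-up numeric toy frame, not the one above):

```python
import numpy as np
import pandas as pd

# Sketch for pandas >= 1.0: sparse column dtypes instead of SparseDataFrame.
df = pd.DataFrame([[1.0, 0, 0], [0, 0, 2.0]]).astype(pd.SparseDtype("float64", 0))

# Densify, replace the old fill value, then re-sparsify with NaN as fill.
dense = df.sparse.to_dense()
result = dense.replace(0, np.nan).astype(pd.SparseDtype("float64", np.nan))
```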

cs95
  • Even `mask` returns an error. I tried it today. – Bharath M Shetty Jan 09 '18 at 04:31
  • Yes, `mask` and `where` return something else... that's for another question, I suppose. – cs95 Jan 09 '18 at 04:31
  • Sparse DataFrames are new, so I don't think there's another way than `to_dense` for replacement. I think replacement with NaNs breaks the sparse structure. – Bharath M Shetty Jan 09 '18 at 04:39
  • I tried it also, then I gave up... – BENY Jan 09 '18 at 04:52
  • EDIT: Adding to the bounty message, I'd like to understand why I receive this error and what the canonical way of doing this would be. Thanks! – cs95 Jan 11 '18 at 06:09
  • This is really strange. `A.replace(0, np.nan)` works fine on my computer! What pandas version are you using? I'm using 20.1. – Qusai Alothman Jan 11 '18 at 12:26
  • Oh. I've just updated my pandas version, and the error you mentioned occurred! Something changed between 20.1 and the current version of pandas. Maybe it's a bug? – Qusai Alothman Jan 11 '18 at 12:30
  • SparseDataFrame has so many bugs to be fixed. It's better we report this on GitHub. – Bharath M Shetty Jan 11 '18 at 14:08
  • @JohnE I guess that was one of my "fundamental misunderstandings", where I thought `0` is the default fill value. I did not consider adding the extra argument. However, that still doesn't fix the `replace` issue. – cs95 Jan 11 '18 at 21:52
  • OK, gotcha. Just checking you understood fill_value (since your example doesn't even save space). I don't know exactly what's going on here, but generally speaking the operations available on a sparse df are quite a bit more limited than with a regular df, so I don't think results like you found here are especially rare, unfortunately. – JohnE Jan 11 '18 at 22:01
  • @JohnE That's unfortunate. This is a promising API which I'd like to get my hands dirty with. Guess I'll have to stick to scipy's API. – cs95 Jan 11 '18 at 22:03
  • @Dark SparseDataFrame's `mask` and `where` never worked in any version of pandas in the first place :). – Qusai Alothman Jan 12 '18 at 06:38
  • @QusaiAlothman well, it should have worked. As I said, it needs a little extra attention for bug fixes. Replacing 0 with NaN is exactly the job of `mask`. – Bharath M Shetty Jan 12 '18 at 06:40
  • @Dark I know, but it seems that everything in SparseDataFrame is broken. Simple arithmetic methods like `abs` and `sum` don't work either! – Qusai Alothman Jan 12 '18 at 06:50

2 Answers


tl;dr : That's definitely a bug.
But please keep reading, there is more than that...

All of the following work fine with pandas 0.20.3, but not with any newer version:

A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])

etc... (you get the idea).

(From now on, all the code is run with pandas 0.20.3.)

However, these (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess it right away if we do this:

A.density

1.0

This SparseDataFrame is actually dense!
We can fix this by passing `default_fill_value=0`:

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']], default_fill_value=0)

Now A.density will output 0.25 as expected.

This happened because the initializer couldn't infer the dtypes of the columns. Quoting from pandas docs:

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:

  • float64: np.nan
  • int64: 0
  • bool: False
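In current pandas (1.0+), the same defaults can be checked directly on `SparseDtype`, which replaced the frame-level `default_fill_value`; a quick sanity check:

```python
import numpy as np
import pandas as pd

# The default fill_value is derived from the subtype, matching the
# table quoted above (pandas >= 1.0 SparseDtype API).
print(pd.SparseDtype("float64").fill_value)  # nan
print(pd.SparseDtype("int64").fill_value)    # 0
print(pd.SparseDtype("bool").fill_value)     # False
```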

But the dtypes of our SparseDataFrame are:

A.dtypes

0    object
1    object
2    object
3    object
dtype: object

And that's why SparseDataFrame couldn't decide which fill value to use, and thus used the default np.nan.

OK, so now we have a SparseDataFrame. Let's try to replace some entries in it:

A.replace('a','z')

    0   1   2   3
0   z   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   z

And strangely:

A.replace(0,np.nan)

    0   1   2   3
0   a   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   a

And that, as you can see, is not correct!

From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only with non-fill values. To change the fill value, you have the following options:

  • According to pandas docs, if you change the dtypes, that will automatically change the fill value. (That didn't work for me.)
  • Convert into a dense DataFrame, do the replacement, then convert back into SparseDataFrame.
  • Manually reconstruct a new SparseDataFrame, as in Wen's answer, or by passing default_fill_value set to the new fill value.
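In today's pandas (1.0+), the manual reconstruction from the last option is done per column with `SparseArray`; a minimal sketch on made-up numeric data:

```python
import numpy as np
import pandas as pd

# Rebuild the sparse data with a new fill_value instead of replace().
old = pd.arrays.SparseArray([1.0, 0, 0, 2.0], fill_value=0)

dense = np.asarray(old)                       # back to a dense ndarray
dense = np.where(dense == 0, np.nan, dense)   # swap fill value 0 -> NaN
new = pd.arrays.SparseArray(dense, fill_value=np.nan)
```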

While I was experimenting with the last option, something even stranger happened:

B = pd.SparseDataFrame(A,default_fill_value=np.nan)

B.density
0.25

B.default_fill_value
nan

So far, so good. But...:

B
    0   1   2   3
0   a   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   a

That really shocked me at first. Is that even possible!?
Continuing on, I tried to see what is happening in the columns:

B[0]

0    a
1    0
2    0
3    0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
There are a lot more "strange" things going on, but I'll stop here.
From all the above, I can say that you should avoid using SparseDataFrame until a complete rewrite takes place :).
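If the data is purely numeric, the scipy route mentioned in the comments sidesteps fill values entirely, since a scipy.sparse matrix simply does not store the zeros; a minimal sketch:

```python
import numpy as np
from scipy import sparse

# scipy.sparse stores only the nonzero entries, so there is no
# fill_value to replace in the first place.
m = sparse.csr_matrix(np.array([[1.0, 0, 0, 2.0],
                                [0, 0, 0, 3.0],
                                [0, 0, 0, 0]]))
print(m.nnz)  # number of stored (nonzero) values
```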

Qusai Alothman
  • Interesting read. So, your verdict is this API is still in its infancy, and not yet suitable for use in a production environment? – cs95 Jan 11 '18 at 19:21
  • @cᴏʟᴅsᴘᴇᴇᴅ No, not really. FWIK, the sparse types were introduced in pandas 0.16 (as I remember). I've been heavily using `SparseDataFrame` without much issues, but I was using old versions of pandas. It seems that only the new versions have a lot of critical bugs. – Qusai Alothman Jan 11 '18 at 19:26
  • @cᴏʟᴅsᴘᴇᴇᴅ I've been playing around with SparseDataFrame since yesterday. Almost everything is broken in the latest version of pandas!. Even simple arithmetic methods (like abs and sum) don't work. I'd suggest using SparseDataFrame only if you want to save space on disk when pickling (at least this is working!). – Qusai Alothman Jan 12 '18 at 06:47
  • Or downgrade to 0.20.3 :D – Qusai Alothman Jan 12 '18 at 06:48

This is what I have tried:

pd.SparseDataFrame(np.where(A==0, np.nan, A))

     0    1    2    3
0    a  NaN  NaN    b
1  NaN  NaN  NaN    c
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN    a
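A modern-pandas (1.0+) analogue of this answer can be sketched with a `SparseDtype` cast; a numeric frame is used here for illustration, since `np.where` produces a single-dtype array:

```python
import numpy as np
import pandas as pd

# Build the NaN-filled values densely, then cast the columns to a
# sparse dtype whose fill value is NaN.
A = pd.DataFrame([[1.0, 0, 0, 2.0],
                  [0, 0, 0, 3.0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 1.0]])
B = pd.DataFrame(np.where(A == 0, np.nan, A)).astype(pd.SparseDtype("float64", np.nan))
```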
BENY
  • Yeah, this is another workaround! I think there's something fundamental about these structures that I'm missing. – cs95 Jan 09 '18 at 05:00
  • @cᴏʟᴅsᴘᴇᴇᴅ yeah, I am looking into the source code, since Sparse is based on scipy – BENY Jan 09 '18 at 05:02
  • @cᴏʟᴅsᴘᴇᴇᴅ and immutability is the problem. The change will be blocked. – BENY Jan 09 '18 at 05:04