1

I have sparse data stored in a dataframe:

df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})

a   b   data
0   1   2   -0.824022
1   3   5   0.503239
2   5   5   -0.540105

Since I care about the null data the actual data would look like this:

true_df

    a   b   data
0   1   1   NaN
1   1   2   -0.824022
2   1   3   NaN
3   1   4   NaN
4   1   5   NaN
5   2   1   NaN
6   2   2   NaN
7   2   3   NaN
8   2   4   NaN
9   2   5   NaN
10  3   1   NaN
11  3   2   NaN
12  3   3   NaN
13  3   4   NaN
14  3   5   0.503239
15  4   1   NaN
16  4   2   NaN
17  4   3   NaN
18  4   4   NaN
19  4   5   NaN
20  5   1   NaN
21  5   2   NaN
22  5   3   NaN
23  5   4   NaN
24  5   5   -0.540105

My question is how do I construct true_df? I was hoping there was some way to use pd.concat or pd.merge, that is, construct a dataframe that is the shape of the dense table and then join the two dataframes but that doesn't join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.

As a follow up because I think kinjo is correct, why does this only work for integers and not for floats? Using:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})

### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]

### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()

Will return:

    a   b   data
0   1.0 1.0 NaN
1   1.0 1.1 NaN
2   1.0 1.2 NaN
3   1.0 1.3 NaN
4   1.0 1.4 NaN
5   1.0 1.5 NaN
6   1.0 1.6 NaN
7   1.1 1.0 NaN
8   1.1 1.1 NaN
9   1.1 1.2 NaN
10  1.1 1.3 NaN
11  1.1 1.4 NaN
12  1.1 1.5 NaN
13  1.1 1.6 NaN
14  1.2 1.0 NaN
15  1.2 1.1 NaN
16  1.2 1.2 NaN
17  1.2 1.3 NaN
18  1.2 1.4 NaN
19  1.2 1.5 NaN
20  1.2 1.6 NaN
21  1.3 1.0 NaN
22  1.3 1.1 NaN
23  1.3 1.2 NaN
24  1.3 1.3 NaN
25  1.3 1.4 NaN
26  1.3 1.5 NaN
27  1.3 1.6 NaN
28  1.4 1.0 NaN
29  1.4 1.1 NaN
30  1.4 1.2 NaN
31  1.4 1.3 NaN
32  1.4 1.4 NaN
33  1.4 1.5 NaN
34  1.4 1.6 NaN
35  1.5 1.0 NaN
36  1.5 1.1 NaN
37  1.5 1.2 NaN
38  1.5 1.3 NaN
39  1.5 1.4 NaN
40  1.5 1.5 NaN
41  1.5 1.6 NaN
42  1.6 1.0 NaN
43  1.6 1.1 NaN
44  1.6 1.2 NaN
45  1.6 1.3 NaN
46  1.6 1.4 NaN
47  1.6 1.5 NaN
48  1.6 1.6 NaN
mnky9800n
  • 1,113
  • 2
  • 15
  • 33

3 Answers3

1

Since you intend to pivot an a and b, you could obtain the pivoted result with

import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})

result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)

which yields

    1         2   3   4         5
1 NaN  0.436389 NaN NaN       NaN
2 NaN       NaN NaN NaN       NaN
3 NaN       NaN NaN NaN -1.066621
4 NaN       NaN NaN NaN       NaN
5 NaN       NaN NaN NaN  0.328880
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
1

Reindex is a straightforward solution. Similar to @jezrael's solution, but no need for merge.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})

### Create all possible combinations of a,b
newindex = [(b,a) for b in range(1,df.b.max()+1) for a in range(1,df.a.max()+1)]

### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)

You can then reset the index if you want the numeric count as your overall index.

In the case that your index is floats you should use linspace and not arange:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})

### Create all possible combinations of a,b
newindex = [(b,a) for b in np.linspace(a_min, a_max, a_step, endpoint=False) for a in np.linspace(b_min, b_max, b_step, endpoint=False)]

### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
mnky9800n
  • 1,113
  • 2
  • 15
  • 33
kinjo
  • 76
  • 2
  • see my edit above: follow up question, if we change `a` and `b` to be floats, this fails. Why? example: `[1] df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})` `[2] newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]` – mnky9800n May 13 '16 at 15:41
  • @mnky9800n Not immediately sure why reindex fails on floats. The following stack issue somewhat discusses: http://stackoverflow.com/questions/17351233/using-pandas-reindex-with-floats-interpolation – kinjo May 13 '16 at 17:34
  • I'm assuming it has to do with messy floating point errors. The [numpy.arange documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.arange.html) mentions this which coincides with the stack issue you posted. – mnky9800n May 15 '16 at 13:58
  • Rounding a,b fixed the issue for me. Floating point accuracy was causing the issue. `(round(a,1), round(b,1)) ...` – kinjo May 17 '16 at 18:11
0

This is a nice fast approach for converting numeric data from sparse to dense, using SciPy's sparse functionality. Works if your ultimate goal is the pivoted (i.e. dense) dataframe:

import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1

sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)
s_pike
  • 1,710
  • 1
  • 10
  • 22