
I have a Pandas data frame object of shape (X,Y) that looks like this:

[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]

and a SciPy sparse matrix (CSC format) of shape (X,Z) that looks something like this:

[[0, 1, 0],
[0, 0, 1],
[1, 0, 0]]

How can I add the content from the matrix to the data frame in a new named column such that the data frame will end up like this:

[[1, 2, 3, [0, 1, 0]],
[4, 5, 6, [0, 0, 1]],
[7, 8, 9, [1, 0, 0]]]

Notice the data frame now has shape (X, Y+1) and rows from the matrix are elements in the data frame.

Mihai Damian

5 Answers

100
import numpy as np
import pandas as pd
import scipy.sparse as sparse

df = pd.DataFrame(np.arange(1,10).reshape(3,3))
arr = sparse.coo_matrix(([1,1,1], ([0,1,2], [1,2,0])), shape=(3,3))
df['newcol'] = arr.toarray().tolist()
print(df)

yields

   0  1  2     newcol
0  1  2  3  [0, 1, 0]
1  4  5  6  [0, 0, 1]
2  7  8  9  [1, 0, 0]
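
The question mentions a CSC matrix, while the snippet above builds a COO matrix. The following is a minimal sketch of the same idea with `scipy.sparse.csc_matrix`, plus a round trip back to a sparse matrix. The names `csc` and `restored` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3))

# CSC matrix with the same nonzero pattern as in the question
csc = sparse.csc_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))

# Densify the matrix, then store each row as a Python list
# in a single object-dtype column
df['newcol'] = csc.toarray().tolist()

# Rebuild a CSC matrix from the column of lists
restored = sparse.csc_matrix(np.array(df['newcol'].tolist()))
```

Note that `toarray()` materializes the full dense matrix, so this only suits matrices that fit in memory when dense.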
unutbu
  • 6
    I guess we can't really provide bulletproof shoes for users who insist on doing things like this :/ – Phillip Cloud Sep 05 '13 at 21:33
  • 9
    There are [interesting things you can do with a column of lists](http://stackoverflow.com/a/16637607/190597), so I'd rather not assume this is necessarily a bad idea. Though I agree there is a high chance that it is. – unutbu Sep 05 '13 at 21:41
  • 1
    That's a wonderful example of `pandas` flexibility. In the case of *this* question, the data are already of homogeneous numeric type with equal-shaped rows, whereas in that example they are `list`s of different length. I agree that there are interesting things you can do. However, when you've already got a matrix why turn it into a list of lists? – Phillip Cloud Sep 05 '13 at 21:47
  • In any case this question is a duplicate of http://stackoverflow.com/q/18641148/564538. – Phillip Cloud Sep 05 '13 at 21:53
  • 1
    The "interesting thing" there is... making it *not* a column of lists anymore (so it's useful)! – Andy Hayden Sep 05 '13 at 21:55
  • 71
    The world is a better place when creative people are allowed to do things everyone else thinks is stupid. :) – unutbu Sep 05 '13 at 22:00
  • :) There's definitely a place for objects in DataFrame... but it's none of those examples! – Andy Hayden Sep 05 '13 at 22:02
  • @unutbu "The world is a better place when creative people are allowed to do things everyone else thinks is stupid." So true. – Phillip Cloud Sep 05 '13 at 22:06
  • @unutbu Yes, but that doesn't change the fact that most deviations from the norm *are* stupid. – jpmc26 Sep 25 '19 at 12:41
  • It would be easier to understand if you remove scipy from example – Leszek Zarna Nov 24 '20 at 15:05
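
As a sketch of the alternative raised in the comments (keeping the matrix as regular columns instead of a column of lists), one option is `pd.DataFrame.sparse.from_spmatrix`, which assumes pandas 0.25 or later. The column names `a`, `b`, `c`, `s0`, `s1`, `s2` and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=['a', 'b', 'c'])
csc = sparse.csc_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))

# One sparse-backed column per matrix column; the matrix is never densified
sp_df = pd.DataFrame.sparse.from_spmatrix(
    csc, index=df.index, columns=['s0', 's1', 's2']
)

# Place the matrix columns next to the original ones
wide = pd.concat([df, sp_df], axis=1)
```

The result has shape (X, Y+Z) rather than (X, Y+1), so it does not match the layout the question asks for, but each value stays a scalar that pandas can operate on directly.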
12
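import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1,10).reshape(3,3))
df['newcol'] = pd.Series(list(your_2d_numpy_array))  # list(...) splits the 2-D array into rows; pd.Series rejects 2-D input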
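
A runnable sketch of this approach, under two assumptions: the 2-D array is split into rows with `list(...)`, because `pd.Series` only accepts one-dimensional data, and the Series is given the data frame's index so that rows stay aligned when the index is not the default 0..n-1. The array `arr` and the index labels are stand-ins for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3), index=['x', 'y', 'z'])
arr = np.eye(3, dtype=int)  # stand-in for your_2d_numpy_array

# list(arr) yields one 1-D row array per element;
# index=df.index keeps each row array on the matching data frame row
df['newcol'] = pd.Series(list(arr), index=df.index)
```

Unlike the `tolist()` variant, each cell here holds a NumPy array rather than a Python list.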
Max Bileschi
11

Consider using a higher-dimensional data structure (a Panel) rather than storing an array in your column:

In [11]: p = pd.Panel({'df': df, 'csc': csc})

In [12]: p.df
Out[12]: 
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

In [13]: p.csc
Out[13]: 
   0  1  2
0  0  1  0
1  0  0  1
2  1  0  0

Look at cross-sections etc, etc, etc.

In [14]: p.xs(0)
Out[14]: 
   csc  df
0    0   1
1    1   2
2    0   3

See the docs for more on Panels.

Andy Hayden
  • 13
    Panel is now deprecated – guhur May 25 '17 at 10:00
  • 1
    Yes, usually MultiIndex is recommended nowadays. Created e.g. via `pd.concat([df, csc], axis=1, keys=["df", "csc"])`. – Andy Hayden May 25 '17 at 18:28
  • `A = np.eye(3); df = pd.concat( [A,A], axis=1 )` -> TypeError: cannot concatenate a non-NDFrame object in 20.2 ? (A wiki of "pandas-deprecated-now-use-this" would be nice.) – denis Aug 19 '17 at 11:22
  • @denis try `A = pd.DataFrame(np.eye(3)); df = pd.concat( [A,A], axis=1, keys=["A", "B"] )` – Andy Hayden Aug 19 '17 at 17:43
  • Thanks, `df.columns MultiIndex(levels=[[u'A', u'B'], [0, 1, 2]]` (slaps forehead) – denis Aug 20 '17 at 16:39
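
Since the comments note that Panel is deprecated, here is a minimal sketch of the MultiIndex approach suggested there, using the `pd.concat(..., keys=...)` call from the comment. The frame `csc_df` is a dense stand-in for the sparse matrix, and `block` and `row0` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3))
csc_df = pd.DataFrame([[0, 1, 0], [0, 0, 1], [1, 0, 0]])  # dense stand-in

# Columns become a two-level MultiIndex: ('df', 0..2) and ('csc', 0..2)
combined = pd.concat([df, csc_df], axis=1, keys=['df', 'csc'])

# Select one block by its top-level key
block = combined['csc']

# Rough counterpart of the Panel cross-section p.xs(0):
# take row 0 and move the top level into the columns
row0 = combined.loc[0].unstack(level=0)
```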
7

You can store a numpy array in a dataframe column and retrieve it again like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'b':range(10)}) # target dataframe
a = np.random.normal(size=(10,2)) # numpy array
df['a']=a.tolist() # save array
np.array(df['a'].tolist()) # retrieve array

This builds on the previous answer, which confused me because of the sparse part. This version works well for a non-sparse numpy array.
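
A small sketch of what the round trip looks like in practice. The stored column has `object` dtype, so numeric work is easiest after restoring the 2-D array. The seeded generator and the `a_norm` column are additions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # seeded so the example is repeatable
a = rng.normal(size=(10, 2))          # numpy array to store

df = pd.DataFrame({'b': range(10)})
df['a'] = a.tolist()                  # each cell holds a 2-element list

print(df['a'].dtype)                  # object

# Restore the 2-D array before doing vectorized math on it
restored = np.array(df['a'].tolist())
df['a_norm'] = np.linalg.norm(restored, axis=1)
```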

citynorman
5

Here is another example:

import numpy as np
import pandas as pd

# Create a list of tuples; each element of a tuple is an array
a = [(np.random.randint(1, 10, 10), np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
     for i in range(0, 10)]

# pandas DataFrame places each array of a tuple in its own column
df = pd.DataFrame(data=a, columns=['random_num', 'sequential_num'])

In general, the trick is to arrange the data in the form a = [ (array_11, array_12,...,array_1n),...,(array_m1,array_m2,...,array_mn) ]; pandas DataFrame will then lay the data out as m rows and n columns of arrays. Lists can be used instead of tuples, in which case the form is: a = [ [array_11, array_12,...,array_1n],...,[array_m1,array_m2,...,array_mn] ]

This is the output if you print(df) from the code above:

                       random_num                  sequential_num
0  [7, 9, 2, 2, 5, 3, 5, 3, 1, 4]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1  [8, 7, 9, 8, 1, 2, 2, 6, 6, 3]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2  [3, 4, 1, 2, 2, 1, 4, 2, 6, 1]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3  [3, 1, 1, 1, 6, 2, 8, 6, 7, 9]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4  [4, 2, 8, 5, 4, 1, 2, 2, 3, 3]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5  [3, 2, 7, 4, 1, 5, 1, 4, 6, 3]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6  [5, 7, 3, 9, 7, 8, 4, 1, 3, 1]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7  [7, 4, 7, 6, 2, 6, 3, 2, 5, 6]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8  [3, 1, 6, 3, 2, 1, 5, 2, 2, 9]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
9  [7, 2, 3, 9, 5, 5, 8, 6, 9, 8]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Another variation of the example above:

b = [ (i,"text",[14, 5,], np.array([0,1,2,3,4,5,6,7,8,9]))  for i in 
range(0,10) ]
df = pd.DataFrame(data=b,columns=['Number','Text','2Elemnt_array','10Element_array'])

Output of df:

   Number  Text 2Elemnt_array                 10Element_array
0       0  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1       1  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2       2  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3       3  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4       4  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5       5  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6       6  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7       7  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8       8  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
9       9  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

If you want to add other columns of arrays, then:

df['3Element_array'] = [[1, 2, 3] for _ in range(10)]

The final output of df will be:

   Number  Text 2Elemnt_array                 10Element_array 3Element_array
0       0  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
1       1  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
2       2  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
3       3  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
4       4  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
5       5  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
6       6  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
7       7  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
8       8  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
9       9  text       [14, 5]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      [1, 2, 3]
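
The general m-by-n layout described earlier in this answer can be sketched as follows. The sizes `m` and `n`, the `col_j` column names, and the array contents are arbitrary choices for illustration; arrays in different columns are deliberately given different lengths to show that cells need not match in size.

```python
import numpy as np
import pandas as pd

m, n = 4, 3  # illustrative sizes: m rows, n columns of arrays

# Row i is a tuple of n arrays; the array in column j has j + 2 elements
rows = [tuple(np.arange(i, i + j + 2) for j in range(n)) for i in range(m)]

df = pd.DataFrame(data=rows, columns=[f'col_{j}' for j in range(n)])
```

Each cell of `df` then holds one NumPy array, and every column has `object` dtype.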
jtlz2
Jorge Vilchis