
I have this sparse Spark dataframe:

In [50]: data.show()
+---------+-------+---------+-------+-------+--------+
|      pid| 111516|   387745|1211811|1857606| 2187005|
+---------+-------+---------+-------+-------+--------+
| 65197201|    0.0|      0.0|50239.0|    0.0|     0.0|
| 14040501|89827.0|      0.0|    0.0|    0.0|     0.0|
|887847003|    0.0|      0.0|    0.0|    0.0|190560.0|
|778121903|    0.0|      0.0|    0.0|95600.0|     0.0|
| 20907001|    0.0|8727749.0|    0.0|    0.0|     0.0|
+---------+-------+---------+-------+-------+--------+

I transform it into a two-column dataframe: the pid index and the data as sparse vectors:

from pyspark.ml.feature import VectorAssembler

input_cols = [x for x in data.columns if x != 'pid']
sparse_vectors = (VectorAssembler(inputCols=input_cols, outputCol="features")
                  .transform(data)
                  .select("pid", "features"))

In [46]: sparse_vectors.show()
+---------+-------------------+
|      pid|           features|
+---------+-------------------+
| 65197201|  (5,[2],[50239.0])|
| 14040501|  (5,[0],[89827.0])|
|887847003| (5,[4],[190560.0])|
|778121903|  (5,[3],[95600.0])|
| 20907001|(5,[1],[8727749.0])|
+---------+-------------------+
In [51]: sparse_vectors.dtypes
Out[51]: [('pid', 'string'), ('features', 'vector')]

What is the most efficient way to convert this to any scipy.sparse type without collecting? I'm working with large matrices, so pulling everything back to the driver isn't a preferred option.

xv70

1 Answer


What is the sparse matrix supposed to look like?

Just eyeballing the table, and ignoring the pid headings, I can generate a sparse matrix with:

In [456]: from scipy import sparse
In [457]: rows = [0,1,2,3,4]
In [458]: cols = [2,0,4,3,1]
In [459]: vals = [50239.0,89827.0,190560.0,95600.0,8727749.0]
In [460]: M = sparse.coo_matrix((vals,(rows,cols)),shape=(5,5))
In [461]: M
Out[461]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [462]: M.A
Out[462]: 
array([[       0.,        0.,    50239.,        0.,        0.],
       [   89827.,        0.,        0.,        0.,        0.],
       [       0.,        0.,        0.,        0.,   190560.],
       [       0.,        0.,        0.,    95600.,        0.],
       [       0.,  8727749.,        0.,        0.,        0.]])

While I know the scipy end of things well, I don't know pyspark. Pandas has its own sparse representation, and some functions for creating scipy matrices from that. I've followed a few SO questions about that (which might be dated).
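
Since any scipy.sparse matrix ultimately lives in a single machine's memory, some collect is unavoidable; the most you can push to the cluster is extracting the nonzero (row, col, value) triples in parallel, so that only the nonzeros, not the dense table, ever reach the driver. A rough sketch of that idea (my own guess at the pyspark side, not something from the answer above; it assumes the row order produced by zipWithIndex is an acceptable row numbering for the matrix):

from scipy import sparse

# Pair each row with a stable row index (assumption: this ordering is an
# acceptable row numbering for the resulting matrix).
indexed = sparse_vectors.rdd.zipWithIndex()   # (Row(pid, features), row_idx)

# On the executors, emit one (row, col, val) triple per stored nonzero.
triples = indexed.flatMap(
    lambda pair: [(pair[1], int(c), float(v))
                  for c, v in zip(pair[0].features.indices,
                                  pair[0].features.values)])

n_rows = sparse_vectors.count()
n_cols = sparse_vectors.first().features.size

# Only the nonzero triples cross the wire, not the dense matrix.
rows, cols, vals = zip(*triples.collect())
M = sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

Whether that counts as "distributed" depends on your expectations: the triple extraction runs on the cluster, but scipy has no distributed container, so the final matrix is still assembled on the driver.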

hpaulj
    That's right, it's easy to generate a sparse matrix with scipy by collecting the data on the driver node of the cluster, but I would like to do this in a distributed way. – xv70 Oct 05 '17 at 15:34