I have this sparse Spark dataframe:
In [50]: data.show()
+---------+-------+---------+-------+-------+--------+
| pid| 111516| 387745|1211811|1857606| 2187005|
+---------+-------+---------+-------+-------+--------+
| 65197201| 0.0| 0.0|50239.0| 0.0| 0.0|
| 14040501|89827.0| 0.0| 0.0| 0.0| 0.0|
|887847003| 0.0| 0.0| 0.0| 0.0|190560.0|
|778121903| 0.0| 0.0| 0.0|95600.0| 0.0|
| 20907001| 0.0|8727749.0| 0.0| 0.0| 0.0|
+---------+-------+---------+-------+-------+--------+
I transform it into a two-column dataframe: the pid index plus the remaining columns assembled into a single sparse vector:
from pyspark.ml.feature import VectorAssembler

input_cols = [x for x in data.columns if x != 'pid']
sparse_vectors = (VectorAssembler(inputCols=input_cols, outputCol="features")
                  .transform(data).select("pid", "features"))
In [46]: sparse_vectors.show()
+---------+-------------------+
| pid| features|
+---------+-------------------+
| 65197201| (5,[2],[50239.0])|
| 14040501| (5,[0],[89827.0])|
|887847003| (5,[4],[190560.0])|
|778121903| (5,[3],[95600.0])|
| 20907001|(5,[1],[8727749.0])|
+---------+-------------------+
In [51]: sparse_vectors.dtypes
Out[51]: [('pid', 'string'), ('features', 'vector')]
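Each features entry is a pyspark.ml.linalg.SparseVector, which already exposes the nonzero structure any scipy conversion would need; a quick check on one row (constructing the vector by hand here just to illustrate):

from pyspark.ml.linalg import SparseVector

v = SparseVector(5, [2], [50239.0])  # first row of the table above
v.size     # 5 -- full vector length
v.indices  # array([2]) -- positions of the nonzeros
v.values   # array([50239.]) -- the nonzero values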
What is the most efficient way to convert this to any scipy.sparse type without collecting? I'm working with large matrices, so pulling everything to the driver isn't a preferred option.
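For reference, a minimal sketch of the collect-based conversion I'd like to avoid, building a scipy.sparse.csr_matrix from COO-style triplets (this pulls every row to the driver):

from scipy.sparse import csr_matrix

rows, cols, vals, pids = [], [], [], []
for i, row in enumerate(sparse_vectors.collect()):  # collect() is the bottleneck
    pids.append(row.pid)
    rows.extend([i] * len(row.features.indices))
    cols.extend(row.features.indices)
    vals.extend(row.features.values)

mat = csr_matrix((vals, (rows, cols)), shape=(len(pids), len(input_cols)))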