I have a column in libsvm format (Spark's ml library): field1:value field2:value ...
+--------------+-----+
| features|label|
+--------------+-----+
| a:1 b:2 c:3| 0|
| a:4 b:5 c:6| 0|
| a:7 b:8 c:9| 1|
|a:10 b:11 c:12| 0|
+--------------+-----+
I want to extract the values and save them in an array for each row in PySpark.
features.printSchema()
root
|-- features: string (nullable = false)
|-- label: integer (nullable = true)
I am using the following UDF, because the affected column is part of a DataFrame:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
features_expl = udf(lambda features: Vectors.dense(features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))))
features=features.withColumn("feats", features_expl(features.features))
The result I get is: ValueError: could not convert string to float: mobile:0.0. It seems that the second split is never performed, and float() is called on the whole field:value string.
What I would like to get is:
+--------------+-----+
| features|label|
+--------------+-----+
| [1, 2, 3]| 0|
| [4, 5, 6]| 0|
| [7, 8, 9]| 1|
| [10, 11, 12]| 0|
+--------------+-----+