
I have a string column in libsvm format (as used by Spark's ML library): field1:value field2:value ...

+--------------+-----+
|      features|label|
+--------------+-----+
|   a:1 b:2 c:3|    0|
|   a:4 b:5 c:6|    0|
|   a:7 b:8 c:9|    1|
|a:10 b:11 c:12|    0|
+--------------+-----+

I want to extract the values and store them in an array for each row, using PySpark.

features.printSchema()

root
 |-- features: string (nullable = false)
 |-- label: integer (nullable = true)

Since the affected column is part of a DataFrame, I am using the following udf:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors

features_expl = udf(lambda features: Vectors.dense(features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))))
features=features.withColumn("feats", features_expl(features.features))

The result I get is: ValueError: could not convert string to float: mobile:0.0. It seems that the second split is never performed and float() is called on the whole string.

What I would like to get is:

+--------------+-----+
|      features|label|
+--------------+-----+
|     [1, 2, 3]|    0|
|     [4, 5, 6]|    0|
|     [7, 8, 9]|    1|
|  [10, 11, 12]|    0|
+--------------+-----+

1 Answer

You have two major problems with your udf. First, it doesn't work the way you intended. Consider the heart of your code as the following function:

from pyspark.ml.linalg import Vectors
def features_expl_non_udf(features): 
    return Vectors.dense(
        features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))
    )

If you call it on one of your strings:

features_expl_non_udf("a:1 b:2 c:3")
#ValueError: could not convert string to float: a:1

This is because features.split(" ") returns ['a:1', 'b:2', 'c:3'], which you are passing straight to the Vectors.dense constructor. That can never work: the elements are still strings of the form name:value.
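You can reproduce the intermediate value in plain Python, without Spark:

```python
# Plain-Python look at the intermediate value (no Spark needed).
tokens = "a:1 b:2 c:3".split(" ")
print(tokens)  # ['a:1', 'b:2', 'c:3']

# Converting any of these tokens with float() raises the same ValueError:
try:
    float(tokens[0])
except ValueError as exc:
    print(exc)
```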

What you intended was to first split on the space, then split each element of the resulting list on :. Then you can convert those values to float and pass the list to Vectors.dense.

Here is the proper implementation of your logic:

def features_expl_non_udf(features):
    # split on whitespace, then take the part after ":" in each token
    return Vectors.dense([float(feat.split(":")[1]) for feat in features.split()])

features_expl_non_udf("a:1 b:2 c:3")
#DenseVector([1.0, 2.0, 3.0])

The second problem with your udf is that you didn't specify a returnType, so it defaults to StringType. For a DenseVector you need to use VectorUDT as the returnType.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

features_expl = udf(
    lambda features: Vectors.dense(
        [float(feat.split(":")[1]) for feat in features.split()]
    ),
    VectorUDT()
)
features.withColumn("feats", features_expl(features.features)).show()
#+--------------+-----+----------------+
#|      features|label|           feats|
#+--------------+-----+----------------+
#|   a:1 b:2 c:3|    0|   [1.0,2.0,3.0]|
#|   a:4 b:5 c:6|    0|   [4.0,5.0,6.0]|
#|   a:7 b:8 c:9|    1|   [7.0,8.0,9.0]|
#|a:10 b:11 c:12|    0|[10.0,11.0,12.0]|
#+--------------+-----+----------------+

As an alternative, you can do the string processing on the Spark side using regexp_replace and split, but you'll still need a udf to convert the final output to a DenseVector.

from pyspark.sql.functions import regexp_replace, split, udf
from pyspark.ml.linalg import Vectors, VectorUDT

toDenseVector = udf(Vectors.dense, VectorUDT())

features.withColumn(
    "features",
    toDenseVector(
        split(regexp_replace("features", r"\w+:", ""), r"\s+").cast("array<float>")
    )
).show()
#+----------------+-----+
#|        features|label|
#+----------------+-----+
#|   [1.0,2.0,3.0]|    0|
#|   [4.0,5.0,6.0]|    0|
#|   [7.0,8.0,9.0]|    1|
#|[10.0,11.0,12.0]|    0|
#+----------------+-----+
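If you want to sanity-check the regexp_replace/split logic before running it on Spark, the same transformation can be sketched with Python's stdlib re module (a plain-Python illustration, not part of the Spark pipeline; parse_row is a hypothetical helper name):

```python
import re

def parse_row(s):
    # strip the "name:" prefixes, mirroring regexp_replace("features", r"\w+:", "")
    cleaned = re.sub(r"\w+:", "", s)
    # split on whitespace and convert to float,
    # mirroring split(..., r"\s+").cast("array<float>")
    return [float(v) for v in re.split(r"\s+", cleaned.strip())]

print(parse_row("a:10 b:11 c:12"))  # [10.0, 11.0, 12.0]
```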
pault
  • Thanks a lot. So return type is only needed for UDF? Because you don't have it in the def code... – Georgios Kourogiorgas Jun 19 '19 at 17:36
  • @GeorgiosKourogiorgas yes the `def` is a pure python function. For `udf`s that will be used in spark, you need a `returnType`. If you don't specify, it will default to `StringType` – pault Jun 19 '19 at 17:37
  • What if the original features contains a `None` value, or rows with missing fields, such as `['"b":2','"a":1 "b":2 "c":3']`? – rosefun Jul 29 '20 at 02:25