
I want to make all values in an array column of my PySpark DataFrame negative without exploding (!). I tried this UDF, but it didn't work:

import pyspark.sql.functions as func
import pyspark.sql.types as T

negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType()))
cast_contracts = cast_contracts \
    .withColumn('forecast_values', negative('forecast_values'))

Can someone help?

Example data frame:

from pyspark.sql import Row

df = sc.parallelize(
   [Row(name='Joe', forecast_values=[1.0, 2.0, 3.0]),
    Row(name='Mary', forecast_values=[4.0, 7.1])]).toDF()
>>> df.show()
+----+---------------+
|name|forecast_values|
+----+---------------+
| Joe|[1.0, 2.0, 3.0]|
|Mary|     [4.0, 7.1]|
+----+---------------+

Thanks

LN_P

2 Answers


I know this is a year-old post, so the solution I'm about to give may not have been an option previously (it's new to Spark 3). If you're using Spark 3.0 or above with the PySpark API, you should consider using the Spark SQL higher-order function transform inside pyspark.sql.functions.expr. Don't confuse this with DataFrame.transform(), which is used for chaining transformations. At any rate, here is the solution:

df.withColumn("negative", F.expr("transform(forecast_values, x -> x * -1)"))

The only thing you need to make sure of is that the values are numeric (int or float). This approach is much more efficient than exploding the array or using a UDF.
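
If you are on Spark 3.1 or later, the same higher-order function is also exposed directly in the Python API as pyspark.sql.functions.transform, so you can skip the SQL expression string. A minimal sketch (not from the original answer), assuming the same forecast_values column as above:

import pyspark.sql.functions as F

# F.transform applies the lambda to every element of the array column,
# without exploding the array or registering a UDF.
df_neg = df.withColumn("negative", F.transform("forecast_values", lambda x: x * -1))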

mrammah

It's just that you're not looping over the list values to multiply each of them by -1:

import pyspark.sql.functions as F
import pyspark.sql.types as T

negative = F.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))

You cannot escape the UDF, but the following is the best possible way to do it. It works better if you have large lists:

import numpy as np

negative = F.udf(lambda x: np.negative(x).tolist(), T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))
cast_contracts.show()
+----+------------------+
|name|   forecast_values|
+----+------------------+
| Joe|[-1.0, -2.0, -3.0]|
|Mary|      [-4.0, -7.1]|
+----+------------------+
pissall
  • Thanks. This returns an array of nulls. Maybe my array is an array of strings and I need to convert it to float first. Also, my runtime seems to have increased by 12 minutes. Do you reckon this could be due just to the UDF? – LN_P Oct 22 '19 at 13:05
  • @LN_P Yes, a UDF will spoil your performance, but there is no built-in functionality to operate on `array`-type columns. How many rows are you working with? – pissall Oct 22 '19 at 13:13
  • `negative = F.udf(lambda x: [float(i) * -1 for i in x], T.ArrayType(T.FloatType()))` if it's a string array (see the UDF-free sketch below) – pissall Oct 22 '19 at 13:16
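
For completeness: if the column really is an array of strings and you are on Spark 3, the whole conversion can also be done without a UDF. This is a minimal sketch, not part of the original answers; it assumes the column is named forecast_values:

import pyspark.sql.functions as F

# Cast array<string> to array<float>, then negate each element with the
# SQL higher-order function transform -- no UDF and no explode needed.
df_neg = (
    df.withColumn("forecast_values", F.col("forecast_values").cast("array<float>"))
      .withColumn("forecast_values", F.expr("transform(forecast_values, x -> x * -1)"))
)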