
I've searched and can't find a suitable answer for my PySpark issue. I'm looking for an alternative approach that is more efficient and doesn't use a UDF.

I have a simple equation in a UDF with inputs from (a) a literal constant, (b) column values, and (c) values from a list (or dict). The output must be created multiple times and stored in an array. Is it possible to do this outside of a UDF?

I've knocked up this simple example, although my actual issue is slightly more complex, with more rows, a bigger equation, and loops over 40 times:

NOTE: V3 example question:

from pyspark.sql.functions import *
from pyspark.sql.types import *

test_data = [("A1",10.5), ("A2",40.5), ("A3",60.5)]

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("num1", DoubleType(), True)])
 
df = spark.createDataFrame(data=test_data,schema=schema)

const1 = 10
const2 = 20
num_lst1 = [2.1,4.2,6.3,8.4,10.5]
num_lst2 = [20,40,60,80,100]
num_lst3 = [100.1,200.2,300.3,400.4,500.5]

def udf_whatever(num_lst1, num_lst2, num_lst3):
    # Factory: returns a UDF that closes over the three Python lists.
    def whatever(const1, const2, val1):
        # Evaluate the equation once per list position, building a 5-element array.
        return [const1 + val1 + const2 + (num_lst1[i] * num_lst2[i]) + num_lst3[i]
                for i in range(5)]
    return udf(whatever, ArrayType(DoubleType()))

df2 = df.withColumn("UDF_OUT",
                    udf_whatever(num_lst1, num_lst2, num_lst3)(lit(const1), lit(const2), col("num1")))
df2.show(truncate=False)

+---+----+-------------------------------------+
|ID |num1|UDF_OUT                              |
+---+----+-------------------------------------+
|A1 |10.5|[182.6, 408.7, 718.8, 1112.9, 1591.0]|
|A2 |40.5|[212.6, 438.7, 748.8, 1142.9, 1621.0]|
|A3 |60.5|[232.6, 458.7, 768.8, 1162.9, 1641.0]|
+---+----+-------------------------------------+

With Emma's help (in the comments) I've got this to work, but it seems a little expensive to create a new column per list, especially with millions of rows. Is there a better way?

df3 = df.withColumn('MAP_LIST1', array(*map(lit, num_lst1)))\
        .withColumn('MAP_LIST2', array(*map(lit, num_lst2)))\
        .withColumn('MAP_LIST3', array(*map(lit, num_lst3)))\
        .withColumn('EQUATION_OUT', expr(f"""transform(MAP_LIST1, (x, i) -> {const1} + num1 + {const2} + (x * MAP_LIST2[i]) + MAP_LIST3[i])"""))
df3.show()

Any help much appreciated! Rick

1 Answer


One way to do this is to use array_repeat and transform.

First, use array_repeat to create the base array with just the num3 values (num1, num2, and num3 here refer to an earlier revision of the question's sample data).

Then, use transform to calculate the value for each num3 value in the array.

For Spark 3.1+

repeat = 5
const = 10

# Spark 3.1+ exposes transform() as a Python function, so a lambda works here
df = (df.withColumn('arr', array_repeat('num3', repeat))
      .withColumn('arr', transform(col('arr'), lambda x, i: lit(const) + col('num1') + col('num2') + i * x)))

For Spark 2.4+ (below 3.1)

# On Spark 2.4, transform is only available through the SQL expression API
df = (df.withColumn('arr', array_repeat('num3', repeat))
      .withColumn('arr', expr('transform(arr, (x, i) -> 10 + num1 + num2 + i * x)')))

============================================================

Update with the new equation (const + col + list element): if there is only one array (num_lst), you can initialize UDF_OUT with the array and use transform to add the other variables to it.

df = (df.withColumn('UDF_OUT', array(*map(lit, num_lst)))
      .withColumn('UDF_OUT', expr(f"""
          transform(UDF_OUT, (x, i) -> {const} + num1 + x)
      """)))
Emma
  • your answer was great but my question wasn't quite correct, sorry! I've updated the question and example slightly if you could take another look. I tried using your transform code though but was struggling to use a LIST (or dict) as an input. I'm using spark v2.4.2 – Rick Paddock May 26 '22 at 10:52
  • I updated with your new samples. This is based on your input so if you have multiple lists or a dict instead of list, you need to adjust the transform syntax. – Emma May 26 '22 at 14:33
  • I see what you've done. You've created a new column based on the list, and then you're transforming that and using its values as 'x'. Interesting! I do have multiple lists/dictionaries so I'll see if I can tweak your code to include them! – Rick Paddock May 26 '22 at 15:48
  • I could initialize the array with dummy data, but if you have an array that will be a part of the equation, I just shortcut to initialize with the array. In short, you'll need an initial array to do `transform`. – Emma May 26 '22 at 15:52
  • I think I got it, how exciting! I created another list so I have num_list1 & num_list2. I kept the code as you had it but included another `withColumn` before the transform to bring in num_list2 and I called it MAP_LIST2 (so now we have both lists added as columns). I then added this to the transform and it worked!: `+ MAP_LIST2[i]`. However, isn't this a little inefficient using map to bring in the list as a column and storing it across all rows? I have 14 million rows in my data. – Rick Paddock May 27 '22 at 07:26
  • How big is the list? If small (<100), can those 2 lists be calculated outside of Spark? You need only 1 array to do `transform`; other lists can be served in a different way. You can also post another question with regards to the optimization of the current code. Please include the 2 lists and dict parts close to your code in the question so that we can better help you. – Emma May 27 '22 at 14:05
  • lists are no bigger than 40 items and there are about 20 of them in total. I've updated the question again with a new example. I got it working using a version of your example but it required making loads of new columns containing arrays. There must be a more efficient way. – Rick Paddock May 30 '22 at 13:11
  • If it is less than 40 items, I think it is still efficient to do it outside of Spark for the part where the values are constant (const1, const2, and num_lst1, 2, 3 are all constant): `[equation for i in range(len(num_lst1))]`. If this doesn't help, try posting a new question; this one has drifted from the original question. Btw, can you write the equation as a recurrence relation, or is it a totally custom equation? – Emma May 30 '22 at 20:45
  • I've marked the answer as complete as I've got it working, BUT in my example I stupidly forgot to add arrays in the initial dataframe (so imagine num1 was an array of 5 items and not a single value) and the equation contained: num1[i] * num_list[i] ..... so I need to use item1 in the dataframe array with item1 in the external list and so on. Sorry, I edited the initial question with a new example as I don't have pyspark in front of me to edit the code/results accurately (see the sketch after this thread). – Rick Paddock Jun 01 '22 at 07:25
  • please open a new question. we cannot keep editing. – Emma Jun 01 '22 at 13:18
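
For the follow-up case in the last two comments, where num1 is itself a 5-element array column and the equation needs num1[i] * num_lst[i], the same pattern should still apply: embed the constant list as one array column and index it with the lambda's position argument. A hedged sketch, assuming a revised schema in which num1 is ArrayType(DoubleType()); num_lst, LST, ARR_OUT, and df5 are illustrative names:

from pyspark.sql.functions import array, expr, lit

num_lst = [2.1, 4.2, 6.3, 8.4, 10.5]  # assumed external constant list

# transform supplies the position i, which pairs each element of the
# num1 array column with the matching element of the constant list
df5 = (df.withColumn('LST', array(*map(lit, num_lst)))
         .withColumn('ARR_OUT', expr('transform(num1, (x, i) -> x * LST[i])')))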