
I'm using Spark v2.4.0 and I've run into a strange phenomenon:

I have a rather simple dataframe function and I use it on a dataframe called "new":

from product_mapping import product_mapping
    
new2 = product_mapping(new)   
new2.show()

product_mapping (it lives in a separate Python script because of the length of the statement):

import pyspark.sql.functions as F

def product_mapping(df):

    df = df.withColumn('PRODUCT',  F.when((df.var1 == "301") & (df.var2 == 0) & (df.var3 == 30), F.lit('101')) 
                              .when((df.var1 == "301") & (df.var2 == 1) & (df.var3 == 30), F.lit('102'))  
                              .when((df.var1 == "302") & (df.var2 == 0) & (df.var3 == 31), F.lit('103'))  
                              .when((df.var1 == "302") & (df.var2 == 1) & (df.var3 == 31), F.lit('104'))  
                              .when((df.var1 == "303") & (df.var2 == 0) & (df.var3 == 61), F.lit('105'))  
                              .when((df.var1 == "303") & (df.var2 == 0) & (df.var3 == 32), F.lit('106'))  
                              .when((df.var1 == "303") & (df.var2 == 1) & (df.var3 == 32), F.lit('107'))  
                              .when((df.var1 == "303") & (df.var2 == 1) & (df.var3 == 61), F.lit('108'))  
                              .when((df.var1 == "304") & (df.var2 == 0) & (df.var3 == 69), F.lit('109')) 
    (many more WHEN lines) 
                              .when((df.var1 == "304") & (df.var2 == 1) & (df.var3 == 69), F.lit('205')))

    return df

In total I have more than 150 WHEN lines, but the code does not seem to work; it throws this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o1754.showString.
: java.lang.StackOverflowError
        at org.codehaus.janino.CodeContext.extract16BitValue(CodeContext.java:720)
        at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:561)

However, when I shorten the statement to, say, 5 WHEN clauses, the code works fine ... so is there a maximum number of WHEN statements I can use? And how can I get around it?

Thanks

Niels
  • you could concatenate the var columns and create a mapping with the desired output: https://stackoverflow.com/questions/42980704/pyspark-create-new-column-with-mapping-from-a-dict – ScootCork Oct 21 '22 at 16:23
  • Hi, okay yes, looks like a possible solution, but how would the mapping look like with multiple arguments? Sorry never used this. – Niels Oct 21 '22 at 16:56
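
A minimal sketch of the approach ScootCork links to, adapted to a key built from the three columns (this is not part of the original thread's code, and the dictionary below is only a hypothetical excerpt of the full mapping):

    import itertools
    import pyspark.sql.functions as F

    # hypothetical excerpt of the full mapping; the key is "var1-var2-var3"
    mapping = {
        '301-0-30': '101',
        '301-1-30': '102',
        '302-0-31': '103',
    }

    # build one literal MAP expression from the Python dict
    mapping_expr = F.create_map(*[F.lit(x) for x in itertools.chain(*mapping.items())])

    def product_mapping(df):
        # combine the three columns into a single lookup key, then look it up in the map
        key = F.concat_ws('-', df.var1.cast('string'), df.var2.cast('string'), df.var3.cast('string'))
        return df.withColumn('PRODUCT', mapping_expr[key])

Keys that are missing from the dict simply yield NULL in PRODUCT, which is easy to check for afterwards.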

1 Answer


Here are my 2 cents: the idea is to concatenate the three variables into a single key and then map it using a dictionary of key-value pairs.

  1. Declare the function as follows:

     import pyspark.sql.functions as fx
     from pyspark.sql.types import IntegerType

     def product_mapping(df):
    
         conditions = {  '301-1-30' : '102',
                         '302-0-31' : '103',
                         '302-1-31' : '104',
                         '303-0-61' : '105',
                         '301-0-30' : '106',
                         '303-1-32' : '107',
                         '303-1-61' : '108',
                         '304-0-69' : '109',
                         '301-0-31' : '114'
                     }
    
         # build a single "var1-var2-var3" key, then replace each key with its product code
         df = df.withColumn('PRODUCT', fx.concat(df.var1, fx.lit('-'), df.var2, fx.lit('-'), df.var3))
         df1 = df.replace(to_replace=conditions, subset=['PRODUCT'])\
                 .withColumn('PRODUCT', fx.col('PRODUCT').cast(IntegerType()))
         return df1
    
  2. Create a data frame with the 3 required columns (var1, var2, var3):

     new  = spark.createDataFrame([(301,0,30),(301,1,30),(301,0,31)],schema = ['var1','var2','var3'])
    
  3. Call the function and print the returned data frame:

     new2 = product_mapping(new)   
     new2.show()
    

Please refer to the screenshot for reference: [screenshot of the new2.show() output]
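
For orientation (this output is reconstructed from the sample rows and the conditions dict above, not copied from the screenshot), new2.show() should print something like:

    +----+----+----+-------+
    |var1|var2|var3|PRODUCT|
    +----+----+----+-------+
    | 301|   0|  30|    106|
    | 301|   1|  30|    102|
    | 301|   0|  31|    114|
    +----+----+----+-------+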

Banu