Apache Spark: dealing with CASE statements

Question

I am translating SQL code to PySpark and came across some SQL CASE statements. I don't know how to approach CASE statements in PySpark. I am planning on creating an RDD and then using rdd.map to do some logic checks. Is that the right approach? Please help!

Basically, I need to go through each row in the RDD or DataFrame and, based on some logic, edit one of the column values.

    case
        when (e."a" Like 'a%' Or e."b" Like 'b%')
             And e."aa"='BW' And cast(e."abc" as decimal(10,4))=75.0 Then 'callitA'
        when (e."a" Like 'b%' Or e."b" Like 'a%')
             And e."aa"='AW' And cast(e."abc" as decimal(10,4))=75.0 Then 'callitB'
        else 'CallitC'
    end

Why do you need to convert anything? PySpark can run SparkSQL just fine — OneCricketeer, Oct 11 '16 at 16:29
Because it is a long SQL CASE statement (20 lines). I would rather do it programmatically using some logic. — Amardeep Flora, Oct 11 '16 at 16:37
You could use [`pyspark.sql.functions.when()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when). Not sure how that handles multiple cases, though — OneCricketeer, Oct 11 '16 at 16:42
you could write all of this as logic in a map function. have you tried that? — Kristian, Oct 11 '16 at 21:53

Shantanu Sharma · Accepted Answer · 2019-06-03T15:10:59.447

Here are a few ways to write an If-Else / When-Then-Else / When-Otherwise expression in PySpark.

Sample dataframe

df = spark.createDataFrame([(1,1),(2,2),(3,3)],['id','value'])

df.show()

#+---+-----+
#| id|value|
#+---+-----+
#|  1|    1|
#|  2|    2|
#|  3|    3|
#+---+-----+

#Desired Output:
#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+

Option#1: withColumn() using when-otherwise

from pyspark.sql.functions import when

df.withColumn("value_desc",when(df.value == 1, 'one').when(df.value == 2, 'two').otherwise('other')).show()

Option#2: select() using when-otherwise

from pyspark.sql.functions import when

df.select("*",when(df.value == 1, 'one').when(df.value == 2, 'two').otherwise('other').alias('value_desc')).show()

Option#3: selectExpr() using SQL equivalent CASE expression

df.selectExpr("*","CASE WHEN value == 1 THEN  'one' WHEN value == 2 THEN  'two' ELSE 'other' END AS value_desc").show()

SQL-like expressions can also be used in withColumn() and select() via the pyspark.sql.functions.expr function. Here are examples.

Option#4: select() using expr function

from pyspark.sql.functions import expr 

df.select("*",expr("CASE WHEN value == 1 THEN  'one' WHEN value == 2 THEN  'two' ELSE 'other' END AS value_desc")).show()

Option#5: withColumn() using expr function

from pyspark.sql.functions import expr

# withColumn() already names the column, so the expression needs no "AS value_desc" alias
df.withColumn("value_desc",expr("CASE WHEN value == 1 THEN  'one' WHEN value == 2 THEN  'two' ELSE 'other' END")).show()

Output:

#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+

Ram Ghadiyaram · Answer 2 · 2017-08-24T17:29:38.417

I'm not good at Python, but I will try to give some pointers based on what I have done in Scala.

Question: rdd.map and then do some logic checks. Is that the right approach?

It's one approach.

withColumn is another approach.

The DataFrame.withColumn method in PySpark supports adding a new column or replacing an existing column of the same name.

In this context you have to work with Column expressions, either via a Spark UDF or via the when/otherwise syntax.

For example, with a DataFrame df that has name and age columns:

from pyspark.sql import functions as F
df.select(df.name, F.when(df.age > 4, 1).when(df.age < 3, -1).otherwise(0)).show()


+-----+--------------------------------------------------------+
| name|CASE WHEN (age > 4) THEN 1 WHEN (age < 3) THEN -1 ELSE 0|
+-----+--------------------------------------------------------+
|Alice|                                                      -1|
|  Bob|                                                       1|
+-----+--------------------------------------------------------+


from pyspark.sql import functions as F
df.select(df.name, F.when(df.age > 3, 1).otherwise(0)).show()

+-----+---------------------------------+
| name|CASE WHEN (age > 3) THEN 1 ELSE 0|
+-----+---------------------------------+
|Alice|                                0|
|  Bob|                                1|
+-----+---------------------------------+

You can also use a UDF instead of when/otherwise.
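
A UDF version of the CASE expression from the question could look something like the sketch below. The names `classify` and `add_label_with_udf`, the output column `label`, and the input columns a, b, aa and abc are assumptions made for illustration. The decimal comparison here is exact, whereas the SQL cast to decimal(10,4) rounds to four places, so the two are not strictly identical for inputs with more decimal digits.

```python
from decimal import Decimal, InvalidOperation


def classify(a, b, aa, abc):
    """Plain-Python version of the CASE; missing or unparsable inputs fall through to the default."""
    a = a or ""
    b = b or ""
    try:
        is_75 = abc is not None and Decimal(str(abc)) == Decimal("75")
    except InvalidOperation:
        is_75 = False

    if (a.startswith("a") or b.startswith("b")) and aa == "BW" and is_75:
        return "callitA"
    if (a.startswith("b") or b.startswith("a")) and aa == "AW" and is_75:
        return "callitB"
    return "CallitC"


def add_label_with_udf(df):
    """Wrap classify() in a Spark UDF and add its result as a hypothetical `label` column."""
    # Imported here so classify() can be unit-tested without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    classify_udf = F.udf(classify, StringType())
    return df.withColumn(
        "label",
        classify_udf(F.col("a"), F.col("b"), F.col("aa"), F.col("abc")),
    )
```

Python UDFs are generally slower than built-in column expressions such as when/otherwise, because each row has to be passed to a Python worker, so this route is probably best kept for logic that is hard to express with Column functions.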

if you are okay please care to accept [the answer as owner](https://meta.stackexchange.com/a/5235/369717) and [vote-up](https://meta.stackexchange.com/a/173400/369717) — Ram Ghadiyaram, Jun 28 '19 at 16:11
