0

Here is the scenario from legacy code need to converted to spark scala. Any pointers will be highly appreciated.

Scenario: I need to add a new column to a dataframe/dataset using "withColumn" based on conditionally checking 20 - 22 other columns values and then derive this new column value. Any pointers on how to implement this in spark Scala? Thanks much. I have tried using UDF and passing a map of 22 column as key:value and If else checking with mutable variables but was informed by the experts from this forum that's not recommended so seeking guidance on what is the right way to achieve this ?

OR using dataset.mappartitions and using the mutable variables inside this function is right way to do it ?

val calculate = dataset.mapPartitions(partition => partition.map(x => {
      var value1 = "NA"
      var value1  = "NA"
 
 set the values of the mutable variables value1 and value2 based on the column values
 if ( x.fieldA ="xyx")
 {
    value1 = "ABC"
    value2 = "cbz
 }
 eles if (x.fieldA ="112" & x.fieldB ="xy1")
{
    value1 = "zya"
    value2 = "ab"
}

    df(
        x.fldC
    x.fldB
    value1
    value2
      )
    }

case class df(fldc:String,fldb:String,value1:String:value2:String)

Can you please let me know what other details I should provide as I have updated the question above ?

I am new to this distributed/spark scala development so might be asking basic questions.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
novice8989
  • 149
  • 1
  • 11

1 Answers1

0
import spark.implicits._

import org.apache.spark.sql.functions._

val sourceDF = Seq(
  (1,2,3,4,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (11,12,13,14,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,25,26,27,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (1,2,3,4,5,6,7,38,39,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,5,6,7,8,9,410, "11", "12", "13", "14", "15", "16", "17", "18", "19", true)
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10",
  "col11", "col12", "col13", "col14", "col15", "col16", "col17", "col18", "col19", "col20")
  
val resDF = sourceDF
  .withColumn("col_result",
    when(
        'col1.equalTo(1) && 'col2.equalTo(2) && 'col3.equalTo(3) &&
        'col4.equalTo(4) && 'col5.equalTo(5) && 'col6.equalTo(6) &&
        'col7.equalTo(7) && 'col8.equalTo(8) && 'col9.equalTo(9) &&
        'col10.equalTo(10) && 'col11.equalTo("11") && 'col12.equalTo("12") &&
        'col13.equalTo("13") && 'col14.equalTo("14") && 'col15.equalTo("15") &&
        'col16.equalTo("16") && 'col17.equalTo("17") && 'col18.equalTo("18") &&
        'col19.equalTo("19") && 'col20.equalTo(true),"result").otherwise(null))
    
resDF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result|
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |result    |
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+

Variant 2

val res1DF = sourceDF.withColumn("col_result_1",
  when(col("col8") === 38 || col("col20") === false, "good check" )
  .when(col("col10") === 410 && col("col17") === "17" && col("col20") === true, "next good check")
    .otherwise("we use when  many many")
)

res1DF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result_1          |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |next good check       |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
mvasyliv
  • 1,214
  • 6
  • 10
  • Thank you @mvasyliv. In my case, challenge is how to translate the nested if else and else . For e.g If (co1=co2) { } else if ( co12='xfy' and co4='ABC) else if ( co12='qaw' and co4='ABC) { //nested if start // if (col1=Co5 and Col5 !=Unknown) {} else if (){} else {} // else { if(){} //nested else if {} else { } } if end here} else if {} – novice8989 Feb 07 '21 at 18:42
  • @novice8989 - Please, see Variant 2. Thank you – mvasyliv Feb 08 '21 at 08:31
  • Thank you. Sorry i am not getting it, how is this equivalent to if {} elseif { if else } else if else. for e.g If (col1 = "abc") { return a } else if (col1 = "bcd") { if (col2 = "xyc") { return "c" } else if (col2 = "yyy") { return "d" } else { if (col2 == "bbb") return "e" else return "k" } } else if (col1 ="ghy") {return "f"} else {return "p} – novice8989 Feb 09 '21 at 06:10
  • https://stackoverflow.com/questions/25157451/spark-sql-case-when-then – mvasyliv Feb 09 '21 at 09:06