-2

I've created a dataframe with info about sales. Now I want to add a column (metric1) with booleans to the dataframe which values will depend on the sl.review field: if sl.review contains an empty string, then metric1 will be false and true otherwise if there is a review in sl.review.

val salesDf: DataFrame = salesRawDf.select($"stores", explode($"sales").as("sl"))
      .select($"stores.id", $"stores.name", $"sl.id", $"sl.current_sales", $"sl.review")

How is it possible to achieve with DataFrame? I've read this related question but still can't figure out how to implement this in my case.

samba
  • 2,821
  • 6
  • 30
  • 85

2 Answers2

2

You can use when function of spark like below

//Input df

+---+--------+-------+
|mid|   mname|mreview|
+---+--------+-------+
|100|     abc|       |
|101|     bcd|   Good|
|104|avengers|   Best|
|108|    Heri|       |
+---+--------+-------+

//Solution

import org.apache.spark.sql.functions._
df.withColumn("metric1", when(df.col("mreview") === "", false) otherwise true).show

//Output df

+---+--------+-------+-------+
|mid|   mname|mreview|metric1|
+---+--------+-------+-------+
|100|     abc|       |  false|
|101|     bcd|   Good|   true|
|104|avengers|   Best|   true|
|108|    Heri|       |  false|
+---+--------+-------+-------+

I hope this will help you

Manoj Kumar Dhakad
  • 1,862
  • 1
  • 12
  • 26
1

Try using this User Defined Function which takes more than one single value as parameter:

def reviewIsEmpty = udf((review: String) => {
  review.isEmpty
})

And then just call it creating the new column:

import spark.implicits._
salesDf.withColumn("metric1", reviewIsEmpty($"sl.review"))

Of course, you can change the behavior of the UDF to check that the string is not only composed by blank spaces, for example.

Hope it helps

Álvaro Valencia
  • 1,187
  • 8
  • 17