
I have a Dataset like the following

monthYear    code
201601       11
201601       12
201601       12
201601       10
201602       null
201602       21
201602       21
201602       21
201603       null

When code is null, I want to replace it with the code that appeared most often during the previous month. For the above example, the first null would be replaced by 12 and the second by 21.

So the result would be the following.

monthYear    code
201601       11
201601       12
201601       12
201601       10
201602       12
201602       21
201602       21
201602       21
201603       21

How can I achieve this?


2 Answers


You can use the fill method of DataFrameNaFunctions, which replaces null or NaN values in a DataFrame: https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html#fill(double)

Example

scala> val df = spark.read.json("../test.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show
+----+----+
| age|name|
+----+----+
|  12| xyz|
|null| abc|
+----+----+

scala> df.na.fill(0, Seq("age"))
res3: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> res3.show
+---+----+
|age|name|
+---+----+
| 12| xyz|
|  0| abc|
+---+----+

If you also want to swap a specific value in this column, you can use res3.na.replace("age", Map(0 -> 51)). Note that fill and replace only substitute fixed values; they cannot replace nulls with a value computed from other rows (such as the previous month's most frequent code), so that logic has to be done on the source side.
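For example, continuing the REPL session above (the 51 is just an arbitrary stand-in value), the result should look roughly like this:

scala> res3.na.replace("age", Map(0 -> 51)).show
+---+----+
|age|name|
+---+----+
| 12| xyz|
| 51| abc|
+---+----+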

Sandeep Purohit

You can find the most frequent code per month with a window function and use coalesce to fill in the nulls.

Assuming df is a DataFrame that contains the table you showed:

val withCount = df.selectExpr("*", "count(code) over (partition by monthYear, code) as code_count")
val withMode = withCount.selectExpr("*", "first_value(code) over (partition by monthYear order by code_count desc) as max_code")
val result = withMode.selectExpr("*", "coalesce(code, max_code) as code_new")

This fills each null with the most frequent code of the same month. To use the previous month's most frequent code, as asked in the question, the per-month result has to be shifted back by one month; a sketch of that follows below.
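A minimal sketch of the previous-month variant, assuming df has the monthYear and code columns from the question and that monthYear values (yyyyMM) sort chronologically; the names modePerMonth, prevMode and result are just illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Most frequent non-null code per month (ties broken arbitrarily by row_number).
val counts = df.where(col("code").isNotNull).groupBy("monthYear", "code").count()
val byMonth = Window.partitionBy("monthYear").orderBy(col("count").desc)
val modePerMonth = counts
  .withColumn("rn", row_number().over(byMonth))
  .where(col("rn") === 1)
  .select(col("monthYear"), col("code").as("month_mode"))

// Keep every month (even months with only nulls) and shift each mode to the following month.
// Window.orderBy without partitionBy pulls the data to one partition, which is acceptable here
// because at this point there is only one row per month.
val byTime = Window.orderBy("monthYear")
val prevMode = df.select("monthYear").distinct()
  .join(modePerMonth, Seq("monthYear"), "left")
  .withColumn("prev_mode", lag("month_mode", 1).over(byTime))
  .select("monthYear", "prev_mode")

// Fill nulls with the previous month's most frequent code.
val result = df.join(prevMode, Seq("monthYear"), "left")
  .withColumn("code", coalesce(col("code"), col("prev_mode")))
  .drop("prev_mode")

On the sample data this should give 12 for the 201602 null and 21 for the 201603 row, matching the expected output in the question.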