Spark 2.2 dataframe [scala]

Question

OrderNo    Status1    Status2     Status3
123    Completed      Pending     Pending
456    Rejected   Completed   Completed
789    Pending    In Progress     Completed

Above is the table which is the input data set and the expected output is below. The catch here is we should count based on the order no and not by no of status occurrences. Can we do this with the help of spark dataframes using scala? Appreciate your help in advance.

Pending     2
Rejected    1
Completed   3
In Progress 2

score 0 · Accepted Answer · answered Jan 17 '19 at 10:09

You can try the following code. It counts the number of distinct OrderNo for all the status. I hope it helps.

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

val newDF = rawDF.withColumn("All_Status",  array($"Status1", $"Status2", $"Status3"))
    .withColumn("Status", explode($"All_Status"))
    .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))

Here are the results. (Note: In Progress only appears once in test data.)

+-----------+------------+ | Status|DistOrderCnt| +-----------+------------+ | Completed| 3| |In Progress| 1| | Pending| 2| | Rejected| 1| +-----------+------------+

This is awesome champ. You saved my day, It worked perfectly :) — Ansip, Jan 18 '19 at 01:58
I am glad it helped. Please accept the answer if it was helpful. — Apurba Pandey, Jan 18 '19 at 02:17

Spark 2.2 dataframe [scala]

1 Answers1