-1
OrderNo    Status1    Status2     Status3
123    Completed      Pending     Pending
456    Rejected   Completed   Completed
789    Pending    In Progress     Completed

Above is the table which is the input data set and the expected output is below. The catch here is we should count based on the order no and not by no of status occurrences. Can we do this with the help of spark dataframes using scala? Appreciate your help in advance.

Pending     2
Rejected    1
Completed   3
In Progress 2
Ansip
  • 73
  • 1
  • 9

1 Answers1

0

You can try the following code. It counts the number of distinct OrderNo for all the status. I hope it helps.

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

val newDF = rawDF.withColumn("All_Status",  array($"Status1", $"Status2", $"Status3"))
    .withColumn("Status", explode($"All_Status"))
    .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))

Here are the results. (Note: In Progress only appears once in test data.)

+-----------+------------+ | Status|DistOrderCnt| +-----------+------------+ | Completed| 3| |In Progress| 1| | Pending| 2| | Rejected| 1| +-----------+------------+

Apurba Pandey
  • 1,061
  • 10
  • 21