
I want to calculate the number of occurrences of distinct rows, according to some attributes, and return the result for each distinct row:

I am trying to use groupBy to get the distinct rows according to the values of the col3 and col4 columns with:

dF.select("col0","col1","col2","col3","col4").groupBy("col3","col4")

But then how do I calculate the occurrence count of each distinct "grouped" row?

Thank you for the help.

  • Are you looking for [count](https://stackoverflow.com/a/46417164/2129801)? – werner Apr 24 '21 at 12:44
  • Yes `agg` and `count` but after grouping each distinct row, because at the end i need to have the count (from the original dataset) of each grouped distinct row. (after rows are grouped the count is 1 for each, but i want the original count while grouping the rows) i hope it is clear – Martin Moore Apr 24 '21 at 12:54
  • maybe you could provide some example data (input data and expected output)? – werner Apr 24 '21 at 12:59

1 Answer


Your clarifications are not quite clear, but I believe you are trying to add a column to dF with the number of occurrences of each row after a groupBy based on columns col3 and col4.

Let's say we have the following DataFrame:

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|   0|  10| 100|cat26|30.9|
|   1|  20| 200|cat13|22.1|
|   2|  30| 300|cat26|30.9|
|   3|  40| 400|cat26|30.9|
|   4|  50| 500|cat15|15.3|
|   5|  60| 600|cat13|22.1|
+----+----+----+-----+----+

First we need to group and aggregate the rows per col3 and col4, as werner implied, into a DataFrame named grouped.

// Scala
import org.apache.spark.sql.functions.count

val grouped = dF.select("col0","col1","col2","col3","col4")
  .groupBy("col3","col4")
  .agg(count("*").as("counter"))

// Java (Spark 2.x+, where the Java DataFrame type is Dataset<Row>)
import static org.apache.spark.sql.functions.count;

Dataset<Row> grouped = dF.select("col0","col1","col2","col3","col4")
        .groupBy("col3","col4")
        .agg(count("*").as("counter"));

grouped is going to have this as result:

+-----+----+-------+
| col3|col4|counter|
+-----+----+-------+
|cat13|22.1|      2|
|cat26|30.9|      3|
|cat15|15.3|      1|
+-----+----+-------+
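If it helps to see what the groupBy/count step computes without spinning up a Spark session, here is a plain Scala collections sketch of the same per-group counting, using the example table above as literal data (no Spark required):

```scala
// The example rows as (col0, col1, col2, col3, col4) tuples.
val rows = Seq(
  (0, 10, 100, "cat26", 30.9),
  (1, 20, 200, "cat13", 22.1),
  (2, 30, 300, "cat26", 30.9),
  (3, 40, 400, "cat26", 30.9),
  (4, 50, 500, "cat15", 15.3),
  (5, 60, 600, "cat13", 22.1)
)

// Group rows by their (col3, col4) pair and take each group's size;
// the sizes match the "counter" column in the table above.
val counter = rows
  .groupBy(r => (r._4, r._5))
  .map { case (key, group) => key -> group.size }
```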

Then, we must join dF and grouped, using their common columns (col3, col4) as the keys for the join (Seq is a Scala class, but the Java API offers ways to call this overload; if that throws an error, use the workaround from the question linked in the comments). The select after the join preserves the original column order, since join naturally puts col3 and col4 at the start of the schema.

// Scala
dF.join(grouped, Seq("col3", "col4"))
  .select("col0","col1","col2","col3","col4", "counter")

// Java: join on an explicit column-equality condition. Note that this kind
// of join keeps both copies of col3/col4 in the schema, so the select must
// qualify which DataFrame each column comes from to avoid ambiguity.
dF.join(grouped, dF.col("col3").equalTo(grouped.col("col3"))
        .and(dF.col("col4").equalTo(grouped.col("col4"))))
  .select(dF.col("col0"), dF.col("col1"), dF.col("col2"),
          dF.col("col3"), dF.col("col4"), grouped.col("counter"));

And the output looks like this:

+----+----+----+-----+----+-------+
|col0|col1|col2| col3|col4|counter|
+----+----+----+-----+----+-------+
|   0|  10| 100|cat26|30.9|      3|
|   1|  20| 200|cat13|22.1|      2|
|   2|  30| 300|cat26|30.9|      3|
|   3|  40| 400|cat26|30.9|      3|
|   4|  50| 500|cat15|15.3|      1|
|   5|  60| 600|cat13|22.1|      2|
+----+----+----+-----+----+-------+
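The join step can be mirrored the same way with plain Scala collections: compute the group sizes, then look up each row's key and append the size as a trailing counter field, reproducing the final output above (again using the example data, no Spark required):

```scala
// The example rows as (col0, col1, col2, col3, col4) tuples.
val rows = Seq(
  (0, 10, 100, "cat26", 30.9),
  (1, 20, 200, "cat13", 22.1),
  (2, 30, 300, "cat26", 30.9),
  (3, 40, 400, "cat26", 30.9),
  (4, 50, 500, "cat15", 15.3),
  (5, 60, 600, "cat13", 22.1)
)

// Group sizes per (col3, col4) key, as computed by the groupBy step.
val counter = rows.groupBy(r => (r._4, r._5)).map { case (k, g) => k -> g.size }

// The "join": append each row's group size as a counter field.
val joined = rows.map(r => (r._1, r._2, r._3, r._4, r._5, counter((r._4, r._5))))
```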
Coursal
  • `dF.join(grouped, Seq("col3", "col4")) .select("col0","col1","col2","col3","col4", "counter")` this line is in JAVA ? if not could you provide it in JAVA ? – Martin Moore Apr 24 '21 at 14:40
  • Updated with the workaround for Java from here: https://stackoverflow.com/questions/35211993/how-to-join-multiple-columns-in-spark-sql-using-java-for-filtering-in-dataframe/35212682#35212682 – Coursal Apr 24 '21 at 14:43