Your clarifications are not quite clear, but I believe you are trying to add a column to dF with the count of occurrences of the rows after a groupBy based on columns col3 and col4.
Let's say we have the following DataFrame:
+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
| 0| 10| 100|cat26|30.9|
| 1| 20| 200|cat13|22.1|
| 2| 30| 300|cat26|30.9|
| 3| 40| 400|cat26|30.9|
| 4| 50| 500|cat15|15.3|
| 5| 60| 600|cat13|22.1|
+----+----+----+-----+----+
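If you want to reproduce this example yourself, here is a minimal sketch of how such a dF could be built in Scala (the SparkSession value spark and the use of toDF are my assumptions, not something from your question):
// Scala
// minimal sketch, assuming a SparkSession named `spark` (as in spark-shell)
import spark.implicits._

val dF = Seq(
  (0, 10, 100, "cat26", 30.9),
  (1, 20, 200, "cat13", 22.1),
  (2, 30, 300, "cat26", 30.9),
  (3, 40, 400, "cat26", 30.9),
  (4, 50, 500, "cat15", 15.3),
  (5, 60, 600, "cat13", 22.1)
).toDF("col0", "col1", "col2", "col3", "col4")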
First, we need to group and aggregate the rows by col3 and col4, as werner implied, into a DataFrame named grouped.
// Scala
// needs: import org.apache.spark.sql.functions.count
val grouped = dF.select("col0", "col1", "col2", "col3", "col4")
  .groupBy("col3", "col4")
  .agg(count("*").as("counter"))

// Java
// needs: import static org.apache.spark.sql.functions.count;
DataFrame grouped = dF.select("col0", "col1", "col2", "col3", "col4")
  .groupBy("col3", "col4")
  .agg(count("*").as("counter"));
grouped is going to have this as a result:
+-----+----+-------+
| col3|col4|counter|
+-----+----+-------+
|cat13|22.1| 2|
|cat26|30.9| 3|
|cat15|15.3| 1|
+-----+----+-------+
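As a side note, you can inspect this intermediate result yourself with a plain show (the row order may differ, since groupBy gives no ordering guarantee):
// Scala
grouped.show()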
Then, we must join dF and grouped, using their common columns (col3, col4) as the keys for the join (Seq is a Scala class, but you can load it into Java like this; if that throws an error, use the workaround seen here). The select method after the join is there to preserve the correct order of the columns, since join naturally puts col3 and col4 at the start of the schema.
// Scala
dF.join(grouped, Seq("col3", "col4"))
.select("col0","col1","col2","col3","col4", "counter")

// Java
// columns are qualified to avoid ambiguous references after the join
dF.join(grouped, dF.col("col3").equalTo(grouped.col("col3"))
        .and(dF.col("col4").equalTo(grouped.col("col4"))))
  .select(dF.col("col0"), dF.col("col1"), dF.col("col2"),
          dF.col("col3"), dF.col("col4"), grouped.col("counter"));
And the output looks like this:
+----+----+----+-----+----+-------+
|col0|col1|col2| col3|col4|counter|
+----+----+----+-----+----+-------+
| 0| 10| 100|cat26|30.9| 3|
| 1| 20| 200|cat13|22.1| 2|
| 2| 30| 300|cat26|30.9| 3|
| 3| 40| 400|cat26|30.9| 3|
| 4| 50| 500|cat15|15.3| 1|
| 5| 60| 600|cat13|22.1| 2|
+----+----+----+-----+----+-------+
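For completeness, if you prefer the Seq-based join from Java as well, a sketch along these lines should work on Scala 2.11/2.12 (the JavaConverters calls here are my assumption about what the linked workaround does, so adapt them to your Scala version):
// Java
// sketch: build a Scala Seq from a Java list, assuming Scala 2.11/2.12
// needs: import java.util.Arrays; import scala.collection.JavaConverters;
scala.collection.Seq<String> joinCols =
    JavaConverters.asScalaBufferConverter(Arrays.asList("col3", "col4"))
                  .asScala()
                  .toSeq();
dF.join(grouped, joinCols)
  .select("col0", "col1", "col2", "col3", "col4", "counter");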