
Spark 1.4.1

I have run into a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below.

import sqlContext.implicits._
import org.apache.spark.sql._

case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()

Then grouping and filtering:

df.groupBy("x").count()
  .filter("count >= 2")
  .show()

Throws an exception:

java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found

count >= 2
      ^

Solution:

Renaming the column makes the problem vanish (I suspect because there is no longer a conflict with the 'count' function):

df.groupBy("x").count()
  .withColumnRenamed("count", "n")
  .filter("n >= 2")
  .show()

So, is this behavior to be expected, is it a bug, or is there a canonical way to work around it?

thanks, alex


3 Answers


When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a variable name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).

You can easily avoid this by using a column expression instead of a String:

df.groupBy("x").count()
  .filter($"count" >= 2)
  .show()
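
If the implicits that provide the $ interpolator are not in scope, a minimal equivalent sketch uses org.apache.spark.sql.functions.col to build the same Column expression:

import org.apache.spark.sql.functions.col

// col("count") refers to the column directly, so the SQL string
// parser never sees the reserved word.
df.groupBy("x").count()
  .filter(col("count") >= 2)
  .show()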
– Herman

So, is that a behavior to expect, a bug

Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.

is there a canonical way to go around?

Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:

import org.apache.spark.sql.functions.count

df.groupBy("x")
  .agg(count("*").alias("cnt"))
  .where($"cnt" >= 2)
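
And a hedged sketch of going all the way to SQL (the temporary table name paf is arbitrary, and this assumes the plain SQLContext parser in this version accepts a HAVING clause):

// Register the DataFrame so it can be queried with raw SQL.
df.registerTempTable("paf")

// The aggregate alias and the HAVING predicate avoid the
// reserved-word clash entirely.
sqlContext.sql(
  "SELECT x, COUNT(*) AS cnt FROM paf GROUP BY x HAVING COUNT(*) >= 2"
).show()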
– zero323
  • How can I show all columns instead of the column x and the cnt col? – Abu Shoeb Aug 01 '18 at 04:38
  • @abu-shoeb You can use `agg(...)` with more than one expression. A common pattern is to use `min(name)` for all the other columns you'd like to show, giving the smallest value of the column in each group. You would have to list all columns explicitly; see the sketch below. – DanyalBurke Aug 10 '20 at 05:58
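
A hypothetical sketch of that pattern, assuming the DataFrame had a second column y (the original Paf case class has only x):

import org.apache.spark.sql.functions.{count, min}

// "y" is a hypothetical extra column used only for illustration;
// min carries one representative value per group through the agg.
df.groupBy("x")
  .agg(count("*").alias("cnt"), min("y").alias("y"))
  .where($"cnt" >= 2)
  .show()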

I think a solution is to put count in backticks:

.filter("`count` >= 2")

http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF@fmsmsx124.amr.corp.intel.com%3E
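
Applied to the example in the question, the full chain would read:

df.groupBy("x").count()
  .filter("`count` >= 2")
  .show()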

– mattinbits