0

I am failing to achieve simple SQL query in Spark.

I would like to write the below query in Scala Spark:

select * from emp where emp_id in (select distinct manager_id from emp ;

Below is what I tried:

empdf.where(col("emp_id").isin(empdf.select("manager_id").collect().map(_(0)).toList)).show()

I got the below error:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(null, 68319, 68319, 68319, 65646, 65646, 69062, 66928, 66928, 66928, 66928, 67858, 66928, 67832)

mck
  • 40,932
  • 13
  • 35
  • 50

2 Answers2

1

It's better to do a semi join to avoid collecting as list:

empdf.alias("t1").join(empdf.alias("t2"), expr("t1.emp_id = t2.manager_id"), "left_semi")

If you want to use isin, you can expand the List using : _* (see this post):

empdf.where(col("emp_id").isin(empdf.select("manager_id").collect().map(_(0)).toList: _*)).show()

Or use isInCollection:

empdf.where(col("emp_id").isInCollection(empdf.select("manager_id").collect().map(_(0)).toList)).show()
mck
  • 40,932
  • 13
  • 35
  • 50
0

You can also write as a SQL,

empdf.createorreplaceview("emp")

spark.sql("select * from emp where emp_id in (select distinct manager_id from emp ")