82
val items = List("a", "b", "c")

sqlContext.sql("select c1 from table")
          .filter($"c1".isin(items))
          .collect
          .foreach(println)

The code above throws the following exception.

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(a, b, c) 
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:642)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:642)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.Column.isin(Column.scala:642)

Below is my attempt to fix it. It compiles and runs but doesn't return any match. Not sure why.

val items = List("a", "b", "c").mkString("\"","\",\"","\"")

sqlContext.sql("select c1 from table")
          .filter($"c1".isin(items))
          .collect
          .foreach(println)
Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
Nabegh
  • 3,249
  • 6
  • 25
  • 26

5 Answers5

154

According to documentation, isin takes a vararg, not a list. List is actually a confusing name here. You can try converting your List to vararg like this:

val items = List("a", "b", "c")

sqlContext.sql("select c1 from table")
          .filter($"c1".isin(items:_*))
          .collect
          .foreach(println)

Your variant with mkString compiles, because one single String is also a vararg (with number of arguments equal to 1), but it is proably not what you want to achieve.

TheMP
  • 8,257
  • 9
  • 44
  • 73
19

It worked like this in Java Api (Java 8)

.isin(sampleListName.stream().toArray(String[]::new))));

sampleListName is a List

Anandkumar
  • 1,338
  • 13
  • 15
15

Spark has now (since 2.4.0) a method called isInCollection, which is just what you are looking for, instead of isIn.

(shouldn't they unify the methods?)

Lucas Lima
  • 832
  • 11
  • 23
  • 2
    Be aware this method is implemented via: `def isInCollection(values: scala.collection.Iterable[_]): Column = isin(values.toSeq: _*)` (cf: [definition](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L813) ). I don't really know why the API has two different names for similar behaviours. – BlueSheepToken Jul 13 '21 at 16:12
  • 1
    That makes me even more confused. I don't really understand. I'd have expected an overloaded method to handle both types of inputs, just like many others in the API. Thank you for the info! – Lucas Lima Jul 13 '21 at 16:47
  • This [PR](https://github.com/apache/spark/commit/900bc1f7dc5a7c013b473fceab1c4052ade74a2f) introduces the method without explanation :/ – BlueSheepToken Jul 15 '21 at 10:00
  • 2
    @BlueSheepToken The discussion is on [this](https://github.com/apache/spark/pull/21416) PR. The reason it's not overloaded is that `isin` is defined to take `Any*`, and `Iterable[_]` can be an `Any*` leading to a conflict. – Lexi Tramel Aug 11 '21 at 17:05
  • I don't know how do you folks find this kind of stuff, but my random tangent question was fully answered 20 months after I let it out to the world. Thank you! – Lucas Lima Aug 12 '21 at 19:39
2

As Tomalak has mentioned it :

isin(java.lang.Object... list)
A boolean expression that is evaluated to true if the value 
of this expression is contained by the evaluated values of the arguments.

Therefore, you just could fix this making the following change :

val items = List("a", "b", "c").map(c => s""""$c"""")
Francis Toth
  • 1,595
  • 1
  • 11
  • 23
  • 1
    Why the map? My is that `.filter($"c1".isin(List("a", "b", "c")))` would work. – Tomalak Sep 13 '15 at 17:00
  • Considering your code `List("a", "b", "c").mkString("\"","\",\"","\"")`, I assumed that you wanted to surround each item by double quotes. The map does the same thing. – Francis Toth Sep 13 '15 at 17:10
-1

Even easier:

sqlContext.sql("select c1 from table")
          .filter($"c1".isin("a", "b", "c"))
          .collect
          .foreach(println)

Unless you have a lot of list values, which isn't the case usually.

pedromorfeu
  • 1,739
  • 14
  • 18