
I have a data frame with four fields. One of the fields is named Status, and I am trying to use an OR condition in .filter for a dataframe. I tried the queries below but no luck.

df2 = df1.filter(("Status=2") || ("Status =3"))

df2 = df1.filter("Status=2" || "Status =3")

Has anyone used this before? I have seen a similar question on Stack Overflow here. They used the code below for the OR condition, but that code is for PySpark.

from pyspark.sql.functions import col 

numeric_filtered = df.where(
(col('LOW')    != 'null') | 
(col('NORMAL') != 'null') |
(col('HIGH')   != 'null'))
numeric_filtered.show()
– dheee
    Judging by this line: `scala> from pyspark.sql.column import Column` it seems like you're trying to use pyspark code when you're actually using scala – Ton Torres Mar 09 '16 at 01:50
  • @TonTorres Yes, it was a mistake; I realized that after posting this question. Making that edit now. – dheee Mar 09 '16 at 18:05

11 Answers


Instead of:

df2 = df1.filter("Status=2" || "Status =3")

Try:

df2 = df1.filter($"Status" === 2 || $"Status" === 3)
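
For this to compile, the $-prefixed column syntax and the === operator need Spark's implicits in scope; a minimal sketch, assuming a SparkSession named spark:

import spark.implicits._  // brings $"..." column syntax and === into scope

// === builds a Column predicate; || combines two Column predicates
val df2 = df1.filter($"Status" === 2 || $"Status" === 3)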
– David Griffin

This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on a Dataset/DataFrame support two syntaxes. The SQL string parameters:

df2 = df1.filter(("Status = 2 or Status = 3"))

and Column-based parameters (mentioned by @David):

df2 = df1.filter($"Status" === 2 || $"Status" === 3)

It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
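
As a quick sanity check, the two syntaxes can be compared directly; a minimal sketch, assuming a SparkSession named spark and a df1 with a numeric Status column:

import spark.implicits._

val sqlStyle = df1.filter("Status = 2 or Status = 3")          // SQL string syntax
val colStyle = df1.filter($"Status" === 2 || $"Status" === 3)  // Column syntax

// explain() should show equivalent physical plans for both
sqlStyle.explain()
colStyle.explain()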

– Amin

In Spark/Scala, it's pretty easy to filter with varargs.

val d = spark.read...  // data contains a column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
val filtered = d.filter($"matid".isin(ids: _*))
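
The ids: _* splat expands the Seq into the varargs that isin expects. Applied to the question's Status column, the same pattern would look like this (a sketch, reusing the question's df1):

val statuses = Seq(2, 3)
val df2 = df1.filter($"Status".isin(statuses: _*))  // true when Status matches any value in the Seq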
– Tony Fraser
df2 = df1.filter("Status = 2 OR Status = 3")

Worked for me.
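
The string passed to filter is parsed as a SQL WHERE-clause expression, so an IN list should work as well; a hedged alternative for the same condition:

val df2 = df1.filter("Status in (2, 3)")  // SQL IN instead of chained ORs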

– Mr. Simple

You need to use filter:

package dataframe

import org.apache.spark.sql.SparkSession
/**
 * @author vaquar.khan@gmail.com
 */
//

object DataFrameExample{
  //
  case class Employee(id: Integer, name: String, address: String, salary: Double, state: String,zip:Integer)
  //
  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    // create a sequence of case class objects 

    // (we defined the case class above)

    val emp = Seq( 
    Employee(1, "vaquar khan", "111 algoinquin road chicago", 120000.00, "AZ",60173),
    Employee(2, "Firdos Pasha", "1300 algoinquin road chicago", 2500000.00, "IL",50112),
    Employee(3, "Zidan khan", "112 apt abcd timesqure NY", 50000.00, "NY",55490),
    Employee(4, "Anwars khan", "washington dc", 120000.00, "VA",33245),
    Employee(5, "Deepak sharma ", "rolling edows schumburg", 990090.00, "IL",60172),
    Employee(6, "afaq khan", "saeed colony Bhopal", 1000000.00, "AZ",60173)
    )

    val employee = spark.sparkContext.parallelize(emp, 4).toDF()

    employee.printSchema()

    employee.show()


    employee.select("state", "zip").show()

    println("*** use filter() to choose rows")

    employee.filter($"state".equalTo("IL")).show()

    println("*** multi contidtion in filer || ")

    employee.filter($"state".equalTo("IL") || $"state".equalTo("AZ")).show()

    println("*** multi contidtion in filer &&  ")

    employee.filter($"state".equalTo("AZ") && $"zip".equalTo("60173")).show()

  }
}
– vaquar khan

In a Java Spark Dataset it can be used like this:

Dataset<Row> userFilter = user.filter(col("gender").isin("male", "female"));

– Dileep Dominic

Another way is to use the expr function with a where clause:

import org.apache.spark.sql.functions.expr

df2 = df1.where(expr("col1 = 'value1' and col2 = 'value2'"))

It works the same.
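
Applied to the question's condition, the same approach would look like this (a sketch reusing the question's df1):

import org.apache.spark.sql.functions.expr

val df2 = df1.where(expr("Status = 2 or Status = 3"))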

– llinvokerl

You can try the following (filtering with one object, like a list or a set of values):

ds = ds.filter(functions.col(COL_NAME).isin(myList.toArray()));  // Java: expand the list into isin's varargs

or, as @Tony Fraser suggested, with a Seq of objects (Scala):

ds = ds.filter(functions.col(COL_NAME).isin(mySeq: _*))

All the answers are correct, but most of them do not represent good coding style. Also, you should always allow for a variable-length argument list in the future, even if the arguments are static at a certain point in time.

– J. P

For future reference: we can use isInCollection to filter. Here is an example (note: it looks for an exact match):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def getSelectedTablesRows(allTablesInfoDF: DataFrame, tableNames: Seq[String]): DataFrame = {
  allTablesInfoDF.where(col("table_name").isInCollection(tableNames))
}
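
A hypothetical call, assuming a DataFrame named tablesDF with a table_name column (isInCollection is available from Spark 2.4 onward):

val selected = getSelectedTablesRows(tablesDF, Seq("employees", "departments"))
selected.show()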
– Abhishek Sengupta

If we want a partial match, like contains, we can chain the contains calls like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, lower}

def getSelectedTablesRows2(allTablesInfoDF: DataFrame, tableNames: Seq[String]): DataFrame = {
  val tableFilters = tableNames.map(_.toLowerCase()).map(name => lower(col("table_name")).contains(name))
  val finalFilter = tableFilters.fold(lit(false))((accu, newTableFilter) => accu or newTableFilter)
  allTablesInfoDF.where(finalFilter)
}
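
With the same hypothetical tablesDF, passing partial names keeps any row whose table_name contains one of them, case-insensitively:

val partialMatches = getSelectedTablesRows2(tablesDF, Seq("emp", "dept"))
partialMatches.show()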
– Abhishek Sengupta
df2 = df1.filter("Status=2")
     .filter("Status=3");
– pheww