27

Can anyone explain to me why I am getting different results for these 2 expressions ? I am trying to filter between 2 dates:

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result : 37M

vs

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result: 25M

How are they different ? It seems to me like they should produce the same result

Community
  • 1
  • 1
femibyte
  • 3,317
  • 7
  • 34
  • 59

2 Answers2

59

TL;DR To pass multiple conditions to filter or where use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.

df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:

df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")

In practice it makes more sense to use between:

df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")

The first approach is not even remote valid. In Python, and returns:

  • The last element if all expressions are "truthy".
  • The first "falsey" element otherwise.

As a result

"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

is evaluated to (any non-empty string is truthy):

"act_date >='2016-10-01'"
Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
1

In first case

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is values more than 2016-10-01 that means all the values above 2017-04-01 also.

Whereas in second case

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is the values between 2016-10-01 to 2017-04-01.

Ash Man
  • 11
  • 3
  • 1
    I get this error when adding multiple conditions to the filter Py4JError: An error occurred while calling o355.or. Trace: py4j.Py4JException: Method or([class java.lang.String]) does not exist – Lance May 16 '19 at 18:31