
I have a PySpark DataFrame, and I want to keep only those IDs which begin with 5 or 6 AND have length 5. The IDs are stored as strings.

from pyspark.sql.functions import col, length

df = sqlContext.createDataFrame([('51087',), ('61211',), ('09112',), ('521',)], ("ID",))
df.show()
+-----+
|   ID|
+-----+
|51087|
|61211|
|09112|
|  521|
+-----+

df = df.where((col("ID").like('5%') | col("ID").like('6%')) & (length(col("ID")) == 5))
df.show()
# I get the right answer.
+-----+
|   ID|
+-----+
|51087|
|61211|
+-----+

But if I use the following code, which mixes a SQL-style condition (the former, a string) with a DataFrame-style condition (the latter, a Column), I get the wrong result:

df = df.where(("ID like '5%' or ID like '6%'") and (length(col("ID")) == 5))
df.show()
# Wrong result
+-----+
|   ID|
+-----+
|51087|
|61211|
|09112|
+-----+
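A quick way to see what happens here is a pure-Python sketch, run outside Spark, with a placeholder string standing in for the real Column object (the placeholder name is mine, not part of the original code). Python evaluates `and` before `where()` is ever called, and a non-empty string is truthy, so `string and x` simply returns `x` — the SQL string is discarded and only the length condition ever reaches Spark, which is why `09112` survives.

```python
# Python evaluates `and` before Spark sees anything.
# A non-empty string is truthy, so `s and x` returns x unchanged.
sql_cond = "ID like '5%' or ID like '6%'"
length_cond = "length-condition-placeholder"  # stands in for the Column object

passed_to_where = sql_cond and length_cond
print(passed_to_where)  # only the second operand survives
```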

Not only that: if I use the `or` operator instead of the `and` operator, I again get the wrong result.

df = df.where(("ID like '5%' or ID like '6%'") or (length(col("ID")) == 5))
df.show()
# Again wrong result
+-----+
|   ID|
+-----+
|51087|
|61211|
|  521|
+-----+
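The mirror-image sketch for `or` (again pure Python, with a placeholder of my own in place of the Column object): `or` returns its first truthy operand, which is the non-empty string itself, so the Column condition is discarded and Spark parses only the SQL string — which is why `521` (length 3, but starting with 5) slips through.

```python
# `or` returns the first truthy operand: the non-empty string itself.
# The Column condition never reaches Spark at all.
sql_cond = "ID like '5%' or ID like '6%'"
length_cond = "length-condition-placeholder"  # stands in for the Column object

passed_to_where = sql_cond or length_cond
print(passed_to_where)  # only the SQL string survives
```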

Now, this is just counterintuitive. The result I expect with the `and` operator I get with the `or` operator, and vice versa. Let me construct a truth table of the two conditions. Can anyone explain this fallacy?

Truth Table
ID     condition1  condition2   AND    OR
51087  True        True         True   True
61211  True        True         True   True
09112  False       True         False  True
521    True        False        False  True
cph_sto
    `("ID like '5%' or ID like '6%'") and (length(col("ID"))==5)` does not do what you think it does. You need `from pyspark.sql.functions import expr` and do something like `where(expr("ID like '5%' or ID like '6%'") & (length(col("ID"))==5))`. (Notice the bitwise operator `&` between the columns; don't use `and` or `or`.) – pault Oct 19 '18 at 15:20
  • Oh, then my understanding is fallacious. I must research this issue afresh. Your code works perfectly. Thank you pault for your help. Very appreciated :) – cph_sto Oct 19 '18 at 15:31
