I have a PySpark DataFrame, and I want to keep only those IDs
that begin with 5 or 6 AND have length 5. The IDs are stored
as strings.
from pyspark.sql.functions import col, length

df = sqlContext.createDataFrame([('51087',), ('61211',), ('09112',), ('521',)], ("ID",))
df.show()
+-----+
| ID|
+-----+
|51087|
|61211|
|09112|
| 521|
+-----+
df = df.where((col("ID").like('5%') | col("ID").like('6%')) & (length(col("ID")) == 5))
df.show()
#I get the right answer.
+-----+
| ID|
+-----+
|51087|
|61211|
+-----+
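As I understand it, this version works because PySpark's Column overloads the bitwise operators `&` and `|` (via `__and__`/`__or__`) to build a combined filter expression. A toy sketch of that mechanism with a hypothetical `Expr` class, no Spark needed (this is not PySpark's actual implementation, just an illustration of the idea):

```python
class Expr:
    """Toy stand-in for an expression object that overloads & and |."""
    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `a & b` builds a new combined expression instead of a bool
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

a = Expr("ID like '5%'")
b = Expr("ID like '6%'")
c = Expr("length(ID) = 5")

combined = (a | b) & c
print(combined.text)  # -> ((ID like '5%' OR ID like '6%') AND length(ID) = 5)
```

The key point is that `&` and `|` can be overloaded to return a new expression object, which is why both conditions survive into the final filter.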
But if I use the following code, which mixes a SQL-expression condition (the first) with a DataFrame-style condition (the second), I get the wrong result:
df = df.where(("ID like '5%' or ID like '6%'") and (length(col("ID")) == 5))
df.show()
#Wrong result
+-----+
| ID|
+-----+
|51087|
|61211|
|09112|
+-----+
Not only that: if I use Python's `or` operator instead of `and`, I again get a wrong result.
df = df.where(("ID like '5%' or ID like '6%'") or (length(col("ID")) == 5))
df.show()
#Again wrong result
+-----+
| ID|
+-----+
|51087|
|61211|
| 521|
+-----+
Now, this is just counterintuitive. The result I expect with the `and` operator I get with `or`, and vice versa. Below is the truth table of the two conditions for each ID. Can anyone explain this behavior?
Truth Table
ID     condition1 (like '5%' or '6%')  condition2 (length == 5)  AND    OR
51087  True                            True                      True   True
61211  True                            True                      True   True
09112  False                           True                      False  True
521    True                            False                     False  True
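My current suspicion, which I would like confirmed: Python's `and`/`or` cannot be overloaded and operate only on the truthiness of their operands, so a non-empty condition string short-circuits them. A minimal pure-Python check, no Spark needed (the second string here just stands in for the length condition):

```python
sql_cond = "ID like '5%' or ID like '6%'"  # plain Python str, non-empty => truthy
length_cond = "length(ID) = 5"             # stand-in for the second condition

# `and` returns its right operand when the left is truthy;
# `or` returns the truthy left operand and never evaluates the right.
picked_by_and = sql_cond and length_cond
picked_by_or = sql_cond or length_cond

print(picked_by_and)  # -> length(ID) = 5
print(picked_by_or)   # -> ID like '5%' or ID like '6%'
```

If that is what happens, the `and` version keeps only the length filter (which would explain why 09112 passes), and the `or` version keeps only the like-condition string (which would explain why 521 passes), matching the outputs above exactly.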