I have two DataFrames, df1 and df2:
df1:
+-------------------+----------+------------+
| df1.name |df1.state | df1.pincode|
+-------------------+----------+------------+
| CYBEX INTERNATION| HOUSTON | 00530 |
| FLUID POWER| MEDWAY | 02053 |
| REFINERY SYSTEMS| FRANCE | 072234 |
| K N ENTERPRISES| MUMBAI | 100010 |
+-------------------+----------+------------+
df2:
+--------------------+------------+------------+
| df2.name |df2.state | df2.pincode|
+--------------------+------------+------------+
|FLUID POWER PVT LTD | MEDWAY | 02053 |
| CYBEX INTERNATION | HOUSTON | 02356 |
|REFINERY SYSTEMS LTD| MUMBAI | 072234 |
+--------------------+------------+------------+
My task is to validate whether each row of df1 is present in df2: if it is, validated = 1, otherwise validated = 0. I join the two DataFrames on state and pincode, and to compare the names I first convert each string to lower case, sort its characters, and then use Python's SequenceMatcher. The expected output is:
+-------------------+-------------------+----------+------------+------------+
| df1.name|df2.name |df1.state | df1.pincode| Validated |
+-------------------+-------------------+----------+------------+------------+
| CYBEX INTERNATION| NULL |HOUSTON | 00530 | 0 |
| FLUID POWER|FLUID POWER PVT LTD|MEDWAY | 02053 | 1 |
| REFINERY SYSTEMS| NULL |FRANCE | 072234 | 0 |
| K N ENTERPRISES| NULL |MUMBAI | 100010 | 0 |
+-------------------+-------------------+----------+------------+------------+
I have my code:
from pyspark.sql.types import *
from difflib import SequenceMatcher
from pyspark.sql.functions import col,when,lit,udf
contains = udf(lambda s, q: SequenceMatcher(None,"".join(sorted(s.lower())), "".join(sorted(q.lower()))).ratio()>=0.9, BooleanType())
join_condition = ((col("df1.pincode") == col("df2.pincode")) & (col("df1.state") == col("df2.state")))
result_df = df1.alias("df1").join(df2.alias("df2"), join_condition , "left").where(contains(col("df1.name"), col("df2.name")))
result = result_df.select("df1.*",when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated"))
result.show()
But this raises AttributeError: 'NoneType' object has no attribute 'lower'. I know that df2.name is null for the unmatched rows, which is why s.lower() and q.lower() fail, but how do I tackle this? I want to keep only this condition in contains for the filtering step.
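One possible way to avoid the AttributeError is to check for None before calling .lower(). The following is a minimal sketch in plain Python, not a confirmed fix: the function name names_match and its threshold parameter are assumptions introduced here for illustration. It could presumably be wrapped the same way as the lambda above, e.g. udf(names_match, BooleanType()).

```python
from difflib import SequenceMatcher


def names_match(s, q, threshold=0.9):
    """Compare two names by the similarity of their sorted, lower-cased characters.

    names_match and threshold are illustrative names, not part of the original code.
    """
    # After a left join, df2.name is None for unmatched rows,
    # so guard before calling .lower() on either value.
    if s is None or q is None:
        return False
    a = "".join(sorted(s.lower()))
    b = "".join(sorted(q.lower()))
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Note that ratio() is 2*M/T, where M is the number of matched characters and T is the combined length of both strings. For "FLUID POWER" (11 characters) and "FLUID POWER PVT LTD" (19 characters) it can be at most 22/30, about 0.73, so a 0.9 threshold would not mark that pair as a match.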
Also, I need the df2.name column in the result, so I am passing the column names as a list:
cols = ["df1.name","df2.name","df1.state","df1.pincode"]
result = result_df.select(*cols,when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated"))
But then I get another error: SyntaxError: only named arguments may follow *expression
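This SyntaxError comes from Python versions before 3.5, which do not allow a positional argument after a *expression in a call. A possible workaround is to append the extra column to the list first and unpack once. The sketch below only shows the call shape: select here is a hypothetical stand-in for DataFrame.select, and the string "validated" is a placeholder for the when(...).otherwise(...).alias("validated") column.

```python
def select(*columns):
    # Hypothetical stand-in for DataFrame.select, used only to show the call shape
    return list(columns)


cols = ["df1.name", "df2.name", "df1.state", "df1.pincode"]

# Placeholder for the when(...).otherwise(...).alias("validated") column expression
validated = "validated"

# Concatenate first, then unpack once, so nothing positional follows the *expression
select_args = cols + [validated]
selected = select(*select_args)
```

With the real DataFrame the equivalent call would be result_df.select(*select_args), which is valid syntax on both older and newer Python versions.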
Any help will be appreciated. Thanks.