
I have two DataFrames, df1 and df2:

df1:

+-------------------+----------+------------+
|         df1.name  |df1.state | df1.pincode|
+-------------------+----------+------------+
|  CYBEX INTERNATION| HOUSTON  | 00530      |
|        FLUID POWER| MEDWAY   | 02053      |
|   REFINERY SYSTEMS| FRANCE   | 072234     |
|    K N ENTERPRISES| MUMBAI   | 100010     |
+-------------------+----------+------------+

df2:

+--------------------+------------+------------+
|           df2.name |df2.state   | df2.pincode|
+--------------------+------------+------------+
|FLUID POWER PVT LTD | MEDWAY     | 02053      |
|  CYBEX INTERNATION | HOUSTON    | 02356      |
|REFINERY SYSTEMS LTD| MUMBAI     | 072234     |
+--------------------+------------+------------+

My task is to validate whether each row of df1 is present in df2: if it is, validated = 1, else validated = 0. I am running a join on state and pincode, and for the string comparison I first lowercase the name, sort its characters, and use Python's SequenceMatcher. The expected output is:

+-------------------+-------------------+----------+------------+------------+
|           df1.name|df2.name           |df1.state | df1.pincode|  Validated |
+-------------------+-------------------+----------+------------+------------+
|  CYBEX INTERNATION| NULL              |HOUSTON   | 00530      |     0      |
|        FLUID POWER|FLUID POWER PVT LTD|MEDWAY    | 02053      |     1      |
|   REFINERY SYSTEMS| NULL              |FRANCE    | 072234     |     0      |
|    K N ENTERPRISES| NULL              |MUMBAI    | 100010     |     0      |
+-------------------+-------------------+----------+------------+------------+

I have my code:

from pyspark.sql.types import *
from difflib import SequenceMatcher
from pyspark.sql.functions import col,when,lit,udf

contains = udf(
    lambda s, q: SequenceMatcher(
        None,
        "".join(sorted(s.lower())),
        "".join(sorted(q.lower()))
    ).ratio() >= 0.9,
    BooleanType()
)
join_condition = (
    (col("df1.pincode") == col("df2.pincode"))
    & (col("df1.state") == col("df2.state"))
)
result_df = (
    df1.alias("df1")
    .join(df2.alias("df2"), join_condition, "left")
    .where(contains(col("df1.name"), col("df2.name")))
)
result = result_df.select(
    "df1.*",
    when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated")
)
result.show()

But this is giving me AttributeError: 'NoneType' object has no attribute 'lower'. I know the unmatched rows have NULL in df2.name after the left join, which is why s.lower() and q.lower() fail, but how do I handle this? I want to keep only this condition inside contains to do the filtering.

Also, I need the df2.name column in the result, so I am passing the column names as a list:

cols = ["df1.name","df2.name","df1.state","df1.pincode"]
result = result_df.select(*cols,when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated"))

But then I get another error: SyntaxError: only named arguments may follow *expression
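Older Python versions raise that SyntaxError when a positional argument follows a `*expression` in a call; a common workaround is to build the full argument list first and unpack only once, e.g. `result_df.select(*(cols + [validated_col]))`. A minimal sketch of the pattern in plain Python, where `select` is just a stand-in for `DataFrame.select`:

```python
def select(*columns):
    # stand-in for DataFrame.select: echoes back the columns it received
    return list(columns)

cols = ["df1.name", "df2.name", "df1.state", "df1.pincode"]
# stand-in for the when(...).otherwise(...).alias("validated") expression
validated_col = "validated"

# select(*cols, validated_col) is a SyntaxError before Python 3.5,
# so concatenate into one list and unpack once:
result = select(*(cols + [validated_col]))
```

The same concatenation works unchanged with real Column objects in the list.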

Any help will be appreciated. Thanks.

Mohammad Rijwan
  • Please provide a reproducible example. Take a moment to read about how to post spark questions: https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples – YOLO Feb 24 '20 at 13:26
  • Hi @YOLO I have updated my problem statement and codes also. I am getting the result if I apply filter i.e result.filter("validate>0").show() but for result.show() I am getting error. – Mohammad Rijwan Feb 24 '20 at 13:35

1 Answer


In your UDF you are calling the .lower() method, which exists only on str objects. Your DataFrame evidently contains None values somewhere in df1.name or df2.name (here, df2.name is NULL for the left-join rows that found no match).

Replace your current UDF with something like this to handle None :

contains = udf(
    lambda s, q: SequenceMatcher(
        None,
        "".join(sorted((s or "").lower())), 
        "".join(sorted((q or "").lower()))
    ).ratio()>=0.9, BooleanType()
)
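The null-safe comparison itself can be checked without Spark; a minimal sketch of the same lambda as a plain Python function (the 0.9 threshold is taken from the question):

```python
from difflib import SequenceMatcher

def contains(s, q, threshold=0.9):
    # Treat None as the empty string, then compare the sorted
    # lowercase characters of both names, as in the original UDF.
    return SequenceMatcher(
        None,
        "".join(sorted((s or "").lower())),
        "".join(sorted((q or "").lower())),
    ).ratio() >= threshold

# Identical names match; a None on either side gives ratio 0.0
# against the non-empty side, so it safely returns False.
print(contains("FLUID POWER", "FLUID POWER"))   # True
print(contains("CYBEX INTERNATION", None))      # False
```

Note that `x or ""` replaces None (and the empty string) with "", so the UDF never calls .lower() on None.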
Steven