My dataframe:
My UDF is as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_email(email):
    try:
        # Note: this pattern is defined but never used below.
        regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        replace = {"%20": "", "//": "", "/": ""}
        for i in replace:
            if i in email:
                email = email.replace(i, "")
        if email is not None or '.jpg' in email or email.startswith('http'):
            if email.endswith('.'):
                email = email[:len(email) - 1]
            return ''.join(e for e in email if (e.isalnum() or e in ['.', '@', '-', '_']))
        else:
            return ""
    except Exception as x:
        # No return here, so the UDF yields None (null) when an error occurs.
        print("Error occurred in email udf, Error: " + str(x))
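To see what the UDF returns for individual inputs, it can help to run the same logic outside Spark. The following is a minimal sketch, assuming a plain-Python copy of the function body; the name `clean_email_plain` and the sample inputs are hypothetical and not part of my project.

```python
def clean_email_plain(email):
    """Plain-Python copy of the UDF body (hypothetical helper for local checks)."""
    try:
        for token in ("%20", "//", "/"):
            if token in email:  # raises TypeError when email is None
                email = email.replace(token, "")
        # For any string, `email is not None` is already True, so the
        # '.jpg' and 'http' checks never change the outcome.
        if email is not None or ".jpg" in email or email.startswith("http"):
            if email.endswith("."):
                email = email[:-1]
            return "".join(ch for ch in email if ch.isalnum() or ch in ".@-_")
        else:
            return ""
    except Exception as exc:
        print("Error occurred in email udf, Error: " + str(exc))
        # Falls through without a return, so the result is None.


print(repr(clean_email_plain("john%20doe@gmail.com.")))  # 'johndoe@gmail.com'
print(repr(clean_email_plain(None)))                     # None
```

With this copy, a `None` input ends up in the `except` branch and comes back as `None`, not as an empty string.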
The code below compares the 'expected' and 'Curated' columns.
My program:
df1 = context.spark.read.option("header", True).csv("./test/input/11-udf-test/Book1.csv", schema=schema)
df3 = sorted(df1.select(col("expected")).collect())
df2 = df1.withColumn("Curated", dataclean.clean_email(col("email")))
df4 = sorted(df2.select(col("Curated")).collect())
assert df3 == df4
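One thing to keep in mind with this kind of comparison: in Python 3, sorting a list that mixes `None` and strings raises a `TypeError`. A minimal sketch of a None-safe sort key is shown here, assuming the column values have already been pulled into plain lists (for example with `[row[0] for row in df.select("expected").collect()]`); the helper name `none_first` and the sample values are hypothetical.

```python
def none_first(value):
    # Sort key: None sorts before every string; strings keep their usual order.
    return (value is not None, value or "")


# Hypothetical values standing in for the collected 'expected' and 'Curated' columns.
expected_values = ["b@y.org", None, "a@x.com"]
curated_values = ["a@x.com", "b@y.org", None]

print(sorted(expected_values, key=none_first) == sorted(curated_values, key=none_first))  # True
```

Note that sorting each column independently only checks that both columns hold the same set of values; it does not check that each row's 'Curated' value matches that same row's 'expected' value.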
Error:
test/udf_test.py:32: AssertionError
====================================================== short test summary info =======================================================
FAILED test/udf_test.py::test_upper - AssertionError: assert [Row(expected...xpected=None)] == [Row(Curated=...t@gmail.com')]
==================================================== 1 failed in
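Because pytest truncates the two lists in the assertion message, a row-by-row comparison may be easier to read. This is a minimal sketch, assuming the pairs come from something like `df2.select("expected", "Curated").collect()`; the helper `find_mismatches` and the sample pairs are hypothetical.

```python
def find_mismatches(pairs):
    """Return (row_index, expected, curated) for every pair that differs."""
    return [(i, exp, cur) for i, (exp, cur) in enumerate(pairs) if exp != cur]


# Hypothetical (expected, Curated) pairs standing in for collected rows.
sample_pairs = [
    ("a@x.com", "a@x.com"),
    (None, "t@gmail.com"),
]

for index, exp, cur in find_mismatches(sample_pairs):
    print(f"row {index}: expected={exp!r} curated={cur!r}")
```

Asserting that `find_mismatches(...)` returns an empty list would then show exactly which rows differ when the test fails.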