How to mock this PySpark udf?

Question

My dataframe :

My udf is as follows:

@udf(returnType=StringType())
def clean_email(email): 
    try:
        regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b' 
        replace={"%20":"" ,  "//":"" ,"/":""}
        for i in replace:
             if i in email:
                 email=email.replace(i,"")
        if email is not None or '.jpg' in email or email.startswith('http'):    
            if email.endswith('.') :
                email=email[:len(email)-1]      
            return ''.join(e for e in email if (e.isalnum() or e in ['.', '@','-','_']))                       
        else:
            return ""
    except Exception as x:
         print("Error Occured in email udf, Error: " + str(x))

The code below is used to compare the 'expected' and 'Curated' columns:

My program:

df1=context.spark.read.option("header",True).csv("./test/input/11-udf-test/Book1.csv",schema=schema)
df3=sorted(df1.select(col("expected"))).collect()
df2=df1.withColumn("Curated", dataclean.clean_email(col("email")))
df4=sorted(df2.select(col("Curated"))).collect()
assert df3== df4

Error:

test/udf_test.py:32: AssertionError
    ====================================================== short test summary info =======================================================
    FAILED test/udf_test.py::test_upper - AssertionError: assert [Row(expected...xpected=None)] == [Row(Curated=...t@gmail.com')]
    ==================================================== 1 failed in

Why not just write ordinary python unit tests for it as an ordinary python method? The fact that it's _also_ registered as a Spark UDF shouldn't matter. — David Browne - Microsoft, Sep 04 '22 at 23:28
Examples here: https://docs.python.org/3/library/unittest.html — David Browne - Microsoft, Sep 04 '22 at 23:46
I have written a small program as suggested, but I am getting an error. I have updated the question; can you please check and give me a suggestion? — Xi12, Sep 05 '22 at 00:31

ZygD · Accepted Answer · 2022-09-06T08:27:14.880

The main thing you missed is that a Spark column (i.e. the result of the udf) does not really "exist" until you add it to a dataframe and run an action on that dataframe. In your case, you don't have a dataframe, so your udf never actually runs.

You could:

- create a dataframe (spark.range(1))
- add a literal column (.select ... lit('komu11.009@gmail.com'))
- run the clean_email udf on the column
- extract the column's first row's value as string (.head()[0])
- and only then compare it to your expected result

from pyspark.sql import functions as F
assert spark.range(1).select(clean_email(F.lit('komu11.009@gmail.com'))).head()[0] == 'komu.009@gmail.com'
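# AssertionError: the comparison is False, because the udf returns
# 'komu11.009@gmail.com', not 'komu.009@gmail.com'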

Edit - since you showed how your dataframe looks.

Don't create separate dataframes for the columns "expected" and "Curated", because there is no guarantee that the row order will be the same in both dataframes. You could do something along these lines:

df1 = spark.createDataFrame(
    [('%20komu.009@gmail.com', 'komu.009@gmail.com'),
     ('test@gmail.com.', 'test@gmail.com'),
     ('//brackets@telligen.com', 'brackets@telligen.com'),
     ('httpsgmail.com', '')],
    ['email', 'expected'])
df2 = df1.withColumn("Curated", clean_email("email"))
for row in df2.collect():
    assert row['expected'] == row['Curated']

AssertionError

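You get the assertion error because httpsgmail.com is not changed by your udf logic, i.e. "httpsgmail.com" != "". The condition `email is not None or ...` is always true for a string, so the else branch that returns "" is never reached.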
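As suggested in the comments, the same behaviour can be reproduced without Spark by testing the logic as an ordinary Python function. The following is only a sketch: `clean_email_logic` is a hypothetical, condensed copy of the udf body from the question (the try/except is left out), and `clean_email_assumed_fix` is a guess at the intended rule (None, '.jpg' names and values starting with 'http' map to ""), not something the question confirms.

```python
def clean_email_logic(email):
    """Condensed plain-Python copy of the udf body (hypothetical name, no Spark)."""
    for token in ("%20", "//", "/"):
        email = email.replace(token, "")
    # Same condition as in the question: for any string, `email is not None`
    # is already True, so the `or` parts never change the outcome.
    if email is not None or '.jpg' in email or email.startswith('http'):
        if email.endswith('.'):
            email = email[:-1]
        return ''.join(c for c in email if c.isalnum() or c in '.@-_')
    return ""


def clean_email_assumed_fix(email):
    """ASSUMPTION: the author wanted these inputs rejected with an empty string."""
    if email is None or '.jpg' in email or email.startswith('http'):
        return ""
    return clean_email_logic(email)


# Same (email, expected) pairs as in the dataframe above.
cases = [
    ('%20komu.009@gmail.com', 'komu.009@gmail.com'),
    ('test@gmail.com.', 'test@gmail.com'),
    ('//brackets@telligen.com', 'brackets@telligen.com'),
    ('httpsgmail.com', ''),
]

# Collect every mismatch instead of stopping at the first failing assert.
mismatches = [
    (raw, expected, clean_email_logic(raw))
    for raw, expected in cases
    if clean_email_logic(raw) != expected
]
print(mismatches)
```

Collecting the mismatches in a list shows all failing rows at once. With the original condition, only the 'httpsgmail.com' row fails. Under the assumed rule all four rows match, but whether that rule is what you want depends on your data.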
