
I have a utility function written in Python that writes Parquet and JSON files to an S3 bucket. This is the function:

import logging

def write_to_s3(data1, data2, s3_path):
    try:
        # Write the first DataFrame as Parquet and the second as gzipped JSON
        data1.write.mode("overwrite").parquet(s3_path)
        data2.write.mode("overwrite").json(s3_path, compression="gzip")
    except Exception as err:
        logging.error(err)
        raise

I'm still learning unit testing, and I'm wondering if there's a way to mock the Spark session so I can avoid setting up a real one in the unit tests. Could someone help me write unit test cases for this? I found a similar question, but it's for Scala and it needs a real Spark session; I thought there might be a way to mock it, like we can mock S3. Hope this makes sense, thanks.
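To show what I mean, here is a minimal sketch of the kind of test I'm hoping for. It mocks the DataFrames themselves rather than the Spark session, and `my_module` is just a placeholder for wherever `write_to_s3` is defined:

import unittest
from unittest import mock

# "my_module" is a placeholder for the module containing write_to_s3
from my_module import write_to_s3


class TestWriteToS3(unittest.TestCase):
    def test_writes_both_dataframes(self):
        # Mock the DataFrames, so no SparkSession is needed;
        # mock.Mock() auto-creates the .write.mode(...) attribute chain
        mock_data1 = mock.Mock()
        mock_data2 = mock.Mock()

        write_to_s3(mock_data1, mock_data2, "s3://bucket/prefix")

        mock_data1.write.mode.assert_called_once_with("overwrite")
        mock_data2.write.mode.assert_called_once_with("overwrite")

    def test_write_failure_is_reraised(self):
        # Simulate a failing Parquet write and check the exception propagates
        mock_data1 = mock.Mock()
        mock_data1.write.mode.return_value.parquet.side_effect = RuntimeError("boom")

        with self.assertRaises(RuntimeError):
            write_to_s3(mock_data1, mock.Mock(), "s3://bucket/prefix")


if __name__ == "__main__":
    unittest.main()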

Update: I have followed the page that @Mauro Baraldi recommended below. That approach works, but it only checks that the write operation was called once. How can I test the parquet and json parts to make sure the data is written to S3 in the expected format? Thanks.
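To make the update concrete, this is the sort of assertion I'm after (a sketch; since `mock.Mock()` auto-creates attributes, the object returned by `mode("overwrite")` is reachable as `write.mode.return_value`):

mock_data1 = mock.Mock()
mock_data2 = mock.Mock()
write_to_s3(mock_data1, mock_data2, "s3://bucket/prefix")

# mode("overwrite") returns write.mode.return_value, so the format-specific
# calls hang off that child mock
mock_data1.write.mode.return_value.parquet.assert_called_once_with("s3://bucket/prefix")
mock_data2.write.mode.return_value.json.assert_called_once_with(
    "s3://bucket/prefix", compression="gzip"
)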

  • Take a look [here](https://towardsdatascience.com/stop-mocking-me-unit-tests-in-pyspark-using-pythons-mock-library-a4b5cd019d7e) – Mauro Baraldi Apr 23 '21 at 17:32
  • @MauroBaraldi Hi I've followed this page and had a try, but it gave me an error `AssertionError: Expected 'mock' to have been called once. Called 0 times.`, I've updated my question, could you please have a look? Many thanks. – wawawa Apr 25 '21 at 14:37
  • I don't know anything about Spark. Which type of object are `data1` and `data2`? I believe you must mock the `data1` and `data2` objects. – Mauro Baraldi Apr 26 '21 at 01:49
  • @MauroBaraldi `data1` and `data2` are both PySpark DataFrames... from the article, it seems like they can be mocked with `mock_data1 = mock.Mock()`, but it doesn't work somehow... – wawawa Apr 26 '21 at 08:10
