
I have a utility function written in Python that writes Parquet and JSON files to an S3 bucket. This is the function:

import logging

def write_to_s3(data1, data2, s3_path):
    try:
        # Write the first DataFrame as Parquet and the second as gzipped JSON
        data1.write.mode("overwrite").parquet(s3_path)
        data2.write.mode("overwrite").json(s3_path, compression="gzip")
    except Exception as err:
        logging.error(err)
        raise

I'm still learning unit testing, and I'm wondering if there's a way to mock the Spark session so I can avoid setting up a real one in the unit tests. Could someone help me write unit test cases for this? I found a similar question, but it's for Scala and it needs a real Spark session; I thought there might be a way to mock it, like we can mock S3. Hope this makes sense, thanks.
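To show what I mean, here is a minimal sketch of the kind of test I'm hoping for. It mocks the DataFrames themselves rather than the Spark session, and `my_module` is just a placeholder for wherever `write_to_s3` is defined:

import unittest
from unittest import mock

# "my_module" is a placeholder for the module containing write_to_s3
from my_module import write_to_s3


class TestWriteToS3(unittest.TestCase):
    def test_writes_both_dataframes(self):
        # Mock the DataFrames, so no SparkSession is needed;
        # mock.Mock() auto-creates the .write.mode(...) attribute chain
        mock_data1 = mock.Mock()
        mock_data2 = mock.Mock()

        write_to_s3(mock_data1, mock_data2, "s3://bucket/prefix")

        mock_data1.write.mode.assert_called_once_with("overwrite")
        mock_data2.write.mode.assert_called_once_with("overwrite")

    def test_write_failure_is_reraised(self):
        # Simulate a failing Parquet write and check the exception propagates
        mock_data1 = mock.Mock()
        mock_data1.write.mode.return_value.parquet.side_effect = RuntimeError("boom")

        with self.assertRaises(RuntimeError):
            write_to_s3(mock_data1, mock.Mock(), "s3://bucket/prefix")


if __name__ == "__main__":
    unittest.main()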

Update: I have followed the page that @Mauro Baraldi recommended below. That approach works, but it only checks that the write operation was called once. How can I test the parquet and json parts to make sure the data is written to S3 in the expected format? Thanks.
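To make the update concrete, this is the sort of assertion I'm after (a sketch; since `mock.Mock()` auto-creates attributes, the object returned by `mode("overwrite")` is reachable as `write.mode.return_value`):

mock_data1 = mock.Mock()
mock_data2 = mock.Mock()
write_to_s3(mock_data1, mock_data2, "s3://bucket/prefix")

# mode("overwrite") returns write.mode.return_value, so the format-specific
# calls hang off that child mock
mock_data1.write.mode.return_value.parquet.assert_called_once_with("s3://bucket/prefix")
mock_data2.write.mode.return_value.json.assert_called_once_with(
    "s3://bucket/prefix", compression="gzip"
)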

  • Take a look [here](https://towardsdatascience.com/stop-mocking-me-unit-tests-in-pyspark-using-pythons-mock-library-a4b5cd019d7e) – Mauro Baraldi Apr 23 '21 at 17:32
  • @MauroBaraldi Hi I've followed this page and had a try, but it gave me an error `AssertionError: Expected 'mock' to have been called once. Called 0 times.`, I've updated my question, could you please have a look? Many thanks. – wawawa Apr 25 '21 at 14:37
  • I don't know anything about Spark. Which type of object are `data1` and `data2`? I believe you must mock the `data1` and `data2` objects. – Mauro Baraldi Apr 26 '21 at 01:49
  • @MauroBaraldi `data1` and `data2` are both PySpark DataFrames... from the article, it seems like they can be mocked with `mock_data1 = mock.Mock()`, but it doesn't work somehow... – wawawa Apr 26 '21 at 08:10
