I have written some code in Python with an SQL context, i.e. PySpark, to perform operations on CSV files by converting them into PySpark DataFrames (operations such as pre-processing, renaming columns, creating new columns and appending them to the same DataFrame, and so on). I wish to write unit test cases for it, but I have no idea how to write unit test cases for DataFrames. Can anyone tell me how to write unit test cases on DataFrames in PySpark, or point me to some sources for test cases on DataFrames?

1 Answer

DataFrames are not different from anything else in PySpark land. You can start by looking at the Python section of spark-testing-base. There are several interesting projects with DataFrame tests, so you can peek at how they do it: Sparkling Pandas is one, and here is another example. There is also findspark, which will help locate your Spark installation. The basic idea is to set up the path properly before you start your tests:

def add_pyspark_path():
    """
    Add PySpark to the PYTHONPATH
    Thanks go to this project: https://github.com/holdenk/sparklingpandas
    """
    import sys
    import os
    try:
        sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))
        sys.path.append(os.path.join(os.environ['SPARK_HOME'],
            "python","lib","py4j-0.9-src.zip"))
    except KeyError:
        print "SPARK_HOME not set"
        sys.exit(1)

add_pyspark_path() # Now we can import pyspark
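
If you would rather not manage PYTHONPATH by hand, the findspark package mentioned above does the same job. A minimal sketch, assuming SPARK_HOME is set in your environment:

import findspark
findspark.init()  # locates Spark via SPARK_HOME and adds it to sys.path

import pyspark    # now importable in your test module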

And normally you would have one base test case class:

import logging
import unittest

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext, HiveContext

def quiet_py4j():
    """ turn down spark logging for the test context """
    logger = logging.getLogger('py4j')
    logger.setLevel(logging.WARN)

class SparkTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        quiet_py4j()

        # Set up a new Spark context, shared by the tests in this class
        conf = SparkConf()
        conf.set("spark.executor.memory","1g")
        conf.set("spark.cores.max", "1")
        #conf.set("spark.master", "spark://192.168.1.2:7077")
        conf.set("spark.app.name", "nosetest")
        cls.sc = SparkContext(conf=conf)
        cls.sqlContext = HiveContext(cls.sc)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()
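
With that base class in place, a concrete DataFrame test is just a regular unittest method. Here is a sketch for the kind of column-adding transformation you describe; the with_total helper is a hypothetical stand-in for your own function:

from pyspark.sql import functions as F

def with_total(df):
    """Hypothetical transformation: derive a 'total' column from two inputs."""
    return df.withColumn("total", F.col("price") * F.col("quantity"))

class WithTotalTest(SparkTestCase):
    def test_adds_total_column(self):
        # Build a small input DataFrame inline
        input_df = self.sqlContext.createDataFrame(
            [(2.0, 3), (5.0, 4)], ["price", "quantity"])

        result = with_total(input_df)

        # The new column is appended and existing columns are untouched
        self.assertEqual(result.columns, ["price", "quantity", "total"])
        # Collect the (small) result and compare against expected values
        self.assertEqual([r.total for r in result.collect()], [6.0, 20.0])

The pattern is always the same: build a small input DataFrame in the test, run your transformation, then collect the result (or inspect its schema) and assert on plain Python values.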
Oleksiy
    could you please have a look at `https://stackoverflow.com/questions/49420660/unit-test-pyspark-code-using-python` – User12345 Mar 22 '18 at 05:23