I have written some PySpark code (using an SQL context) that performs operations on CSV files by converting them into PySpark dataframes (operations such as pre-processing, renaming columns, creating new columns and appending them to the same dataframe, and so on). I would like to write unit tests for it, but I have no idea how to write unit tests against dataframes. Can anyone help me out with how to write unit tests on dataframes in PySpark, or point me to some resources on testing dataframes?
1 Answer
Dataframes are not treated any differently from anything else in PySpark land. You can start by looking at the Python section of spark-testing-base. There are several interesting projects with dataframe tests, so you can peek at how they do it: Sparkling Pandas is one, and here is another example. There is also findspark, which will help locate your Spark installation. But the basic idea is to set up the path properly before you start your tests:
def add_pyspark_path():
    """
    Add PySpark to the PYTHONPATH
    Thanks go to this project: https://github.com/holdenk/sparklingpandas
    """
    import sys
    import os
    try:
        sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))
        sys.path.append(os.path.join(os.environ['SPARK_HOME'],
                                     "python", "lib", "py4j-0.9-src.zip"))
    except KeyError:
        print("SPARK_HOME not set")
        sys.exit(1)

add_pyspark_path()  # Now we can import pyspark
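If you use findspark (mentioned above), the same path setup can be reduced to a couple of lines. This is a minimal sketch that assumes Spark is installed locally or SPARK_HOME is set:

import findspark
findspark.init()  # locates Spark (via SPARK_HOME or a default install) and patches sys.path

import pyspark  # now importable without manual path handling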
And normally you would have one base test case class:
import logging
import unittest

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext, HiveContext


def quiet_py4j():
    """ turn down spark logging for the test context """
    logger = logging.getLogger('py4j')
    logger.setLevel(logging.WARN)


class SparkTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        quiet_py4j()
        # Set up a new spark context for the test class
        conf = SparkConf()
        conf.set("spark.executor.memory", "1g")
        conf.set("spark.cores.max", "1")
        # conf.set("spark.master", "spark://192.168.1.2:7077")
        conf.set("spark.app.name", "nosetest")
        cls.sc = SparkContext(conf=conf)
        cls.sqlContext = HiveContext(cls.sc)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()
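With that base class in place, a concrete test just subclasses it, builds a small dataframe inline, applies the transformation and asserts on the result. Here is a minimal sketch; the column names and the derived-column step are placeholders for whatever your own code does:

class DataFrameTransformTest(SparkTestCase):
    def test_derived_column(self):
        from pyspark.sql import functions as F
        # Build a tiny input dataframe instead of reading a real CSV
        df = self.sqlContext.createDataFrame(
            [("alice", 1), ("bob", 2)], ["name", "value"])
        # Transformation under test: add a derived column
        result = df.withColumn("value_doubled", F.col("value") * 2)
        # Assert on the schema and on the collected data
        self.assertEqual(result.columns, ["name", "value", "value_doubled"])
        rows = result.orderBy("name").collect()
        self.assertEqual([r["value_doubled"] for r in rows], [2, 4])

You can run it with nosetests or python -m unittest. It also helps to keep the CSV-reading code separate from the transformation functions (have them accept and return dataframes), so the tests never need to touch the filesystem.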

Oleksiy
Could you please have a look at `https://stackoverflow.com/questions/49420660/unit-test-pyspark-code-using-python` – User12345 Mar 22 '18 at 05:23