
I want to test a method we have that is formatted something like this:

def extractTable( spark: SparkSession, /* unrelated other parameters */ ): DataFrame = {
  // Code before that I want to test
  val df = spark.read
    .format("jdbc")
    .option("url", "URL")
    .option("driver", "<Driver>")
    .option("fetchsize", "1000")
    .option("dbtable", "select * from whatever")
    .load()
  // Code after that I want to test 
}  

And I am trying to make stubs of the spark object, and the DataFrameReader objects that the read and option methods return:

val sparkStub = stub[SparkSession]
val dataFrameReaderStub = stub[DataFrameReader]

(dataFrameReaderStub.format _).when(*).returning(dataFrameReaderStub) // Works
(dataFrameReaderStub.option _).when(*, *).returning(dataFrameReaderStub) // Error
(dataFrameReaderStub.load _).when(*).returning( ??? ) // Return a dataframe // Error

(sparkStub.read _).when().returning(dataFrameReaderStub)

But I am getting errors on dataFrameReaderStub.option and dataFrameReaderStub.load that say "Cannot resolve symbol option" and "Cannot resolve symbol load". Yet these methods definitely exist on the object that spark.read returns.

How can I resolve this error, or is there a better way to mock/test the code I have?

Jared DuPont

1 Answer


I would suggest you look at this library for testing Spark code: https://github.com/holdenk/spark-testing-base

Mix https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext into your test suite, or alternatively spin up your own SparkSession with a local[2] master.
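For example, here is a minimal sketch of the second option: a ScalaTest suite that spins up its own local SparkSession and reads a CSV fixture instead of the JDBC source. The suite name, fixture path, and assertion are placeholders, not anything from your code:

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ExtractTableSpec extends AnyFunSuite {

  // One local SparkSession for the whole suite; "local[2]" runs on two threads.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("extract-table-test")
    .getOrCreate()

  test("the code around the read behaves as expected") {
    // Read a fixture file instead of hitting the real JDBC source.
    val df = spark.read
      .option("header", "true")
      .csv("src/test/resources/whatever.csv")

    assert(df.count() > 0)
  }
}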

Mocking Spark classes will be quite painful and will probably not succeed. I am speaking from experience here, both from working with Spark for a long time and from maintaining ScalaMock as a library.

You are better off using real Spark in your tests, just not against the real data sources. Instead, load the test data from csv/parquet/json, or generate it programmatically (useful if it contains timestamps and the like).
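And a sketch of generating the test data in code, which avoids fixture files entirely. The column names and values here are made up:

import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}

// Build a small DataFrame programmatically; handy when the data
// contains timestamps that fixture files would make awkward.
def testData(spark: SparkSession): DataFrame = {
  import spark.implicits._
  Seq(
    (1, "alice", Timestamp.valueOf("2020-04-20 00:00:00")),
    (2, "bob",   Timestamp.valueOf("2020-04-21 00:00:00"))
  ).toDF("id", "name", "created_at")
}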

Philipp
  • I don't disagree that ScalaMock is lacking here, but I am not really sure how that will help. I have a method with a bunch of code before and after a spark.read call, and that is the code I want to test. For example, the code generates an Oracle query that it uses to extract the data and then passes to spark.read; how can I inject my own spark.read in the middle of a method? – Jared DuPont Apr 21 '20 at 13:41
  • You probably need a different way of abstracting that in your code then, e.g. a layered mix-in (or DI-injected) trait that you can replace in the test code. – Philipp Apr 22 '20 at 16:34
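A minimal sketch of what such an injected trait could look like (the trait and class names here are hypothetical): the method under test depends on the trait instead of calling spark.read directly, and the test swaps in an implementation that returns canned data, so the query-building code before the read and the transformation code after it can both be exercised without a database.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Production code depends on this trait instead of calling spark.read itself.
trait TableReader {
  def readTable(spark: SparkSession, query: String): DataFrame
}

// Real implementation: the JDBC read from the original extractTable.
class JdbcTableReader extends TableReader {
  override def readTable(spark: SparkSession, query: String): DataFrame =
    spark.read
      .format("jdbc")
      .option("url", "URL")
      .option("driver", "<Driver>")
      .option("fetchsize", "1000")
      .option("dbtable", query)
      .load()
}

// Test implementation: ignores the query and returns prepared data.
class FixtureTableReader(fixture: DataFrame) extends TableReader {
  override def readTable(spark: SparkSession, query: String): DataFrame = fixture
}

// extractTable then takes a TableReader, so tests can pass a FixtureTableReader.
def extractTable(spark: SparkSession, reader: TableReader): DataFrame = {
  val query = "select * from whatever" // query-building code to test goes here
  val df = reader.readTable(spark, query)
  df // transformation code to test goes here
}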