My problem: I am developing a Spark extension and I would like to run tests and performance benchmarks at scale before making the changes public. Currently this is a bit too manual: I compile & package my libraries, copy the jar files to a cluster where I have a private Spark deployment, restart Spark, then run the tests and benchmarks by hand. After each run I manually inspect logs and console output.
Could someone with more experience offer hints on how to automate this? I am particularly interested in two things:
Integration with GitHub & Jenkins. Ideally I would only have to push a commit to the GitHub repo; Jenkins would then automatically pull and build, add the new libraries to a Spark environment, start Spark, trigger the tests and benchmarks, and finally collect the output files and make them available. (The first sketch at the end of this post shows roughly the pipeline I have in mind.)
How to run and manage the Spark cluster. I see two main options:
a) keep a single Spark installation: the test framework would replace my jar files, restart Spark so the new libraries are picked up, and then run the tests/benchmarks (second sketch below). The advantage is that I only have to set up Spark (and maybe HDFS for sharing data & application binaries, YARN as the resource manager, etc.) once.
b) run Spark in containers: my cluster would run a container management system (like Kubernetes). The test framework would create/update the Spark container image, fire up & configure a number of containers to start Spark, submit the tests/benchmarks and collect the results (last sketch below). The big advantages are that multiple developers could run tests in parallel and that I could test against different versions of Spark & Hadoop.
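To make point 1 more concrete, this is roughly the declarative Jenkinsfile I picture. It is only a sketch: I am assuming an sbt build, and the trigger, results directory and the placeholder deploy stage are made up, not something I already have.

```groovy
// Rough pipeline sketch; sbt, the paths and the results directory are assumptions.
pipeline {
    agent any
    triggers {
        // poll the GitHub repo (or use a webhook instead)
        pollSCM('H/5 * * * *')
    }
    stages {
        stage('Build & package') {
            steps {
                checkout scm
                sh 'sbt clean package'   // or mvn package
            }
        }
        stage('Deploy & run benchmarks') {
            steps {
                // see the cluster-specific sketches below for what would go here
                echo 'deploy jars, (re)start Spark, run tests/benchmarks'
            }
        }
    }
    post {
        always {
            // keep logs and benchmark output as build artifacts
            archiveArtifacts artifacts: 'results/**', allowEmptyArchive: true
        }
    }
}
```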
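For option a), the "Deploy & run benchmarks" stage could boil down to something like the following. The host name, ssh user, jar path and benchmark script are all placeholders for whatever my cluster actually uses.

```groovy
// Option a): single long-running Spark installation; host, user and scripts are placeholders.
stage('Deploy & run benchmarks') {
    steps {
        // replace my jars on the cluster and restart Spark so they are picked up
        sh 'scp target/scala-2.12/*.jar spark@spark-test:/opt/spark/jars/extra/'
        sh 'ssh spark@spark-test "/opt/spark/sbin/stop-all.sh && /opt/spark/sbin/start-all.sh"'
        // run the benchmarks and pull the output back for archiving
        sh 'ssh spark@spark-test /opt/benchmarks/run-all.sh'
        sh 'scp -r spark@spark-test:/opt/benchmarks/results ./results'
    }
}
```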
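For option b), the same stage could instead build an image and submit straight to Kubernetes (spark-submit supports this since Spark 2.3). Again just a sketch: the registry, API server URL, namespace and benchmark class are made-up names.

```groovy
// Option b): containerised Spark on Kubernetes; registry, namespace and class names are placeholders.
stage('Build image & run on Kubernetes') {
    steps {
        // bake my jars into a Spark image tagged with the commit being tested
        sh 'docker build -t registry.example.com/spark-custom:${GIT_COMMIT} .'
        sh 'docker push registry.example.com/spark-custom:${GIT_COMMIT}'
        // spark-submit talks to the Kubernetes API server directly in cluster mode
        sh '''
            spark-submit \
              --master k8s://https://k8s-api.example.com:6443 \
              --deploy-mode cluster \
              --conf spark.kubernetes.container.image=registry.example.com/spark-custom:${GIT_COMMIT} \
              --conf spark.kubernetes.namespace=spark-bench \
              --class com.example.BenchmarkRunner \
              local:///opt/spark/examples/benchmarks.jar
        '''
    }
}
```

With per-commit image tags and a namespace per developer or per build, parallel runs and testing against different Spark & Hadoop versions would stay isolated, which is what attracts me to option b).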