
My problem: I am developing a Spark extension and I would like to run tests and performance benchmarks at scale before making the changes public. Currently such testing is rather manual: I compile and package my libraries, copy the jar files to a cluster where I have a private Spark deployment, restart Spark, then run the tests and benchmarks by hand. After each test I manually inspect logs and console output.

Could someone with more experience offer hints on how to make this more automatic? I am particularly interested in:

  • Ability to integrate with GitHub & Jenkins. Ideally I would only have to push a commit to the GitHub repo, then Jenkins would automatically pull and build, add the new libraries to a Spark environment, start Spark & trigger the tests and benchmarks, and finally collect the output files and make them available. (A minimal Jenkinsfile sketch follows this list.)

  • How to run and manage the Spark cluster. I see a number of options:

    a) continue with a single Spark installation: the test framework would update my jar files, restart Spark so the new libraries are picked up, and then run the tests/benchmarks. The advantage is that I only have to set up Spark (and perhaps HDFS for sharing data & application binaries, YARN as the resource manager, etc.) once. (The deploy stage in the pipeline sketch below follows this approach.)

    b) run Spark in containers: my cluster would run a container orchestrator (such as Kubernetes). The test framework would build/update the Spark container image, start & configure a number of containers running Spark, submit the tests/benchmarks, and collect results. The big advantages are that multiple developers can run tests in parallel and that I can test against various versions of Spark & Hadoop. (A spark-submit sketch for this option also follows the list.)
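
For the GitHub & Jenkins part, a webhook-triggered pipeline covers the whole push → build → deploy → test → archive loop. Below is a minimal declarative Jenkinsfile sketch; the sbt build, the host name spark-test-cluster, the jar path, and the class my.ext.Benchmarks are all placeholders I made up for illustration. The deploy stage follows option (a): copy the fresh jar over and restart Spark so it is picked up.

```groovy
// Minimal declarative Jenkinsfile sketch. Requires the GitHub plugin
// and a webhook on the repo so pushes trigger a build.
pipeline {
    agent any
    triggers {
        githubPush()
    }
    stages {
        stage('Build') {
            steps {
                // Build and package the extension (or: mvn -DskipTests package)
                sh 'sbt package'
            }
        }
        stage('Deploy to test cluster') {
            steps {
                // Option (a): push the fresh jar and bounce Spark so it is picked up
                sh 'scp target/scala-2.12/my-extension_2.12-0.1.jar spark@spark-test-cluster:/opt/spark/jars/'
                sh 'ssh spark@spark-test-cluster "/opt/spark/sbin/stop-all.sh && /opt/spark/sbin/start-all.sh"'
            }
        }
        stage('Tests & benchmarks') {
            steps {
                // Run the benchmark driver remotely and capture its output
                sh 'ssh spark@spark-test-cluster "/opt/spark/bin/spark-submit --class my.ext.Benchmarks /opt/spark/jars/my-extension_2.12-0.1.jar" > bench.log'
            }
        }
    }
    post {
        always {
            // Make logs and benchmark output downloadable from the build page
            archiveArtifacts artifacts: 'bench.log', allowEmptyArchive: true
        }
    }
}
```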
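
For option (b), Spark 2.3 and later can talk to Kubernetes natively, so the test framework only has to build and push an image per commit and point spark-submit at the API server. A sketch, where the registry, API-server address, and class name are again placeholders:

```bash
# Build and publish a test image tagged with the commit under test
docker build -t registry.example.com/spark-ext-test:$GIT_COMMIT .
docker push registry.example.com/spark-ext-test:$GIT_COMMIT

# Submit the benchmarks; Kubernetes spins up driver and executor pods,
# so multiple developers can run isolated clusters in parallel.
./bin/spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name ext-benchmarks-$GIT_COMMIT \
  --class my.ext.Benchmarks \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.container.image=registry.example.com/spark-ext-test:$GIT_COMMIT \
  local:///opt/spark/jars/my-extension_2.12-0.1.jar
```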

Radu
  • In the past I was able to get away with running simple tests on small samples of data using a single Docker image running just Spark (no YARN or HDFS). I may do this again since it was easy and fast. – szeitlin Jul 05 '23 at 23:36

1 Answer


Create a Docker container that holds your entire solution, including the tests, push it to GitHub, and have Drone CI or Travis CI listen for pushes and build it. This works great for me; a minimal Drone CI config sketch follows.
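
Something like this (Drone 1.x syntax; the image and the sbt commands are placeholders — substitute an image that has your build tool, and the equivalent Travis setup is a .travis.yml with the same steps):

```yaml
# Minimal .drone.yml sketch: run the build and tests in a container
# on every push to the repo.
kind: pipeline
type: docker
name: spark-extension-tests

steps:
  - name: build-and-test
    image: jupyter/all-spark-notebook   # placeholder; any image with Spark + your build tool
    commands:
      - sbt package    # build the extension jar
      - sbt test       # run the tests against the bundled local Spark
```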

There are many Spark Docker images on GitHub and Docker Hub; I use this one:

https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
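
If you bake your jar and tests into that image, the CI job only has to build and run the container. A hypothetical Dockerfile on top of it — the jar path and the benchmark class are placeholders, and I'm assuming the image's usual SPARK_HOME of /usr/local/spark:

```dockerfile
# Sketch: extend the all-spark-notebook image with the extension under test
FROM jupyter/all-spark-notebook

# Drop the freshly built jar onto Spark's classpath
COPY target/scala-2.12/my-extension_2.12-0.1.jar /usr/local/spark/jars/

# Run the benchmarks when the container starts
CMD ["/usr/local/spark/bin/spark-submit", \
     "--class", "my.ext.Benchmarks", \
     "/usr/local/spark/jars/my-extension_2.12-0.1.jar"]
```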