We are performing a POC to run a Spark Structured Streaming application on GKE (using spark-operator), and we plan to store our checkpoints in GCS.
From the GCS documentation, it seems the bucket should be set up in the same location as the GKE cluster, with Location type "Region", Access control "Uniform", and Storage class "Standard".
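For reference, a bucket with those settings could be provisioned with something like the following (the bucket name and region here are hypothetical placeholders, not our actual values):

```shell
# Regional bucket, Standard storage class, uniform bucket-level access
# (bucket name and location are placeholders)
gcloud storage buckets create gs://my-spark-checkpoints \
  --location=us-central1 \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access
```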
My question is for people/teams who have implemented Spark checkpointing in GCS: are these settings good decisions, and what has your experience been in terms of performance?
For an input rate of a few thousand rows per second, this setup works well for us, but we wanted to get some real-world examples, if any, before we provision more resources and commit to deploying our Spark application on GKE.
We currently run our application on YARN and want to move to GKE.
The Spark version we are running the POC on is 3.3.1.
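For context, a trimmed-down sketch of how the spark-operator side might be wired up is below (the name, main class, and jar path are hypothetical placeholders). The checkpoint path itself is set inside the application via the streaming query's `checkpointLocation` option, e.g. `.option("checkpointLocation", "gs://my-spark-checkpoints/query-1")`:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: streaming-poc                      # hypothetical name
spec:
  type: Scala
  mode: cluster
  sparkVersion: "3.3.1"
  mainClass: com.example.StreamingJob      # hypothetical
  mainApplicationFile: "gs://my-artifacts/streaming-job.jar"  # hypothetical
  hadoopConf:
    # GCS connector classes so Spark can resolve gs:// paths for checkpoints
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
```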