I'm trying to understand whether it is possible to run Spark on a cluster without Hadoop services. It seems like it should be, given pages like standalone Spark and Spark on Mesos, but neither offers an alternative to HDFS. Is there one? The goal is to find a way to deploy a Spark cluster without signing up to manage a Hadoop cluster, too.
Viewed 81 times
This post implies that you can use NFS instead: https://stackoverflow.com/questions/32542719/using-apache-spark-with-hdfs-vs-other-distributed-storage?rq=2 but you won't get the scaling of HDFS – Wheezil Jun 17 '23 at 11:38
1 Answer
NFS, S3 (MinIO), GCS, Azure WASB, Databricks DBFS, and Ceph all work with Spark.
Spark on Mesos is deprecated according to the note on its page, in favor of Kubernetes, which is even more complicated to manage than Hadoop.
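For a sense of what the Kubernetes route looks like, here is a sketch of a cluster-mode submission; the API server address, container image, and jar path are placeholders, not values from this thread:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Note that `local://` refers to a path inside the container image, and the image must bundle a matching Spark distribution.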

– OneCricketeer
- To use S3 or GCS with Spark, do you have to define the default Hadoop FileSystem on the cloud storage using the Hadoop configuration files? – Wheezil Jun 17 '23 at 11:55
- Spark still relies on hadoop-client JAR files, so yes, you'd configure external filesystems via the Hadoop core-site.xml config, or do it in your Spark app via SparkSession config – OneCricketeer Jun 17 '23 at 11:56
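As a minimal sketch of the core-site.xml approach (the bucket name and credentials are placeholders, and the hadoop-aws / S3A connector JARs are assumed to be on the classpath):

```xml
<configuration>
  <!-- Make S3A the default filesystem; "my-bucket" is a placeholder -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
  <!-- Static credentials shown for brevity; an IAM/instance-profile
       credentials provider is usually preferable to keys in config -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

The same settings can be passed per-app by prefixing them with `spark.hadoop.`, e.g. `.config("spark.hadoop.fs.s3a.access.key", ...)` on the SparkSession builder.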
- Standalone mode doesn't elastically scale. Depends on your needs, really. – OneCricketeer Jun 17 '23 at 11:57
- Meaning... in standalone you have to choose a fixed number of VMs during configuration? Makes sense. FWIW, our devops team is already neck-deep in K8s, so that's actually a good alternative to Hadoop for us. – Wheezil Jun 17 '23 at 11:58
- Standalone mode has designated master/worker instances, yes. You could combine it with a cloud auto-scaling group, though – OneCricketeer Jun 17 '23 at 12:02
- Note: S3 is not a real filesystem, and there are some corner cases to be aware of. Job commit needs a manifest table format (Iceberg, Hudi, Delta) or an S3-aware committer. Logs written to S3 are only saved on close(), and more. If you want reliable logging, save the intermediate logs somewhere else. – stevel Jun 26 '23 at 13:05
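To illustrate the "S3-aware committer" option, here is a spark-defaults.conf sketch enabling the S3A "magic" committer; this assumes a Hadoop 3.x S3A connector and Spark's spark-hadoop-cloud module on the classpath:

```properties
# Use the S3A magic committer instead of rename-based job commit
spark.hadoop.fs.s3a.committer.name              magic
spark.hadoop.fs.s3a.committer.magic.enabled     true
# Route Spark's commit protocol through the Hadoop PathOutputCommitter
spark.sql.sources.commitProtocolClass           org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class        org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The magic committer avoids the slow, non-atomic directory renames that a plain FileOutputCommitter performs against S3.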
- @stevel Are you suggesting using something like fluentd to collect driver/executor logs and ship them to Elastic, Splunk, etc.? In other words, if not using YARN, then log aggregation would work differently? – OneCricketeer Jun 26 '23 at 14:44
- YARN log aggregation has apps logging to local disk, then copies the logs to the cluster fs afterwards. I'm thinking of anything which wants to stream to a cluster fs and expects all the output to be there even in the presence of failures. You can often identify code which does this because it calls hsync() and hflush() on the output stream, expecting data to persist. If you run with a recent Hadoop release and set `fs.s3a.downgrade.syncable.exceptions` to false, you can get the s3a connector to fail on this – we've turned it on from time to time to see what has unrealistic expectations... – stevel Jun 27 '23 at 18:44
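For reference, the strict-failure mode described above is a single Hadoop config flag; a core-site.xml sketch (assuming a Hadoop release recent enough to have the option):

```xml
<configuration>
  <!-- When false, hsync()/hflush() on an S3A output stream raise an
       exception instead of being quietly downgraded to a no-op -->
  <property>
    <name>fs.s3a.downgrade.syncable.exceptions</name>
    <value>false</value>
  </property>
</configuration>
```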