I'm trying to understand whether it is possible to run Spark on a cluster without Hadoop services. It seems like it should be, given pages like standalone Spark and Spark on Mesos, but neither offers an alternative to HDFS. Is there one? The goal is to find a way to deploy a Spark cluster without signing up to manage a Hadoop cluster, too.
Viewed 81 times
This post implies that you can use NFS instead: https://stackoverflow.com/questions/32542719/using-apache-spark-with-hdfs-vs-other-distributed-storage?rq=2 but you won't get the scaling of HDFS – Wheezil Jun 17 '23 at 11:38
1 Answer
NFS, S3 (MinIO), GCS, Azure WASB, Databricks DBFS, and Ceph all work with Spark.
Spark on Mesos is deprecated according to the note on its page, in favor of Kubernetes, which is even more complicated to manage than Hadoop.
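For a sense of what the Kubernetes route looks like, here is a sketch of a cluster-mode submission; the API server address, container image, and jar path are placeholders, not values from this thread:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Note that `local://` refers to a path inside the container image, and the image must bundle a matching Spark distribution.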

– OneCricketeer
- To use S3 or GCS with Spark, do you have to define the default Hadoop FileSystem on the cloud storage using the Hadoop configuration files? – Wheezil Jun 17 '23 at 11:55
- Spark still relies on hadoop-client JAR files, so yes, you'd configure external filesystems via the Hadoop core-site.xml config, or do it in your Spark app via SparkSession config – OneCricketeer Jun 17 '23 at 11:56
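As a minimal sketch of the core-site.xml approach (the bucket name and credentials are placeholders, and the hadoop-aws / S3A connector JARs are assumed to be on the classpath):

```xml
<configuration>
  <!-- Make S3A the default filesystem; "my-bucket" is a placeholder -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
  <!-- Static credentials shown for brevity; an IAM/instance-profile
       credentials provider is usually preferable to keys in config -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

The same settings can be passed per-app by prefixing them with `spark.hadoop.`, e.g. `.config("spark.hadoop.fs.s3a.access.key", ...)` on the SparkSession builder.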
- Standalone mode doesn't elastically scale. Depends on your needs, really. – OneCricketeer Jun 17 '23 at 11:57
- Meaning... in standalone you have to choose a fixed number of VMs during configuration? Makes sense. FWIW, our devops team is already neck-deep in K8s, so that's actually a good alternative to Hadoop for us. – Wheezil Jun 17 '23 at 11:58
- Standalone mode has designated master/worker instances, yes. You could combine it with a cloud auto-scaling group, though – OneCricketeer Jun 17 '23 at 12:02
- Note: S3 is not a real filesystem, and there are some corner cases to be aware of. Job commit needs a manifest table format (Iceberg, Hudi, Delta) or an S3-aware committer. Logs written to S3 are only saved on close(), and more. If you want reliable logging, save the intermediate logs somewhere else. – stevel Jun 26 '23 at 13:05
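To illustrate the "S3-aware committer" option, here is a spark-defaults.conf sketch enabling the S3A "magic" committer; this assumes a Hadoop 3.x S3A connector and Spark's spark-hadoop-cloud module on the classpath:

```properties
# Use the S3A magic committer instead of rename-based job commit
spark.hadoop.fs.s3a.committer.name              magic
spark.hadoop.fs.s3a.committer.magic.enabled     true
# Route Spark's commit protocol through the Hadoop PathOutputCommitter
spark.sql.sources.commitProtocolClass           org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class        org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The magic committer avoids the slow, non-atomic directory renames that a plain FileOutputCommitter performs against S3.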
- @stevel Are you suggesting using something like fluentd to collect driver/executor logs and ship them to Elastic, Splunk, etc.? In other words, if not using YARN, then log aggregation would work differently? – OneCricketeer Jun 26 '23 at 14:44
- YARN log aggregation has apps logging to local disk, then copies the logs to the cluster fs afterwards. I'm thinking of anything which wants to stream to a cluster fs and expects all the output to be there even in the presence of failures. You can often identify code which does this because it calls hsync() and hflush() on the output stream, expecting data to persist. If you run with a recent Hadoop release and set `fs.s3a.downgrade.syncable.exceptions` to false, you can get the s3a connector to fail on this – we've turned it on from time to time to see what has unrealistic expectations... – stevel Jun 27 '23 at 18:44
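For reference, the strict-failure mode described above is a single Hadoop config flag; a core-site.xml sketch (assuming a Hadoop release recent enough to have the option):

```xml
<configuration>
  <!-- When false, hsync()/hflush() on an S3A output stream raise an
       exception instead of being quietly downgraded to a no-op -->
  <property>
    <name>fs.s3a.downgrade.syncable.exceptions</name>
    <value>false</value>
  </property>
</configuration>
```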