
Is it possible to have Spark take a local file as input, but process it in a distributed way?

I have sc.textFile("file:///path-to-file-locally") in my code, and I know that the exact path to the file is correct. Yet I am still getting

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 14, spark-slave11.ydcloud.net): java.io.FileNotFoundException: File file:/<path to file> does not exist

I am running Spark distributed, not locally. Why does this error occur?

buzzinolops

2 Answers


It is possible, but when you declare a local path as an input it has to be present on each worker machine as well as on the driver. That means you have to distribute the file first, either manually or using built-in tools like SparkFiles.
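A minimal sketch of the SparkFiles route in PySpark, reusing the question's placeholder path (the app name and staging details are assumptions for illustration):

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="distribute-local-file")

    # Ship the driver-local file to every executor's work directory.
    sc.addFile("file:///path-to-file-locally")

    def read_local_copy(_):
        # SparkFiles.get resolves the node-local copy by file name.
        with open(SparkFiles.get("path-to-file-locally")) as f:
            for line in f:
                yield line.rstrip("\n")

    # Run the read inside a task so each executor opens its own copy.
    lines = sc.parallelize([0], numSlices=1).flatMap(read_local_copy)
    print(lines.count())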

zero323
  • Thanks. I suppose you could also use hadoop commands to distribute the file first, run spark, and then delete the file using hadoop commands. Right? – buzzinolops Jul 05 '16 at 19:19
  • Sure. The main point is: if you read data, it has to be accessible on each machine in the cluster. – zero323 Jul 05 '16 at 19:22
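A minimal sketch of the workflow described in this comment thread, assuming an HDFS cluster and hypothetical paths (/tmp/input.txt is just an example staging location):

    import subprocess

    from pyspark import SparkContext

    # Stage the local file into HDFS so every node in the cluster can read it.
    subprocess.check_call(["hdfs", "dfs", "-put", "/path-to-file-locally", "/tmp/input.txt"])

    sc = SparkContext(appName="hdfs-staging-example")
    print(sc.textFile("hdfs:///tmp/input.txt").count())

    # Remove the staged copy once the job is done.
    subprocess.check_call(["hdfs", "dfs", "-rm", "/tmp/input.txt"])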

The files must be located in a centralized location that is accessible to all the nodes. This can be achieved by using a distributed file system; DSE provides a replacement for HDFS called CFS (Cassandra File System). CFS is available when DSE is started in analytics mode using the -k option.

For further details on setting up and using CFS, have a look at the following link: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/ana/anaCFS.html
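Once the file lives on a filesystem every node can reach, reading it is just a matter of pointing textFile at that URI; a minimal sketch, assuming DSE's cfs:// scheme and a hypothetical path:

    from pyspark import SparkContext

    sc = SparkContext(appName="cfs-example")

    # Every analytics node in the DSE cluster can resolve a cfs:// URI.
    rdd = sc.textFile("cfs:///data/input.txt")
    print(rdd.count())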

Akash R