
I have a Spark standalone cluster with 2 worker nodes and 1 master node.

Using spark-shell, I was able to read data from a file on the local filesystem, apply some transformations, and save the final RDD to /home/output (let's say). The RDD got saved successfully, but the part files ended up on only one worker node, and the master node had only the _SUCCESS file.

Now, when I try to read this output data back from /home/output, I get no data: it finds 0 records on the master, so I am assuming that it is not checking the other worker nodes for the part files.

It would be great if someone could shed some light on why Spark is not reading from all the worker nodes, or on the mechanism Spark uses to read data from the worker nodes.
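
For reference, the write step looked roughly like this (a minimal sketch; the input path and the transformation are placeholders, not my exact code):

scala> val input = sc.textFile("file:///home/data/input.txt")   // placeholder input path on the local filesystem
scala> val result = input.map(_.trim)                           // placeholder transformation
scala> result.saveAsTextFile("/home/output")                    // each executor writes its own part-* files on its node; the driver writes _SUCCESS

And this is what the read attempt looks like: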

scala> sc.wholeTextFiles("/home/output/")
res7: org.apache.spark.rdd.RDD[(String, String)] = /home/output/ MapPartitionsRDD[5] at wholeTextFiles at <console>:25

scala> res7.count
res8: Long = 0
sForSujit
codeogeek

2 Answers


SparkContext (i.e. sc) resolves paths without a scheme against the default filesystem picked up from HADOOP_CONF_DIR (fs.defaultFS). This is generally set to hdfs://, which means that when you say sc.textFile("/home/output/") it looks for the file/dir as hdfs:///home/output, which in your case is not present on HDFS. The file:// scheme points to the local filesystem.

Try sc.textFile("file:///home/output"), which explicitly tells Spark to read from the local filesystem.
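
You can also confirm which filesystem sc resolves plain paths against by printing the default filesystem from the shell (a sketch; the exact value depends on your core-site.xml and may be an hdfs:// URL or file:///):

scala> sc.hadoopConfiguration.get("fs.defaultFS")      // default filesystem picked up from the Hadoop config, e.g. hdfs://<namenode>:8020
scala> sc.textFile("file:///home/output").count        // the file:// scheme bypasses the default and reads the local filesystem on each node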

serious_black
  • I tried that, but it didn't work. The present condition is that the master has the output folder with the _SUCCESS file, and the worker nodes have the rest of the part files in the "output" folder. Now, when I read this output folder, it gives me nothing, which is why I assume it is reading only from the master. – codeogeek Aug 18 '17 at 09:14
  • Can you please provide the initial steps you used to write /home/output? – serious_black Aug 18 '17 at 09:54

You should put the file on all worker machines with the same path and name.
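
For example (a sketch, assuming the complete /home/output directory, including all part-* files, has been copied to the same path on the master and on both workers):

scala> val data = sc.textFile("file:///home/output")   // each task opens the files from the local disk of the node it runs on
scala> data.count                                      // returns the full count only if every node has an identical copy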

Robin