
We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode.

We submit the Spark job from an edge node.

But when we place the file on a local file path instead of HDFS, we get a FileNotFoundException.

Code:

sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("file:/filepath/file.csv")

We also tried file:///, but we still get the same error.

Error log:

2016-12-24 16:05:40,044 WARN  [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Shankar
  • does that file exist at that location? – mrsrinivas Dec 24 '16 at 12:38
  • @mrsrinivas: yes, it's available; that's why when I run the job on the yarn cluster in local mode it works fine; it only fails in yarn-client mode. – Shankar Dec 24 '16 at 12:49
  • 1
    In normal case it has to work as you have tried. However , if the intention is to make it work then try [SparkFiles](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/SparkFiles.html) your case something like this `import org.apache.spark.SparkFiles SparkContext.addFile("file:/filepath/file.csv") println(SparkFiles.getRootDirectory()) println(SparkFiles.get("file.csv")) sqlContext.read.format("com.databricks.spark.csv") .option("header", "true").option("inferSchema", "true") .load(SparkFiles.get("file.csv"))` – Ram Ghadiyaram Dec 24 '16 at 19:56
  • Also please post all the versions & spark-submit command along/as part of your question. – Ram Ghadiyaram Dec 24 '16 at 20:02
  • @Ram Ghadiyaram: thanks, I will try the Sparkfiles tomorrow and let you know.... – Shankar Dec 25 '16 at 13:50
  • @Ram Ghadiyaram: we are using Spark version 1.6.1 and corresponding Spark CSV reader – Shankar Dec 25 '16 at 13:52
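
A cleaned-up sketch of the SparkFiles workaround suggested in the comments above (Spark 1.6 API; `sc` and `sqlContext` are the usual driver handles, and the file path is the placeholder from the question). Note that addFile is called on the SparkContext instance, and how SparkFiles.get resolves between the driver and the executors can vary by deploy mode, so treat this as a sketch rather than a guaranteed fix:

import org.apache.spark.SparkFiles

// Ship the edge-node-local file to every executor's work directory.
sc.addFile("file:///filepath/file.csv")

// Resolve the node-local copy of the distributed file and read it.
val df = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load(SparkFiles.get("file.csv"))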

2 Answers


Yes, this will work fine in local mode, but it won't work from the edge node, because a file that is local to the edge node is not accessible to the worker nodes. HDFS makes the file accessible to every node via the file's URL.
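
A minimal sketch of that route (the HDFS destination path below is a placeholder): copy the file into HDFS first, for example with hdfs dfs -put /shared/sample1.csv /data/sample1.csv, and then point the reader at the HDFS URL, which every executor can reach:

sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("hdfs:///data/sample1.csv")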

Akash Sethi
  • Does it mean we cannot read any files from a Linux file path, and only an HDFS location should be used to read files? – Shankar Dec 24 '16 at 11:45
  • TBO I never tried this. What actually happens is that the path you provide for the file must be accessible to the master and worker nodes; if the nodes are unable to access the file, then you face such issues. This comes down to networking: if you can make your local file accessible to the master and worker nodes, then you won't face such an issue. – Akash Sethi Dec 24 '16 at 11:52

This seems like a bug when reading a local file in spark-shell, but there is a workaround: when running the spark-submit command, just specify the following on the command line.

--conf "spark.authenticate=false"

See SPARK-23476 for reference.
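
For example, the flag can be passed on the spark-submit command line (a sketch; the master setting, application jar, and main class below are placeholders):

# placeholder jar and main class; adjust to your job
spark-submit --master yarn-client \
  --conf "spark.authenticate=false" \
  --class com.example.CsvJob your-app.jar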

sumitya