
I was following this tutorial (https://www.guru99.com/pyspark-tutorial.html) and trying to read a csv file with sqlContext.read.csv, but I got this error: 'Path does not exist: file:/C:/Users/asus/AppData/Local/Temp/spark-62c50c87-060e-49f7-b331-111abfa496f3/userFiles-da6cdfff-ea8a-426c-b4f4-fe5a15c67794/adult.csv;'

I heard that I might have to copy the file across all the nodes of the same shared file system, or use HDFS, but I don't know exactly how to do either of these.

This is the code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

This is the result I got:

AnalysisException: 'Path does not exist: file:/C:/Users/asus/AppData/Local/Temp/spark-62c50c87-060e-49f7-b331-111abfa496f3/userFiles-da6cdfff-ea8a-426c-b4f4-fe5a15c67794/adult.csv;'
  • The error message appears to think there is a semicolon after the filename. Maybe that is the problem? – John Gordon Aug 20 '19 at 21:01
  • 1
    ^ thats a negative, semicolon is added by sloppy py exception formatter – mazaneicha Aug 20 '19 at 21:47
  • @luoyang I think your tutorial should instruct you to add a file to the context first, via `sc.addFile("adult.csv")`. – mazaneicha Aug 20 '19 at 21:50
  • @Luo Yang--Please see answer below. This can solve your issue. https://stackoverflow.com/questions/57014043/reading-data-from-url-using-spark-databricks-platform/57019702#57019702 – vikrant rana Aug 21 '19 at 05:21

1 Answer


You should follow the instructions on the site you linked and run the following first:

url = "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/adult_data.csv"
from pyspark import SparkFiles
from pyspark.sql import SQLContext
sc.addFile(url)
sqlContext = SQLContext(sc)

Then you can load the file with read.csv. Note that the name passed to SparkFiles.get must match the file name at the end of the URL, which is adult_data.csv here:

df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)

sc.addFile(url) downloads the file into the SparkFiles root directory, which is where SparkFiles.get resolves file names. You can check the current root directory with:

SparkFiles.getRootDirectory()

It should look something like this:

C:/Users/asus/AppData/Local/Temp/spark-62c50c87-060e-49f7-b331-111abfa496f3/userFiles-da6cdfff-ea8a-426c-b4f4-fe5a15c67794/

So when you called SparkFiles.get('adult.csv') without adding the file first, Spark looked for the file under that directory and did not find it, which is why you saw the error message.
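To see why SparkFiles.get can hand back a path for a file that was never downloaded: it essentially just joins the root directory with the file name and does not check that the file exists. Here is a simplified sketch under that assumption (sparkfiles_get and root are illustrative names, not pyspark APIs):

```python
import os

# Simplified sketch of what SparkFiles.get does (an assumption about
# its behavior, not the actual pyspark source): it only joins the
# SparkFiles root directory with the file name -- it never verifies
# that the file is actually there.
def sparkfiles_get(root_dir, filename):
    return os.path.join(root_dir, filename)

# Hypothetical root directory, like the temp directory in the error message:
root = "/tmp/spark-temp/userFiles"
path = sparkfiles_get(root, "adult.csv")
print(path)  # "/tmp/spark-temp/userFiles/adult.csv"
```

So the path is constructed either way; the "Path does not exist" error only surfaces later, when read.csv actually tries to open it.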

Another solution is to download the file, put it in a local directory, and run:

df = spark.read.csv(your_local_path_to_adult.csv, header=True, inferSchema=True)