10

I am trying to check whether a file exists before reading it with PySpark on Databricks, to avoid exceptions. I tried the code snippet below, but I get an exception when the file is not present:

from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .option("delimiter", ",") \
        .options(header='true', inferschema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")

When the file is present, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'".

– Onkar Musale, Amareshwar Reddy

7 Answers

12

Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the scheme of your cluster's file system.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    # resolve the FileSystem for the path's bucket;
    # path.split("/")[2] extracts the bucket name from "s3://bucket/..."
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
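
A minimal usage sketch (the bucket and key below are placeholders, not from the original answer):

path = "s3://my-bucket/data/file.csv"  # hypothetical path, for illustration only
print(path_exists(path))  # True if the object exists, False otherwise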
– rosefun
  • This worked for me to verify a file or path existed on S3. However, I didn't need to split the path to create the URI. This is what worked for me:

        def path_exists(self, path, sc):
            # spark is a SparkSession
            fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
                sc._jvm.java.net.URI.create("s3://" + path),
                sc._jsc.hadoopConfiguration(),
            )
            return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))

    – Jacob Levinson Sep 16 '21 at 20:03
6
# sc is the SparkContext; with no URI argument this returns the default FileSystem
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
– Prathik Kini
  • With an S3 filesystem, this approach fails. [Here's a solution for S3](https://gist.github.com/drorata/73c15740f83d0cb0b187d00a57ca74a1#check-file-existence-on-s3-using-pyspark). – Dror Jul 07 '19 at 06:22
    @dror do you know if there is a way to check if a path like `s3://my-bucket-s3-test/lookuo*.csv` exists? – andresg3 Sep 11 '20 at 01:46
  • @andresg3 did you find an answer to your question? – Shivangi Singh Jul 22 '22 at 03:47
4

The answer posted by @rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))

The function is the same, and it works fine to check whether a file exists in the S3 bucket path that you provide.

You will have to adapt this function based on how you specify the path value you pass to it.

path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)

If the path variable that you define has the s3 prefix in it, then it will work.

Also, the portion of the code that splits the string gets you just the bucket name: path.split("/")[2] will give you `bucket-name`.
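
To make the indexing concrete, here is what that split produces for the example path above:

path = "s3://bucket-name/import/data/"
print(path.split("/"))
# ['s3:', '', 'bucket-name', 'import', 'data', '']
# index 2 is the bucket name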

But if you don't have the s3 prefix in the path, then you will have to use the function with a small change, as below:

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
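
A usage sketch for this variant (the bucket name is a placeholder); note the path carries no s3:// prefix, since the function adds it:

path = "bucket-name/import/data/"  # no s3:// prefix here
pathexists = path_exists(path)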
– Nikunj Kakadiya
2

Looks like you should change `except IOError:` to `except AnalysisException:`.

Spark throws different errors/exceptions than regular Python in a lot of cases. It's not doing typical Python IO operations when reading a file, so it makes sense for it to throw a different exception.
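
A minimal sketch of that change applied to the question's code, assuming the AnalysisException import location from pyspark.sql.utils and the built-in csv reader rather than the external com.databricks.spark.csv package:

from pyspark.sql.utils import AnalysisException

try:
    # same read as in the question, just catching Spark's exception
    df = spark.read.format('csv') \
        .options(header='true', inferSchema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except AnalysisException:
    print("file not found")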

– dijksterhuis
  • `AnalysisException` is thrown regularly by Spark in many other situations, so even though it makes sense on the surface, it is better to check for the reason why this exception occurred. So the solution proposed by @Prathik makes more sense. – D3V Apr 09 '19 at 12:50
1

Nice to see you on Stack Overflow.

I second dijksterhuis's solution, with one caveat: AnalysisException is a very general exception in Spark, and may result from various causes, not only from a missing file.

If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether that's S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
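
As a sketch of what such a direct storage check could look like for S3, assuming boto3 is available (the bucket and key names here are hypothetical):

import boto3
from botocore.exceptions import ClientError

def s3_key_exists(bucket, key):
    # HEAD the object; a 404 error code means the key does not exist
    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

print(s3_key_exists("my-bucket", "import/data/file.csv"))  # hypothetical names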

– Elior Malul
-1
dbutils.fs.ls(file_location)

Do not import dbutils. It's already there when you start your cluster.
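
As pointed out in the comments below, dbutils.fs.ls itself raises an exception when the path is missing, so a sketch of a wrapper could look like this (the broad except is a simplification; Databricks surfaces a Java-backed exception here):

def file_exists(file_location):
    try:
        dbutils.fs.ls(file_location)
        return True
    except Exception:
        # dbutils.fs.ls raises when the path does not exist
        return False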

– Mustafa Aydın, Nayan Sarkar
  • It will still throw an exception if the file doesn't exist. – Alex Ott Mar 31 '21 at 12:50
  • Hi @AlexOtt, this function provides you a list of files and folders in a given path. Now you have to be sure about the path up to a certain extent, correct? Then you can look at what files and folders are in the subsystem and go in accordingly. – Nayan Sarkar Apr 08 '21 at 06:44
-2

You can validate the existence of a file as shown here:

import os

if os.path.isfile('/path/file.csv'):
    print("File Exists")
    my_df = spark.read.load("/path/file.csv")
    ...
else:
    print("File doesn't exist")
– Aaron M.
  • os.path.isfile checks for the existence of the file in the local filesystem, so on DBFS this always returns false, even when the file actually exists. – sparkDabbler Nov 16 '22 at 15:29