10

I am trying to check whether a file exists before reading it with PySpark on Databricks, to avoid exceptions. I tried the code snippet below, but I get an exception when the file is not present:

from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .option("delimiter", ",") \
        .options(header='true', inferschema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")

When the file is present, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'".

– Onkar Musale, Amareshwar Reddy

7 Answers

12

Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the scheme of your cluster's file system.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    # resolve the FileSystem for the path's bucket;
    # path.split("/")[2] extracts the bucket name from "s3://bucket/..."
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
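
A minimal usage sketch (the bucket and key below are placeholders, not from the original answer):

path = "s3://my-bucket/data/file.csv"  # hypothetical path, for illustration only
print(path_exists(path))  # True if the object exists, False otherwise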
– rosefun
  • This worked for me to verify a file or path existed on S3. However, I didn't need to split the path to create the URI. This is what worked for me:

        def path_exists(self, path, sc):
            # spark is a SparkSession
            fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
                sc._jvm.java.net.URI.create("s3://" + path),
                sc._jsc.hadoopConfiguration(),
            )
            return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))

    – Jacob Levinson Sep 16 '21 at 20:03
6
# sc is the SparkContext; with no URI argument this returns the default FileSystem
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
– Prathik Kini
  • With an S3 filesystem, this approach fails. [Here's a solution for S3](https://gist.github.com/drorata/73c15740f83d0cb0b187d00a57ca74a1#check-file-existence-on-s3-using-pyspark). – Dror Jul 07 '19 at 06:22
    @dror do you know if there is a way to check if a path like `s3://my-bucket-s3-test/lookuo*.csv` exists? – andresg3 Sep 11 '20 at 01:46
  • @andresg3 did you find an answer to your question? – Shivangi Singh Jul 22 '22 at 03:47
4

The answer posted by @rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))

The function is the same, and it works fine to check whether a file exists in the S3 bucket path that you provide.

You will have to adapt this function based on how you specify the path value you pass to it.

path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)

If the path variable that you define has the s3 prefix in it, then it will work.

Also, the portion of the code that splits the string gets you just the bucket name: path.split("/")[2] will give you `bucket-name`.
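
To make the indexing concrete, here is what that split produces for the example path above:

path = "s3://bucket-name/import/data/"
print(path.split("/"))
# ['s3:', '', 'bucket-name', 'import', 'data', '']
# index 2 is the bucket name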

But if you don't have the s3 prefix in the path, then you will have to use the function with a small change, as below:

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
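
A usage sketch for this variant (the bucket name is a placeholder); note the path carries no s3:// prefix, since the function adds it:

path = "bucket-name/import/data/"  # no s3:// prefix here
pathexists = path_exists(path)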
– Nikunj Kakadiya
2

Looks like you should change `except IOError:` to `except AnalysisException:`.

Spark throws different errors/exceptions than regular Python in a lot of cases. It's not doing typical Python IO operations when reading a file, so it makes sense for it to throw a different exception.
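
A minimal sketch of that change applied to the question's code, assuming the AnalysisException import location from pyspark.sql.utils and the built-in csv reader rather than the external com.databricks.spark.csv package:

from pyspark.sql.utils import AnalysisException

try:
    # same read as in the question, just catching Spark's exception
    df = spark.read.format('csv') \
        .options(header='true', inferSchema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except AnalysisException:
    print("file not found")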

– dijksterhuis
  • `AnalysisException` is thrown regularly by Spark in many other situations, so even though it makes sense on the surface, it is better to check for the reason why this exception occurred. So the solution proposed by @Prathik makes more sense. – D3V Apr 09 '19 at 12:50
1

Nice to see you on Stack Overflow.

I second dijksterhuis's solution, with one caveat: AnalysisException is a very general exception in Spark, and may result from various causes, not only from a missing file.

If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether that's S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
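
As a sketch of what such a direct storage check could look like for S3, assuming boto3 is available (the bucket and key names here are hypothetical):

import boto3
from botocore.exceptions import ClientError

def s3_key_exists(bucket, key):
    # HEAD the object; a 404 error code means the key does not exist
    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

print(s3_key_exists("my-bucket", "import/data/file.csv"))  # hypothetical names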

– Elior Malul
-1
dbutils.fs.ls(file_location)

Do not import dbutils. It's already there when you start your cluster.
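
As pointed out in the comments below, dbutils.fs.ls itself raises an exception when the path is missing, so a sketch of a wrapper could look like this (the broad except is a simplification; Databricks surfaces a Java-backed exception here):

def file_exists(file_location):
    try:
        dbutils.fs.ls(file_location)
        return True
    except Exception:
        # dbutils.fs.ls raises when the path does not exist
        return False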

– Mustafa Aydın, Nayan Sarkar
  • It will still throw an exception if the file doesn't exist. – Alex Ott Mar 31 '21 at 12:50
  • Hi @AlexOtt, this function provides you a list of files and folders in a given path. Now you have to be sure about the path up to a certain extent, correct? Then you can look at what files and folders are in the subsystem and go in accordingly. – Nayan Sarkar Apr 08 '21 at 06:44
-2

You can validate the existence of a file as shown here:

import os

if os.path.isfile('/path/file.csv'):
    print("File Exists")
    my_df = spark.read.load("/path/file.csv")
    ...
else:
    print("File doesn't exist")
– Aaron M.
  • os.path.isfile checks for the existence of the file in the local filesystem, so on DBFS this always returns false, even when the file actually exists. – sparkDabbler Nov 16 '22 at 15:29