
I am trying to read a .csv file on S3 into a PySpark dataframe in Glue. However, it keeps failing with an "AnalysisException: Path does not exist: s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv" error. I have verified the path and the file a hundred times and they do exist on S3. I am racking my brain over this.

I have even built a function (below) to check whether the file exists on S3, and it runs successfully and returns the file name correctly. Hence, I'm a bit confused as to where I am going wrong.

import boto3

def get_file_name(bucket_name, directory_remote, file_name):
    try:
        s3 = boto3.client('s3')
        objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=f"{directory_remote}/{file_name}")
        # Return the final path segment of the first key matching the prefix
        for obj in objects['Contents']:
            return obj['Key'].split("/")[-1]
    except Exception as e:
        print(f"Error encountered in get_file_name: {e}")

bucket = "kp-landing-dev"
directory_remote = "input/kp/kp"
toName_csv = "export_incr_20230611183316"
myfile = get_file_name(bucket, directory_remote, toName_csv)  # outputs "export_incr_20230611183316"

df = spark.read.options(header='true').csv(f"s3://{bucket}/{directory_remote}/{myfile}", schema=schema)
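To rule out a mismatch between the name the listing returns and the path actually handed to Spark, I also print the exact URI before reading. Here is the small helper I use to build it (the function name is just my own; it makes no AWS calls):

```python
def build_s3_uri(bucket, prefix, file_name):
    # Join the parts into the exact URI Spark will receive,
    # stripping stray slashes so we never produce "//" in the path
    return "s3://" + "/".join(part.strip("/") for part in (bucket, prefix, file_name))

uri = build_s3_uri("kp-landing-dev", "input/kp/kp", "export_incr_20230611183316.csv")
# uri == "s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv"
```

Printing this also makes it obvious if the returned file name is missing its .csv extension, since the URI would then not end in ".csv".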

Can someone please help?

marie20

1 Answer


You can use a glob pattern and let Spark match the file(s) itself:

df = spark.read.csv("s3://kp-landing-dev/input/kp/kp/*.csv")
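If you want to confirm locally which keys such a pattern would pick up before handing the path to Spark, you can filter a boto3 listing yourself with `fnmatch`. A minimal sketch, using hypothetical key names instead of a live S3 call:

```python
from fnmatch import fnmatch

# Hypothetical keys, as they might come back from list_objects_v2
keys = [
    "input/kp/kp/export_incr_20230611183316.csv",
    "input/kp/kp/export_incr_20230611183316",   # same name, no extension
    "input/kp/kp/notes.txt",
]

# Keep only keys whose final path segment matches the glob
matches = [k for k in keys if fnmatch(k.split("/")[-1], "*.csv")]
# matches == ["input/kp/kp/export_incr_20230611183316.csv"]
```

This also highlights a common cause of the "Path does not exist" error: an object whose key doesn't carry the extension you expect will simply not match.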
parisni