Find latest file pyspark

Question

I've figured out how to find the latest file using Python. Now I'm wondering whether I can do the same with PySpark. Currently I specify a path explicitly, but I'd like PySpark to pick the most recently modified file.

Current code looks like this:

df = sc.read.csv("Path://to/file", header=True, inferSchema=True)

Thanks in advance for your help.

Are the files on HDFS? — philantrovert, May 25 '18 at 10:15

score 5 · Accepted Answer · answered May 25 '18 at 10:31

I copied the code to get the HDFS API to work with PySpark from this answer: Pyspark: get list of files/directories on HDFS path

URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.s3.S3FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

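# Create the FileSystem object here. One option (an assumption, not part of
# the original answer) is to resolve it from the path and Spark's Hadoop config:
fs = Path("Path://to/file").getFileSystem(sc._jsc.hadoopConfiguration())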

files = fs.listStatus(Path("Path://to/file"))

# You can also filter for directory here
file_status = [(file.getPath().toString(), file.getModificationTime()) for file in files]

file_status.sort(key=lambda tup: tup[1], reverse=True)

most_recently_updated = file_status[0][0]

# Options must be set on the reader before calling csv()
spark.read.option(...).csv(most_recently_updated)
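The same steps can be wrapped into two small helpers. This is only a sketch under a few assumptions: `latest_file` and `list_file_status` are hypothetical names, `sc` is a SparkContext, and the file system is resolved from the path via Hadoop's `Path.getFileSystem` rather than by constructing an S3FileSystem by hand. It also uses `max()` instead of sorting, since only the newest entry is needed, and skips sub-directories.

```python
def latest_file(file_status):
    """Return the path with the greatest modification time.

    file_status is a list of (path, modification_time) tuples, as built
    in the answer above. Returns None for an empty list.
    """
    if not file_status:
        return None
    # max() needs a single pass; sorting the whole list is unnecessary
    return max(file_status, key=lambda tup: tup[1])[0]


def list_file_status(sc, directory):
    """List (path, modification_time) for plain files under directory.

    Assumes sc is a SparkContext; directory is a placeholder path string.
    """
    jvm = sc._gateway.jvm
    path = jvm.org.apache.hadoop.fs.Path(directory)
    # Resolve the FileSystem from the path's scheme and Spark's Hadoop config
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    return [
        (status.getPath().toString(), status.getModificationTime())
        for status in fs.listStatus(path)
        if status.isFile()  # skip sub-directories
    ]
```

With a running SparkContext, `latest_file(list_file_status(sc, "Path://to/file"))` would give the path to pass to the CSV reader.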

Thank you for your answer! I'm not familiar with PySpark at all, so I'm just trying things as I go. With your answer I get the following error message: AttributeError: 'SparkSession' object has no attribute '_gateway'. Any idea why? — Eles, May 25 '18 at 11:47
`sc` is SparkContext. I think you have figured it out by now. — philantrovert, May 25 '18 at 12:22
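For anyone hitting the same AttributeError: a SparkSession exposes its SparkContext through the `sparkContext` attribute, so `sc = spark.sparkContext` is usually all that is needed. The sketch below shows a tolerant variant that accepts either object. The helper name `as_spark_context` is hypothetical, and stand-in objects replace real Spark classes so it runs without a Spark installation.

```python
from types import SimpleNamespace


def as_spark_context(spark_or_sc):
    # A SparkSession exposes its SparkContext as .sparkContext; an object
    # without that attribute is assumed to be a SparkContext already and
    # is returned unchanged.
    return getattr(spark_or_sc, "sparkContext", spark_or_sc)


# Stand-in objects (not real Spark classes) to illustrate the behaviour
fake_context = SimpleNamespace(_gateway="jvm gateway")
fake_session = SimpleNamespace(sparkContext=fake_context)

assert as_spark_context(fake_session) is fake_context
assert as_spark_context(fake_context) is fake_context
```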
