I'm attempting to crawl through a directory in a Databricks notebook to find the latest parquet file. dbutils.fs.ls does not appear to return any metadata about files or folders. Are there any alternative methods in Python to do this? The data is stored in an Azure Data Lake mounted to the DBFS under "/mnt/foo". Any help or pointers are appreciated.

Daveed

1 Answer

On Azure Databricks, as far as I know, the DBFS path dbfs:/mnt/foo is the same as the Linux path /dbfs/mnt/foo, so you can simply use os.stat(path) in Python to get file metadata such as the create date or modified date.

Here is my sample code.

import os
from datetime import datetime

path = '/dbfs/mnt/test'
# List the directory through the local /dbfs mount so plain Python I/O works
fdpaths = [path + "/" + fd for fd in os.listdir(path)]
for fdpath in fdpaths:
    statinfo = os.stat(fdpath)
    # Note: on Linux, st_ctime is the inode change time, not strictly the creation time
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    modified_date = datetime.fromtimestamp(statinfo.st_mtime)
    print("The statinfo of path %s is %s,\n\twhose create date and modified date are %s and %s"
          % (fdpath, statinfo, create_date, modified_date))

The output prints the full stat result for each path, together with its create date and modified date.
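
To answer the original question of finding the latest parquet file, the same os.stat approach can be extended. Below is a minimal sketch, assuming the files live under the mount /dbfs/mnt/foo from the question and that a recursive walk is wanted (drop os.walk for a flat directory); note that max raises ValueError if no parquet files are found.

import os

root = '/dbfs/mnt/foo'  # mount point from the question

# Collect every parquet file under the mount (the recursive walk is an assumption)
parquet_files = [
    os.path.join(dirpath, name)
    for dirpath, _, names in os.walk(root)
    for name in names
    if name.endswith('.parquet')
]

# Pick the file with the most recent modification time
latest_file = max(parquet_files, key=lambda p: os.stat(p).st_mtime)
print("The latest parquet file is %s" % latest_file)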

Peter Pan
  • Caveat with this, however: though Databricks DBFS appears to store file metadata, you have very limited control over adjusting things like ownership, mode, and utime/mtime. I am unable to change them through Python or a shell command. – Azmisov Mar 10 '21 at 20:23