
Currently I load multiple parquet files with this code:

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

(Inside the Voucher folder, there is one subfolder per date, with one parquet file in each.)

How can I add the creation date of each parquet file to my DataFrame?

Thanks

EDIT 1:

Thanks rainingdistros, I wrote this:

import os
from datetime import datetime

path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = path + 'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)  # st_ctime: time of last metadata change
display(create_date)

Now I must find a way to loop through all the files and add the column to the DataFrame.
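One possible shape of that loop (a sketch, not a tested solution: it assumes each date folder under the mount holds one parquet file readable by pd.read_parquet, and the creation_date column name is my own choice):

import os
from datetime import datetime
import pandas as pd

root = "/dbfs/mnt/dev/bronze/Voucher"
frames = []
for date_folder in sorted(os.listdir(root)):
    folder = f"{root}/{date_folder}"
    for name in os.listdir(folder):
        full = f"{folder}/{name}"
        create_date = datetime.fromtimestamp(os.stat(full).st_ctime)
        pdf = pd.read_parquet(full)          # one parquet file per date folder
        pdf["creation_date"] = create_date   # same timestamp for every row of this file
        frames.append(pdf)

df = spark.createDataFrame(pd.concat(frames, ignore_index=True))

Reading each file with pandas keeps the os.stat call and that file's rows together; concatenating at the end yields a single Spark DataFrame.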

Salah K.
  • By creation date, you mean the folder name? i.e. Voucher/<date>/? If so, try using the input_file_name() function as mentioned in the link https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.input_file_name.html and then use a substring perhaps to get the date (see the sketch after these comments). – rainingdistros Nov 18 '22 at 10:15
  • The creation date of the parquet file (OK, it's the same as the date in the folder name, but I think it's better to work with the properties of the parquet file rather than a folder name, right?). Thanks for the link. – Salah K. Nov 18 '22 at 10:39
  • Well dbutils has the mtime - you could list the files and then, while looping, capture the mtime and add it as a column. Refer https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls - Alternatively you could use the os module to capture the creation time - Refer https://pynative.com/python-file-creation-modification-datetime/ - Note: I am not able to get it to work with Databricks though - got the error `'os.stat_result' object has no attribute 'st_birthtime'` - so if it is all the same, you could use ctime instead. – rainingdistros Nov 18 '22 at 11:13
  • Here is another link - https://stackoverflow.com/questions/51932783/how-to-loop-through-azure-datalake-store-files-in-azure-databricks - get the list of files as shown in the link, then for each file use the code you have written. Please note ctime records the time of the last change to the metadata - ideally it should have been birthtime, but that does not seem to work in my trials... Hope it works... – rainingdistros Nov 20 '22 at 10:50
  • Hey @rainingdistros, consider posting it as an answer so it helps other community members facing the same issue. – Saideep Arikontham Nov 23 '22 at 07:04
  • @SaideepArikontham done...hope it helps someone.. – rainingdistros Nov 23 '22 at 07:50
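A sketch of the input_file_name() approach suggested in the comments, assuming the folder names are yyyy-MM-dd dates as in the question (the file_path and creation_date column names are illustrative):

from pyspark.sql.functions import input_file_name, regexp_extract, to_date

df = (spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
        .withColumn("file_path", input_file_name())
        # pull the yyyy-MM-dd folder name out of a path like .../Voucher/2022-09-23/XXXXXX.parquet
        .withColumn("creation_date",
                    to_date(regexp_extract("file_path", r"/Voucher/(\d{4}-\d{2}-\d{2})/", 1))))

This stays entirely in Spark, so no per-file loop is needed.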

2 Answers

  • The information returned by os.stat might not be accurate unless your required operation (i.e., adding the additional column with the creation time) is the first operation performed on these files.

  • Each time the file is modified, both st_mtime and st_ctime are updated to the modification time. The following images show this:

[Image: os.stat output for the file before modification]

  • When I modify this file, the changes can be observed in the information returned by os.stat.

[Image: os.stat output for the same file after it was modified]
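A quick way to observe this behaviour yourself (a sketch; the file path is the one used in the screenshots):

import os
from datetime import datetime

info = os.stat("/dbfs/mnt/repro/2022-12-01/sample1.csv")
print(datetime.fromtimestamp(info.st_mtime))  # time of last content modification
print(datetime.fromtimestamp(info.st_ctime))  # time of last metadata change; on Linux this also moves on content writes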

  • So, if adding this column is the first operation that is going to be performed on these files, then you can use the following code to add this date as a column to your files.

import os
from datetime import datetime
import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
for file in os.listdir(path):
    pdf = pd.read_csv(f"{path}/{file}")
    statinfo = os.stat(f"{path}/{file}")          # stat each file, not a single hardcoded one
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    pdf['creation_date'] = create_date.date()     # pandas broadcasts the scalar to every row
    pdf.to_csv(f"{path}/{file}", index=False)

[Image: output of the loop showing the creation date captured for each file]

  • These files would have this new column as shown below after running the code:

[Image: file contents showing the new creation_date column]

  • It might be better to take the value directly from the folder name in this case, since the information is already available; all that needs to be done is to extract it and add the column to the files in the same manner as the code above.
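For instance, with the same loop as above, the date could be parsed from the folder name instead of os.stat (a sketch; it assumes the folder is named yyyy-mm-dd as in the screenshots):

import os
from datetime import datetime
import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
folder_date = datetime.strptime(os.path.basename(path), "%Y-%m-%d").date()  # date taken from the folder name
for file in os.listdir(path):
    pdf = pd.read_csv(f"{path}/{file}")
    pdf["creation_date"] = folder_date   # no dependency on filesystem timestamps
    pdf.to_csv(f"{path}/{file}", index=False)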
Saideep Arikontham

See if the steps below help:

  1. Refer to the link from the comments to get the list of files in DBFS - SO - Loop through Files in DBFS (https://stackoverflow.com/questions/51932783/how-to-loop-through-azure-datalake-store-files-in-azure-databricks).

  2. Once you have the files, loop through them and for each file use the code you have written in your question.

Please note that dbutils has the mtime of a file in it. The os module provides a way to identify the ctime, i.e. the time of the most recent metadata change on Unix; ideally it should have been st_birthtime, but that does not seem to work in my trials... Hope it works for you...
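A sketch of that dbutils-based approach, assuming a runtime where the FileInfo entries returned by dbutils.fs.ls expose modificationTime in milliseconds, as in the docs linked above (the creation_date column name is illustrative):

from functools import reduce
from datetime import datetime
from pyspark.sql.functions import lit

frames = []
for date_dir in dbutils.fs.ls("/mnt/dev/bronze/Voucher/"):
    for f in dbutils.fs.ls(date_dir.path):
        mtime = datetime.fromtimestamp(f.modificationTime / 1000)   # modificationTime is in milliseconds
        frames.append(spark.read.parquet(f.path)
                           .withColumn("creation_date", lit(mtime.date())))

df = reduce(lambda a, b: a.unionByName(b), frames)   # one DataFrame with the extra column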

rainingdistros