
Here's the scenario:

PySpark is generating the required CSV output (part files) along with extra files like

"_SUCCESS", "_committed", "_started"

When these files are saved in Blob Storage, the blob trigger executes four times (once per file added to the blob). Is there a way to avoid this and execute the blob trigger only when a CSV file is generated?

  • How is the Trigger defined? In Logic Apps, Data Factory, and Synapse, you can specify blob name prefixes (great for folder paths) and suffixes (like '.csv') to only process files that match the pattern. – Joel Cochran Dec 22 '22 at 14:31
  • Unless you are overwriting them, these files are a feature of spark. – John Stud Dec 22 '22 at 21:58
  • @JoelCochran Sorry, I should've mentioned more details: I am using a blob trigger Azure Function. – sarav Dec 27 '22 at 11:44

1 Answer


If your blob trigger is in Data Factory, Logic Apps, or Synapse, you can give .csv as the suffix, as suggested by @Joel Cochran in the comments.

Example in a Data Factory blob event trigger:

[Screenshot: Data Factory blob event trigger configured with .csv as the blob path suffix]
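If the trigger is a blob-triggered Azure Function instead (as the asker clarified in the comments), the same kind of filter can be expressed in the trigger's blob path pattern. A minimal sketch using the Python v2 programming model; the container name mycontainer, the function name, and the AzureWebJobsStorage connection setting are assumptions for illustration:

import logging
import azure.functions as func

app = func.FunctionApp()

# The {name}.csv pattern fires only for blobs whose names end in .csv,
# so _SUCCESS, _committed and _started never invoke the function.
@app.blob_trigger(arg_name="csvblob",
                  path="mycontainer/{name}.csv",
                  connection="AzureWebJobsStorage")
def process_csv(csvblob: func.InputStream):
    logging.info("Processing CSV blob: %s (%s bytes)",
                 csvblob.name, csvblob.length)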

_SUCCESS", "_commited" , "_started"

These files are created by Spark by default.
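If you keep Spark's own writer, the _SUCCESS marker can usually be suppressed through a Hadoop committer setting. A hedged sketch, assuming a mounted path /mnt/data/folder1; note that the _committed and _started files come from Databricks' transactional commit protocol and may still appear:

# Stop the Hadoop output committer from writing the _SUCCESS marker.
# This does not remove Databricks' _committed/_started files.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# coalesce(1) still produces a folder with a single part-xxxx file,
# just without the _SUCCESS marker.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/mnt/data/folder1")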

If you want to avoid them and store only a single CSV file, an alternative is to convert the PySpark DataFrame into a pandas DataFrame and write it to a single CSV file on the mounted path.

Code for generating a dynamic CSV file name using the date. The code for getting the current date as a string is taken from this answer by stack0114106.

# Get the current timestamp from Spark and format it as a string
dateFormat = "%Y%m%d_%H%M"
ts = spark.sql("select current_timestamp() as ctime").collect()[0]["ctime"]
sub_fname = ts.strftime(dateFormat)

# Build the target path on the mounted storage, e.g. part-20221227_1144.csv
filename = "/dbfs/mnt/data/folder1/part-" + sub_fname + ".csv"
print(filename)

# Convert to pandas and write a single CSV file (no extra marker files)
pandas_converted = df.toPandas()
pandas_converted.to_csv(filename)
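Note that toPandas() collects the entire DataFrame onto the driver, so this approach only suits results that fit comfortably in driver memory.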


Single CSV file in Blob Storage:

[Screenshot: the single part-<timestamp>.csv file in the blob container]
