
Here's the scenario:

PySpark is generating the required CSV output (part files) along with extra files like

"_SUCCESS", "_committed", "_started"

When these files are saved in Blob Storage, the blob trigger executes four times (once per file added to the blob). Is there a way to avoid this and execute the blob trigger only when a CSV file is generated?

  • How is the Trigger defined? In Logic Apps, Data Factory, and Synapse, you can specify blob name prefixes (great for folder paths) and suffixes (like '.csv') to only process files that match the pattern. – Joel Cochran Dec 22 '22 at 14:31
  • Unless you are overwriting them, these files are a feature of spark. – John Stud Dec 22 '22 at 21:58
  • @JoelCochran Sorry, I should've mentioned more details: I am using a blob trigger Azure Function. – sarav Dec 27 '22 at 11:44

1 Answer


If your blob trigger is in Data Factory, Logic Apps, or Synapse, you can give .csv as the suffix, as suggested by @Joel Cochran in the comments.

Example in a Data Factory blob event trigger:

[Screenshot: Data Factory blob event trigger configured with .csv as the blob path suffix]
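If the trigger is a blob-triggered Azure Function instead (as the asker clarified in the comments), the same kind of filter can be expressed in the trigger's blob path pattern. A minimal sketch using the Python v2 programming model; the container name mycontainer, the function name, and the AzureWebJobsStorage connection setting are assumptions for illustration:

import logging
import azure.functions as func

app = func.FunctionApp()

# The {name}.csv pattern fires only for blobs whose names end in .csv,
# so _SUCCESS, _committed and _started never invoke the function.
@app.blob_trigger(arg_name="csvblob",
                  path="mycontainer/{name}.csv",
                  connection="AzureWebJobsStorage")
def process_csv(csvblob: func.InputStream):
    logging.info("Processing CSV blob: %s (%s bytes)",
                 csvblob.name, csvblob.length)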

_SUCCESS", "_commited" , "_started"

These files are created by Spark by default.
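If you keep Spark's own writer, the _SUCCESS marker can usually be suppressed through a Hadoop committer setting. A hedged sketch, assuming a mounted path /mnt/data/folder1; note that the _committed and _started files come from Databricks' transactional commit protocol and may still appear:

# Stop the Hadoop output committer from writing the _SUCCESS marker.
# This does not remove Databricks' _committed/_started files.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# coalesce(1) still produces a folder with a single part-xxxx file,
# just without the _SUCCESS marker.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/mnt/data/folder1")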

If you want to avoid them and store only a single CSV file, an alternative is to convert the PySpark DataFrame into a pandas DataFrame and write it to a single CSV file on the mounted path.

Code for generating a dynamic CSV file name using the date. The code for getting the current date as a string is taken from this answer by stack0114106.

# Get the current timestamp from Spark and format it as a string
dateFormat = "%Y%m%d_%H%M"
ts = spark.sql("select current_timestamp() as ctime").collect()[0]["ctime"]
sub_fname = ts.strftime(dateFormat)

# Build the target path on the mounted storage, e.g. part-20221227_1144.csv
filename = "/dbfs/mnt/data/folder1/part-" + sub_fname + ".csv"
print(filename)

# Convert to pandas and write a single CSV file (no extra marker files)
pandas_converted = df.toPandas()
pandas_converted.to_csv(filename)
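Note that toPandas() collects the entire DataFrame onto the driver, so this approach only suits results that fit comfortably in driver memory.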


Single CSV file in Blob Storage:

[Screenshot: the single part-<timestamp>.csv file in the blob container]
