In a Databricks notebook, I am reading JSON files with readStream. Each JSON record has a structure like this:
id | entityType | eventId |
---|---|---|
1 | person | 123 |
2 | employee | 234 |
3 | client | 687 |
4 | client | 687 |
My code:
# Auto Loader (cloudFiles) options
cloudfile = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": SCHEMA_LOCATION,
    "cloudFiles.useNotifications": True,
}

df = (spark.readStream
    .format("cloudFiles")
    .options(**cloudfile)
    .load(SOURCE_PATH)
)
How can I write it out with writeStream to different folders, depending on column values?
Output example:
mainPath/{entityType}/{eventId}/data.json
- entity with id = 1 to file: mainPath/person/123/data.json
- entity with id = 2 to file: mainPath/employee/234/data.json
- entity with id = 3 to file: mainPath/client/687/data.json
- ...
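For reference, a minimal sketch of the kind of writeStream call I am after, assuming partitionBy on entityType and eventId is acceptable; it would produce folder names like entityType=person/eventId=123 and Spark part-* file names rather than the exact person/123/data.json layout above (CHECKPOINT_PATH and MAIN_PATH are placeholder variables, not from my actual code):

# Sketch only: partitionBy splits the output by column values,
# but yields key=value folder names and part-* file names.
(df.writeStream
    .format("json")
    .option("checkpointLocation", CHECKPOINT_PATH)  # placeholder path
    .partitionBy("entityType", "eventId")
    .outputMode("append")
    .start(MAIN_PATH)  # placeholder for mainPath
)

Is something like this the right approach, or would I need foreachBatch to control the exact folder and file names?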