
I'm reading messages from Kafka. The message schema is:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("file_path", StringType(), True),
    StructField("table_name", StringType(), True),
])
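
For reference, a minimal sketch of how such messages might be read and parsed in batch mode (the broker address and topic name below are placeholders, and the message values are assumed to be JSON-encoded):

from pyspark.sql.functions import col, from_json

raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "file_events")                   # placeholder topic
       .load())

# Kafka delivers the payload as binary; cast to string and parse with the schema
df = (raw.select(from_json(col("value").cast("string"), schema).alias("msg"))
         .select("msg.file_path", "msg.table_name"))

A streaming read would use spark.readStream with the same options.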

For each row in the dataframe that I read, I want to open the file at the specified file_path and write its contents to a Delta Lake table whose name is given in the table_name column.

So, for example, if a row in the dataframe is:

-------------------------------
|   file_path   | table_name  |
-------------------------------
| /tmp/file.csv |   table_1   |
-------------------------------

I want to be able to do:

data = spark.read.csv(df["file_path"])
data.write.format("delta").mode("append").saveAsTable(df["table_name"])
Kallie
  • If your dataframe is not huge, you should `collect` it to a native Python collection and then perform the write operation iteratively. – philantrovert Sep 13 '22 at 11:40
  • This answers my question exactly - https://stackoverflow.com/questions/65777481/read-file-path-from-kafka-topic-and-then-read-file-and-write-to-deltalake-in-str/65809786#65809786 – Kallie Sep 15 '22 at 06:53

1 Answer


Once you receive the data from Kafka, maybe you can try something like this, iterating over the rows on the driver:

# collect the rows to the driver: spark.read and saveAsTable need the
# SparkSession, which is only available on the driver, not inside an
# executor-side foreach
for row in dataframe.collect():
    # row["file_path"] has the file path; row["table_name"] has the table name
    data = spark.read.csv(row["file_path"])
    data.write.format("delta").mode("append").saveAsTable(row["table_name"])
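
If the Kafka source is consumed as a stream rather than a batch, per-row reads like this have to happen inside foreachBatch, which is the approach taken in the answer linked in the comments above. A minimal sketch, assuming the parsed stream is named df and using a placeholder checkpoint path:

def load_files(batch_df, batch_id):
    # the control messages are small, so collecting each micro-batch
    # to the driver is cheap
    for row in batch_df.collect():
        data = spark.read.csv(row["file_path"])
        data.write.format("delta").mode("append").saveAsTable(row["table_name"])

(df.writeStream
   .foreachBatch(load_files)
   .option("checkpointLocation", "/tmp/checkpoints/file_loader")  # placeholder path
   .start())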
Shane