
I'm reading parquet files located in Azure Storage. I have multiple source folders where each folder needs to be a separate dataframe. I'm looping through each of the source folders and creating the commands for dataframe creation using a for loop like so:

df_apples = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/apples')

df_oranges = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/oranges')

df_bananas = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/bananas')

df_mangoes = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/mangoes')

But I can't figure out how to execute these from inside the loop. If it were a SQL statement, I could simply call spark.sql(<my_sql_statement>) to run it. But with spark.read, I don't see how to run the read and also assign the resulting dataframe to a variable.

Maybe this is really simple, but I'm just not getting it. Can someone please help?

LearneR
  • why do you need seperate dataframe ? adding source file as column is not enough ? https://stackoverflow.com/questions/39868263/spark-load-data-and-add-filename-as-dataframe-column – maxime G Jun 14 '23 at 06:53

1 Answer


You can easily use spark.read inside a loop; you just need a data structure such as a list, dictionary, or tuple to hold the results.

A simple example of what is possible follows.

path_apples = "/apples/"
path_oranges = "/oranges/"
path_bananas = "/bananas/"
path_mangoes = "/mangoes/"

dict_paths = {
    "apples": path_apples,
    "oranges": path_oranges,
    "bananas": path_bananas,
    "mangoes": path_mangoes,
}
dict_dfs = {}

# Read each source folder and store its dataframe under the folder's name.
for name, path in dict_paths.items():
    dict_dfs[name] = spark.read.option("mergeSchema", "true").parquet(path)

# Access an individual dataframe with dict_dfs["apples"], etc.
dict_dfs

Make sure to adapt as you need.
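If the folders all share a common parent path, the loop above can be condensed into a dict comprehension. Here is a minimal sketch of that pattern, with a hypothetical load() function standing in for spark.read.option("mergeSchema", "true").parquet so it can be shown without a running Spark session; the base path and folder names are taken from the question:

```python
# Hypothetical stand-in for spark.read.option(...).parquet(path), used only
# to demonstrate the dictionary pattern without a Spark session.
def load(path):
    return f"DataFrame<{path}>"

base = "abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/"
names = ["apples", "oranges", "bananas", "mangoes"]

# One comprehension replaces the explicit loop: folder name -> dataframe.
dict_dfs = {name: load(base + name) for name in names}

# Each dataframe is then retrieved by name rather than via a separate variable.
df_apples = dict_dfs["apples"]
```

With a real Spark session you would swap load for the actual spark.read call; the keying-by-name idea is the same either way.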