I'm reading Parquet files located in Azure Storage. I have multiple source folders, and each folder needs to become a separate dataframe. I'm looping through the source folders and building the dataframe-creation commands in a for loop, like so:
df_apples = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/apples')
df_oranges = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/oranges')
df_bananas = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/bananas')
df_mangoes = spark.read.option('mergeSchema', 'true').parquet('abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet/mangoes')
But I'm not able to run these commands from inside the loop. If they were SQL statements, I could simply pass each one to spark.sql(<my_sql_statement>) to execute it. But with spark.read, I can't figure out how to execute the generated command and assign the result to a dataframe.
Maybe this is really simple, but I'm just not getting it. Can someone please help?
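For context, here is the kind of pattern I've been considering: instead of generating separate variable names like df_apples, store the dataframes in a dict keyed by folder name. This is only a sketch using the base path and folder names from my example above; the helper name read_sources is mine:

```python
def read_sources(spark, base, folders):
    """Return {folder_name: DataFrame}, one dataframe per source folder."""
    return {
        name: spark.read.option('mergeSchema', 'true').parquet(f'{base}/{name}')
        for name in folders
    }

base = 'abfss://files@mystorageaccount.dfs.core.windows.net/fruits/source_parquet'
folders = ['apples', 'oranges', 'bananas', 'mangoes']

# In a notebook where `spark` is the active SparkSession:
# dfs = read_sources(spark, base, folders)
# dfs['apples'].show()
```

Is something like this the right approach, or is there a more idiomatic way to run the reads from inside the loop?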