What I've Tried
I have JSON data that comes from an API, and I saved all of it into a single directory. Now I am trying to load this data into a Spark DataFrame so I can run ETL on it. The API returned the data fragmented per division, and some divisions had very little data while others had a lot. The directory I'm trying to load from looks as follows (the sizes were listed with the snippet shown just after this list):
* json_file_1 - 109MB
* json_file_2 - 2.2MB
* json_file_3 - 67MB
* json_file_4 - 105MB
* json_file_5 - **2GB**
* json_file_6 - 15MB
* json_file_7 - 265MB
* json_file_8 - 35MB
* json_file_9 - **500KB**
* json_file_10 - 383MB
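A minimal sketch of how I listed those sizes, assuming the Synapse-bundled `mssparkutils` helper is available on the pool (the path is the same placeholder as below):

```python
from notebookutils import mssparkutils  # bundled with Synapse Spark pools

path = 'abfss://blob_container@my_data_lake.dfs.core.windows.net/path_to_directory_described_above'

# List every file in the directory and print its size in MB.
for f in mssparkutils.fs.ls(path):
    if f.isFile:
        print(f"{f.name} - {f.size / 1024 / 1024:.1f} MB")
```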
I'm using Azure Synapse with an Apache Spark pool, and the directory I'm loading from resides in an ADLS Gen2 data lake. I'm using the following code to load all data files in the directory; for other projects this code works fine and fast.
```python
blob_path_raw = 'abfss://blob_container@my_data_lake.dfs.core.windows.net'
df = spark.read.json(path=f"{blob_path_raw}/path_to_directory_described_above")
```
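To make this concrete, one variation I could try is supplying an explicit schema so Spark can skip the inference pass over every file; a minimal sketch (the field names below are placeholders, not my real schema):

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema -- the real JSON has different fields.
schema = StructType([
    StructField("division", StringType(), True),
    StructField("id", LongType(), True),
    StructField("payload", StringType(), True),
])

# With an explicit schema Spark does not need to scan the files up front
# to infer one, which is a common cause of slow spark.read.json calls.
df = (
    spark.read
         .schema(schema)
         .json(f"{blob_path_raw}/path_to_directory_described_above")
)
```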
My Question
The code above is taking extremely long to run (more than 3 hours at the time of writing), and I suspect it is stuck somewhere, as loading ~4 GB of data is something a Spark pool should handle easily. I suspect something is going wrong in Spark because of the heterogeneous file sizes, but I am still rather new to Spark, as we only just migrated to Azure Synapse. What is going wrong here, and how do I debug it?
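One sanity check I could run is reading the files one at a time, to see whether a single file (e.g. the 2GB one) is the problem rather than the directory read as a whole; roughly (file names copied from the list above):

```python
# Read one small and one large file separately to see whether the slowdown
# comes from a specific file or from the directory read as a whole.
df_small = spark.read.json(f"{blob_path_raw}/path_to_directory_described_above/json_file_9")
print(df_small.count())

df_large = spark.read.json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")
print(df_large.count())
```

Is that a sensible way to narrow it down, or is there a better place to look (Spark UI, logs, etc.)?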