
I have a for loop running in Databricks: the first iterations run fast, then it gets slower, and then it stops making progress at all. I know that is common in for loops if the data size grows on each iteration and/or garbage variables are not being deleted, but in that case I should at least see RAM/CPU/disk/network nearing 100%, right? They are not; in fact they are barely used at all. I also expected to see Spark jobs being processed in the Spark UI, but there are none while the for loop is stuck, even though the notebook says the cell is still running.

So is some resource other than RAM/CPU/disk/network getting clogged? And why does Spark show no running jobs even though the for loop is still running?

Also, the Spark UI shows no job running even though the notebook says the command is still running, and I know for sure it hasn't finished because I should be seeing files getting created on AWS S3.

My use case is generating some dummy data for testing. I want to create many CSV files with the same dataframe schema that differ only in a datestamp column, to emulate an incremental-load scenario: 400 CSVs that are essentially the same dataframe but with the datestamp column shifted by one day each, as if I received a file on Jan 01, another on Jan 02, another on Jan 03, and so on for 400 days.

For that I have my "base" dataframe input_df and a for loop in Databricks that reads from it, increments the id column and the datestamp column (as if they were new rows), and writes the result to S3 under a datestamp string. Here is the loop:

from datetime import datetime, timedelta
from pyspark.sql.functions import col, date_add, date_format, max, min

another = input_df

for i in range(400):
    # get the min and max of column "id" to compute the id shift for the new df
    aggs = another.agg(max('id'), min('id'))
    max_id = aggs.collect()[0][0]
    min_id = aggs.collect()[0][1]

    # shift the id column to emulate "new data" and bump the datestamp column by one day
    another = another.withColumn('id', col('id') + max_id - min_id + 1)\
        .withColumn('created_time', date_add(col('created_time'), 1))\
        .withColumn('created_time', date_format(col('created_time'), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))

    # build the folder name from the datestamp
    date_processed = datetime.strptime('20220112', '%Y%m%d') + timedelta(days=i + 1)
    date_processed = date_processed.strftime('%Y%m%d')
    print(date_processed)

    # write it out as a single csv file
    another.coalesce(1).write.option('header', 'true').csv('dbfs:/tmp/wiki/transaction/' + date_processed)
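
For reference, input_df is nothing special: it is randomly generated data, and the only columns the loop relies on are id and created_time. A minimal stand-in (the row count and the extra value column here are made up, just for illustration) would be something like this:

from pyspark.sql.functions import rand, lit, to_timestamp

# minimal stand-in for input_df: an id, a random value and a starting timestamp
input_df = spark.range(1000)\
    .withColumn('value', rand())\
    .withColumn('created_time', to_timestamp(lit('2022-01-12 00:00:00')))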

Now the notebook cell runs for about 11 files (which created and completed about 40 jobs) and then stops. I thought some resource was nearing capacity, but this is my problem:

  1. In the Spark UI no job is running, as if the notebook were not even creating more jobs.
  2. All of the first 40 jobs created by the first 11 iterations of the loop are completed (and I see the files written on S3).
  3. In the Ganglia UI I see only the driver is doing everything (which is expected because my input_df was created using the rand library). But neither its CPU/RAM/network/disk are full, and they have not even fluctuated for nearly an hour.

Here's a picture of the driver in Ganglia. You can see the resources (CPU, RAM, network) do spike at some point (when the for loop was actually working, creating and completing Spark jobs), but then they drop back and stay flat, while the for loop has stopped making progress even though it shouldn't have.

My ultimate questions are:

  1. Is some resource other than RAM/CPU/disk/network getting clogged?
  2. Why does Spark show no running jobs even though the for loop is still running?
  3. Why doesn't this for loop work? I know for loops are not good practice in Python, let alone in Spark, but I have no explanation for why it stops processing.

[Ganglia screenshot of the driver metrics]

Also, as a side note, I noticed this in the processes running on the driver (even though no Spark job is running), so I suspect Spark is not closing some ports/connections or not marking processes as done.

[Screenshot of the processes running on the driver]

2 Answers


Instead of writing 400 times I would try writing once. Admittedly this method puts the output in 400 folders in DBFS with one file each, but you can write a script after the fact to relocate the files into one folder and rename them based on their current directory (see the sketch after the loop below). It's a workaround, but worth a shot if your current job isn't working.

from pyspark.sql.types import StructField, StringType, StructType

# empty dataframe with the base schema plus the partition column
schema = StructType(input_df.schema.fields + [StructField('date_processed', StringType())])
final_df = spark.createDataFrame([], schema)

for i in range(400):
    # do your stuff: build `another` as before and add a date_processed string column to it
    final_df = final_df.union(another)

# one single write at the end, partitioned by the datestamp column
final_df.write.partitionBy('date_processed').option('header', 'true').csv('dbfs:/tmp/wiki/transaction_partitioned')  # example path
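
The relocation/rename script could look roughly like this, assuming Databricks' dbutils and the example paths above (the source folders are the date_processed=YYYYMMDD directories produced by partitionBy; adjust paths and target layout to your needs):

src_root = 'dbfs:/tmp/wiki/transaction_partitioned'   # where partitionBy wrote the data
dst_root = 'dbfs:/tmp/wiki/transaction_flat'          # one flat folder with a csv per day
dbutils.fs.mkdirs(dst_root)

for entry in dbutils.fs.ls(src_root):
    if not entry.name.startswith('date_processed='):
        continue  # skip _SUCCESS and other marker files
    day = entry.name.split('=')[1].rstrip('/')        # e.g. '20220113'
    # each partition folder should hold a single part-*.csv file
    part_files = [f for f in dbutils.fs.ls(entry.path) if f.name.endswith('.csv')]
    if part_files:
        dbutils.fs.mv(part_files[0].path, f'{dst_root}/{day}.csv')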

You need to understand some concepts of Spark:

  1. Spark is lazily evaluated. Transformations are only applied when an action is called.
  2. Transformations and actions are converted into an execution/query plan.
  3. Without persistence, data is recomputed every time, even when some computations are duplicated.

Let's take an easy example first:

df_output = df_a.join(df_b, on='col', how='inner').groupBy('col').count()
df_output.show(10)
df_output.write.csv(path)

Suppose you have two dataframes: you do some joining and grouping, then show the output to check it. You're happy with the result and want to save it. In this example you have two execution plans:

  1. Plan 1 (triggered by show): file scanning -> joining -> grouping -> showing
  2. Plan 2 (triggered by write): file scanning -> joining -> grouping -> file writing

If you add persistence to the above case:

df_output = df_a.join(df_b, on='col', how='inner').groupBy('col').count().persist()
df_output.show(10)
df_output.write.csv(path)

The query plans change to:

  1. Plan 1 (triggered by show): file scanning -> joining -> grouping -> showing + data persisting
  2. Plan 2 (triggered by write): get the persisted data -> file writing
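
A quick way to see this, reusing df_output from the snippet above: once persist() is in place, the cached section of the plan should show up as an InMemoryRelation / InMemoryTableScan instead of another file scan and join.

df_output.explain()  # the cached section should appear as InMemoryRelation / InMemoryTableScan
df_output.show(10)   # the first action actually materializes the cache
df_output.explain()  # later plans read from the cache instead of rescanning and rejoining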

If you apply the above to your question: aggs, max_id, min_id and another are all computed on top of the result of the previous iteration. Therefore, as the loop goes on, each later iteration has to recompute all previous iterations to get its result if you don't have any data persistence. That causes a huge cost in query planning, optimization and execution. You can try to add:

another = another.withColumn('id', col('id') + max_id - min_id + 1)\
    .withColumn('created_time', date_add(col('created_time'), 1))\
    .withColumn('created_time', date_format(col('created_time'), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))\
    .persist()

Note that persistence is not suitable for every use case and you need to size your resources carefully. If you test persist in your example, you should see that the problem is not related to resource exhaustion.
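
Applied to the loop in the question, it could look roughly like this (the unpersist() call on the previous iteration is optional and just frees the old cached data once it is no longer needed):

previous = None
another = input_df

for i in range(400):
    row = another.agg(max('id'), min('id')).collect()[0]
    max_id, min_id = row[0], row[1]

    # persist so the next iteration starts from materialized data
    another = another.withColumn('id', col('id') + max_id - min_id + 1)\
        .withColumn('created_time', date_add(col('created_time'), 1))\
        .withColumn('created_time', date_format(col('created_time'), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))\
        .persist()

    date_processed = (datetime.strptime('20220112', '%Y%m%d') + timedelta(days=i + 1)).strftime('%Y%m%d')
    another.coalesce(1).write.option('header', 'true').csv('dbfs:/tmp/wiki/transaction/' + date_processed)

    # the write above materialized the new cache; the previous one can be dropped
    if previous is not None:
        previous.unpersist()
    previous = another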

Jonathan Lam