
I'm trying to process some events with Spark Structured Streaming.

The incoming events look like:

Event 1:

    url: http://first/path/to/read/from...

Event 2:

    url: http://second/path/to/read/from...

And so on.

My goal is to read each of these URLs and generate a new DataFrame from it. So far I've done it with code like this, using collect():


def createDF(url):

    file_url = "abfss://" + container + "@" + az_storage_account + ".dfs.core.windows.net/" + az_storage_folder + "/" + url

    # Read the data
    binary = spark.read.format("binaryFile").load(file_url)

    # Do other operations
    ...

    # Save the data: write it into blob again

    return something

def loadData(batchDf, batchId):

    """
    batchDf:
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
        |                body|partition|     offset|sequenceNumber|        enqueuedTime|publisher|partitionKey|          properties|systemProperties|                 url|
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
        |[{"topic":"/subsc...|        0|30084343744|         55489|2021-03-03 14:21:...|     null|        null|[aeg-event-type -...|              []|http://path...|
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
    """

    """ Before .... 

    df = batchDf.select("url")
    url = df.collect()

    [createDF(item) for item in url]
    """
    # Now without collect()
    # Select the url field of the df
    url_select_df = batchDf.select("url")

    # Read url value
    result = url_select_df.rdd.map(lambda x: createDF(x.url))
  
query = df \
    .writeStream \
    .foreachBatch(loadData) \
    .outputMode("update") \
    .queryName("test") \
    .start() \
    .awaitTermination()

However, when I try to extract the URLs without collect(), I get the following error message:

It appears that you are attempting to reference SparkContext from a broadcast.

What could be happening?

Thank you very much for your help.

  • You're calling `spark.read` inside `foreachBatch`. I think that is not allowed – mck Mar 03 '21 at 14:19
  • Thanks @mck. So that means it has worked with `collect()` because the SparkSession is called on the driver, right? – basigow Mar 03 '21 at 14:34

1 Answer


Without the call to collect, the DataFrame url_select_df is distributed across the executors. When you then call map, the lambda expression is executed on the executors. Because that lambda calls createDF, which uses the SparkSession (and with it the SparkContext), you get the exception, as the SparkContext cannot be used on an executor.

It looks like you have already figured out the solution, which is to collect the DataFrame to the driver and apply the lambda expression there.
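For reference, here is a minimal sketch of that driver-side variant, reusing the createDF helper from the question; only the small url strings are collected to the driver, while createDF itself still runs distributed Spark jobs:

def loadData(batchDf, batchId):
    # Bring only the url column to the driver; each value is just a short string.
    urls = [row.url for row in batchDf.select("url").collect()]

    # createDF is called on the driver here, so using spark.read inside it is allowed.
    for url in urls:
        createDF(url)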

Just make sure that you do not overload your driver's memory when collecting.

  • Many thanks for the explanation @mike :) I based this solution on one of your answers: https://stackoverflow.com/questions/65777481/read-file-path-from-kafka-topic-and-then-read-file-and-write-to-deltalake-in-str – basigow Mar 03 '21 at 16:18