
I'm trying to process some events with Spark Structured Streaming.

The incoming events look like:

Event 1:

    url: http://first/path/to/read/from...

Event 2:

    url: http://second/path/to/read/from...

And so on.

My goal is to read each of these URLs and generate a new DataFrame from it. So far I've done it with code like this, using collect():


def createDF(url):

    file_url = "abfss://" + container + "@" + az_storage_account + ".dfs.core.windows.net/" + az_storage_folder + "/" + url

    # Read the data
    binary = spark.read.format("binaryFile").load(file_url)

    # Do other operations
    ...

    # Save the data: write it into blob again

    return something

def loadData(batchDf, batchId):

    """
    batchDf:
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
        |                body|partition|     offset|sequenceNumber|        enqueuedTime|publisher|partitionKey|          properties|systemProperties|                 url|
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
        |[{"topic":"/subsc...|        0|30084343744|         55489|2021-03-03 14:21:...|     null|        null|[aeg-event-type -...|              []|http://path...|
        +--------------------+---------+-----------+--------------+--------------------+---------+------------+--------------------+----------------+--------------------+
    """

    """ Before .... 

    df = batchDf.select("url")
    url = df.collect()

    [createDF(item) for item in url]
    """
    # Now without collect()
    # Select the url field of the df
    url_select_df = batchDf.select("url")

    # Read url value
    result = url_select_df.rdd.map(lambda x: createDF(x.url))
  
query = df \
    .writeStream \
    .foreachBatch(loadData) \
    .outputMode("update") \
    .queryName("test") \
    .start() \
    .awaitTermination()

However, when I try to extract the URLs without collect(), I get the following error message:

It appears that you are attempting to reference SparkContext from a broadcast.

What could be happening?

Thank you very much for your help.

  • You're calling `spark.read` inside `foreachBatch`. I think that is not allowed – mck Mar 03 '21 at 14:19
  • Thanks @mck. So that means it has worked with `collect()` because the SparkSession is called on the driver, right? – basigow Mar 03 '21 at 14:34

1 Answer


Without the call to collect, the DataFrame url_select_df is distributed across the executors. When you then call map, the lambda expression is executed on the executors. Because that lambda calls createDF, which uses the SparkSession (and with it the SparkContext), you get the exception, as the SparkContext cannot be used on an executor.

It looks like you have already figured out the solution, which is to collect the DataFrame to the driver and apply the lambda expression there.
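For reference, here is a minimal sketch of that driver-side variant, reusing the createDF helper from the question; only the small url strings are collected to the driver, while createDF itself still runs distributed Spark jobs:

def loadData(batchDf, batchId):
    # Bring only the url column to the driver; each value is just a short string.
    urls = [row.url for row in batchDf.select("url").collect()]

    # createDF is called on the driver here, so using spark.read inside it is allowed.
    for url in urls:
        createDF(url)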

Just make sure that you do not overload your driver's memory when collecting.

  • Many thanks for the explanation @mike :) I based this solution on one of your answers: https://stackoverflow.com/questions/65777481/read-file-path-from-kafka-topic-and-then-read-file-and-write-to-deltalake-in-str – basigow Mar 03 '21 at 16:18