
I am working with Spark Structured Streaming, reading around 10M records from a Kafka topic, transforming them, and saving them to MySQL. I am facing "java.lang.OutOfMemoryError: GC overhead limit exceeded" with Spark, and I want to limit the number of records Spark Streaming reads in one go so that the out-of-memory error does not keep coming back. The code I am executing is as follows:

PySpark Configurations:

conf = SparkConf().setAppName("KafkaStreamingDataPipeline")\
        .setMaster("local[*]") \
        .set("spark.driver.cores","8")\
        .set("spark.driver.maxResultSize","8g")\
        .set("spark.driver.memory","15g")\
        .set("spark.executor.memory","8g")\
        .set("spark.executor.instances", "8")\
        .set("spark.executor.cores", "8")


spark = SparkSession.builder \
        .master("local[*]") \
        .appName("Kafka Pyspark Streaming Learning") \
        .config(conf=conf) \
        .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

The readStream code is as follows:

  orders_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
        .option("subscribe", KAFKA_TOPIC_NAME) \
        .option("startingOffsets", "earliest") \
        .option("failOnDataLoss", "false") \
        .load()

And lastly, the writeStream code is as follows:

trans_detail_write_stream = dfTrans2.writeStream \
        .trigger(processingTime='25 seconds') \
        .outputMode("append") \
        .foreachBatch(saveToMysql) \
        .start()

The saveToMysql() function:

def saveToMysql(currentDf, epocId):

    dbCredentials = {"user": mysql_user_name,
                     "password":mysql_password,
                     "driver":mysql_driver_class}

    print("Printing Epoc Id")
    print(epocId)

    processedAt = time.strftime("%Y-%m-%d %H:%M:%S")

    currentDfFinal = currentDf\
        .withColumn("processed_at",lit(processedAt))\
        .withColumn("batch_id", lit(epocId))

    print("Printing Before MySql Table Save: " + str(epocId))

    currentDfFinal \
        .write \
        .jdbc(url=mysqlJdbcUrl,
              table = mysql_table_name,
              mode = "append",
              properties = dbCredentials)

    print("Printing After MySql table Save " + str(epocId))

Can anyone assist me with how to read a limited amount of data per stream in PySpark so that it won't go out of memory?

  • How much memory does your computer actually have? Add `spark.driver.memory + (spark.executor.instances * spark.executor.memory)`, and you will need a total of at least 80G of memory just to run this app (and even more to run Kafka, the Spark History Server, etc.)... Spark natively consumes Kafka in small batches (so you never get all 10M at once), so you do not need excessive memory settings configured here. What happens when you reduce those values? Your driver process especially should not need more than `512m` – OneCricketeer Jun 27 '23 at 16:19
  • My computer has 64 GB of RAM. Can you please elaborate on exactly what numbers I should put in `spark.driver.memory + (spark.executor.instances * spark.executor.memory)` if I have this much memory? Secondly, I am transforming the data coming from Kafka and joining it with other DB tables as well, so does that mean I will need more than 100 GB of memory? – Muhammad Affan Jul 03 '23 at 06:27
  • Like I said, your current values require 80G of memory, so no wonder you're getting OOM. Just do the math (see the sketch below) and make sure the total stays under what your system has – OneCricketeer Jul 05 '23 at 01:25
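
To make the comment's arithmetic explicit: 15g for the driver plus 8 executors × 8g comes to 79g, roughly the 80G quoted, which already exceeds the machine's 64 GB of RAM before Kafka or anything else runs. Below is a minimal sketch of a trimmed-down configuration along the lines of that advice; the specific values are illustrative assumptions, not tested recommendations, and in local[*] mode the executor settings are effectively ignored anyway because everything runs inside the driver JVM.

# Sketch only: conservative settings for a single 64 GB machine, per the comments above.
# In local[*] mode there are no separate executor JVMs, so only driver memory matters;
# the values here are assumptions that illustrate the idea, not tuned recommendations.
conf = SparkConf().setAppName("KafkaStreamingDataPipeline") \
        .setMaster("local[*]") \
        .set("spark.driver.memory", "4g") \
        .set("spark.driver.maxResultSize", "2g")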

1 Answer


You could try repartitioning your DataFrame; keep increasing the number of partitions until you stop getting OOM errors.

dfTrans2.repartition(100).writeStream...
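
Dropped into the writeStream code from the question, that would look roughly like the sketch below; the partition count of 100 is just a starting value to tune upward if the errors persist.

# Sketch only: the question's writeStream chain with the suggested repartition added.
# 100 partitions is an arbitrary starting point; increase it if OOM errors continue.
trans_detail_write_stream = dfTrans2 \
        .repartition(100) \
        .writeStream \
        .trigger(processingTime='25 seconds') \
        .outputMode("append") \
        .foreachBatch(saveToMysql) \
        .start()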