
We have a Glue job that reads data from a table in Aurora DB. We are using the call below to read from Aurora:

    # hashexpression / hashpartitions split the JDBC read into parallel tasks
    df = self.glue_context.create_dynamic_frame.from_options(
        connection_type="custom.jdbc",
        connection_options={
            "className": self.jdbc_driver_name,
            "url": self.aurora_url,
            "user": self.db_username,
            "password": self.db_password,
            "query": query,
            "hashexpression": hash_expression,
            "hashpartitions": hash_partition,
        },
    )
      

We are trying to convert this to a Spark DataFrame in order to persist the data fetched from the table:

    from pyspark.sql.functions import col

    # Convert the DynamicFrame to a Spark DataFrame, project the needed columns, and cache it
    targetdf = df.toDF()
    targetdf = targetdf.select(
        col("col1").alias("col1"),
        col("col2").alias("col2"),
    ).repartition(int(partitions))
    targetdf.persist()

The data returned from Aurora DB is huge (a few million records) and is stored in the Glue DynamicFrame. When we try to convert the DynamicFrame to a Spark DataFrame, it throws a timeout error. It works for a small amount of data (~50k records). Can someone please suggest what is potentially going wrong, or whether there is a better way to implement this scenario?

Error:

    Traceback (most recent call last):
      File "/tmp/code.py", line 857, in <module>
        init()
      File "/tmp/code.py", line 853, in init
        main(args, glue_context, spark, current_path, bucket_name, file_path, fixed_path, i_output, final_output_path)
      File "/tmp/code.py", line 362, in main
        brk_df = fetch_from_aurora(args, glue_context, source_table, hash_expression_aurora, target_query)
      File "/tmp/code.py", line 274, in fetch_from_aurora
        df = intermtntdf.toDF()
      File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 148, in toDF
        return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
      File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o135.toDF.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times,
    most recent failure: Lost task 0.3 in stage 8.0 (TID 11, 10.156.19.74, executor 12):
    ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 648979 ms
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at

1 Answer


Most probably you are hitting an OOM (out of memory) condition on the executor, which is the most common cause of the heartbeat timeout error, since toDF() is an expensive operation.

From my point of view, you could try to:

  • break the Aurora fetches into smaller batches in some sort of loop (don't forget to unpersist the data after each iteration); see the sketch after this list;
  • increase the Glue worker size and its maximum DPU (I think 10 is the limit);
  • tweak the executor max memory; you can find some guidance in this question: AWS Glue executor memory limit
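
Not part of the original answer, but here is a minimal sketch of the first bullet, assuming the same class context as the question's snippet and a source table with a numeric key (called `id` here) whose bounds are known, so the fetch can be split into key ranges. The table name, column names, `min_id`, `max_id`, and the batch width are hypothetical placeholders:

    # Sketch: fetch the Aurora data in key-range batches instead of one huge query.
    # All identifiers below (source_table, id, col1, col2, min_id, max_id) are
    # hypothetical placeholders; adapt them to your schema and downstream logic.
    from pyspark.sql.functions import col

    batch_width = 500000  # width of the id range fetched per iteration; tune to executor memory

    for lower in range(min_id, max_id + 1, batch_width):
        upper = lower + batch_width
        batch_query = (
            f"SELECT col1, col2 FROM source_table "
            f"WHERE id >= {lower} AND id < {upper}"
        )

        dyf = self.glue_context.create_dynamic_frame.from_options(
            connection_type="custom.jdbc",
            connection_options={
                "className": self.jdbc_driver_name,
                "url": self.aurora_url,
                "user": self.db_username,
                "password": self.db_password,
                "query": batch_query,
                "hashexpression": hash_expression,
                "hashpartitions": hash_partition,
            },
        )

        batch_df = dyf.toDF().select(col("col1"), col("col2"))
        batch_df.persist()

        # ... process / write this batch (e.g. append it to S3) ...

        batch_df.unpersist()  # release executor memory before the next iteration

Whether the `hashexpression`/`hashpartitions` options compose cleanly with a range-filtered query depends on your custom JDBC connector, so treat this as a starting point rather than a drop-in replacement.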