We have a Glue job that reads data from a table in Aurora DB. We are using the call below to read from Aurora:
df = self.glue_context.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "className": self.jdbc_driver_name,
        "url": self.aurora_url,
        "user": self.db_username,
        "password": self.db_password,
        "query": query,
        "hashexpression": hash_expression,
        "hashpartitions": hash_partition,
    },
)
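For background on what hashexpression/hashpartitions is meant to achieve: Spark's JDBC source (which Glue builds on) turns a numeric column plus bounds into one range predicate per partition, so each executor fetches a slice of the table instead of a single task pulling everything. Below is a simplified, runnable sketch of that split, not Glue's or Spark's actual code; the column name and bounds are illustrative:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on `column` into num_partitions range
    predicates, roughly as Spark's JDBC source does with
    partitionColumn/lowerBound/upperBound/numPartitions."""
    stride = (upper - lower) // num_partitions or 1
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # first partition also picks up NULLs, as Spark does
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended to catch the upper tail
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds

# e.g. 4 partitions over ids 0..100 → 4 non-overlapping WHERE clauses
print(partition_predicates("id", 0, 100, 4))
```

If hash_expression does not spread rows evenly (or hash_partition is too small), one partition can end up holding most of the table, and that single task is the one that times out.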
We are then converting this to a Spark DataFrame in order to persist the data fetched from the table:
targetdf = df.toDF()
targetdf = targetdf.select(
    col("col1").alias("col1"),
    col("col2").alias("col2"),
).repartition(int(partitions))
targetdf.persist()
The data returned from Aurora DB is large (a few million rows) and is held in the Glue DynamicFrame. When we try to convert the DynamicFrame to a Spark DataFrame, it throws a timeout error. It works for a small amount of data (~50k records). Can someone please suggest what is potentially going wrong, or whether there is a better way to implement this scenario?
Error:
Traceback (most recent call last):
  File "/tmp/code.py", line 857, in <module>
    init()
  File "/tmp/code.py", line 853, in init
    main(args, glue_context, spark, current_path, bucket_name, file_path, fixed_path, i_output, final_output_path)
  File "/tmp/code.py", line 362, in main
    brk_df = fetch_from_aurora(args, glue_context, source_table, hash_expression_aurora, target_query)
  File "/tmp/code.py", line 274, in fetch_from_aurora
    df = intermtntdf.toDF()
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 148, in toDF
    return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o135.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 8.0 (TID 11, 10.156.19.74, executor 12):
ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 648979 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at