
I am facing a problem with AWS Glue. The job reads two dataframes from hundreds of small Parquet files, using:

context.create_dynamic_frame_from_options(...)
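
Roughly, the two reads look like the sketch below (the GlueContext setup, bucket paths, and the "recurse" option are placeholders I have filled in for illustration, not the exact options in the job):

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    context = GlueContext(SparkContext.getOrCreate())

    # Each frame is read from hundreds of small Parquet files under an S3 prefix
    # (bucket names and prefixes here are placeholders)
    big = context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/big-input/"], "recurse": True},
        format="parquet",
    )
    small = context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/small-input/"], "recurse": True},
        format="parquet",
    )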

The read completes successfully and the data is cleaned by dropping null and duplicate values. The code then joins the two dataframes: one has around 38 million rows and the other around 7 thousand.
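
The cleaning step is essentially the following (done on the underlying Spark DataFrames; this is a simplified sketch, not the exact code):

    # Drop rows with nulls and exact duplicates before the join
    big_df = big.toDF().na.drop().dropDuplicates()
    small_df = small.toDF().na.drop().dropDuplicates()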

The inner join is executed and also completes. After this, a NullPointerException is sometimes thrown on the exact same data, either when calling count() on the dataframe or when writing to a single S3 location. I have tried changing the number of workers from 2 to 250, changing the worker type, and dropping certain columns, with no difference. Memory usage stays well under 50%.
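
The join, count, and write are essentially the following (the join key, output path, and frame name are placeholders; the real job passes them in via its own write_out helper):

    from awsglue.dynamicframe import DynamicFrame

    # Inner join on the shared key (column name is a placeholder)
    joined_df = big_df.join(small_df, on="uii_id", how="inner")
    uii = DynamicFrame.fromDF(joined_df, context, "uii")

    # Either of these intermittently fails with the NullPointerException below
    print(uii.count())
    context.write_dynamic_frame_from_options(
        frame=uii,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/uii/"},
        format="parquet",
    )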

Has anyone experienced this before?

2021-06-09 13:10:30,219 ERROR [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logError(70)): Task 102 in stage 16.0 failed 4 times; aborting job
2021-06-09 13:10:30,516 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/ap_sde_meta_inference_transform.py", line 365, in <module>
    main()
  File "/tmp/ap_sde_meta_inference_transform.py", line 356, in main
    write_out(context, uii, 'uii', options.uii_out_location, options.uii_out_format)
  File "/tmp/ap_sde_meta_inference_transform.py", line 171, in write_out
    format=destination_format)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 640, in from_options
    format_options, transformation_ctx)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 242, in write_dynamic_frame_from_options
    format, format_options, transformation_ctx)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 265, in write_from_options
    return sink.write(frame_or_dfc)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/data_sink.py", line 35, in write
    return self.writeFrame(dynamic_frame_or_dfc, info)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/data_sink.py", line 31, in writeFrame
    return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors")
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o710.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 102 in stage 16.0 failed 4 times, most recent failure: Lost task 102.3 in stage 16.0 (TID 40644, 172.34.174.102, executor 2): java.lang.NullPointerException
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:207)
    at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:498)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:105)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:418)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    ...