I'm using Spark version 3.3.1.5.2 in a Synapse notebook. I first read Parquet data from an Azure storage account and apply some transformations. Finally, when I check the size of the resulting PySpark DataFrame (final_df) by running final_df.count(), it throws the Py4JJavaError shown below:
----> 1 final_df.count()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py:804, in DataFrame.count(self)
    794 def count(self) -> int:
    795     """Returns the number of rows in this :class:`DataFrame`.
    796
    797     .. versionadded:: 1.3.0
   (...)
    802     2
    803     """
--> 804     return int(self._jdf.count())

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o4153.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 31.0 failed 4 times, most recent failure: Lost task 5.3 in stage 31.0 (TID 249) (vm-3ac85017 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:731)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:314)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:305)
    ... 18 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
    at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
    ... 23 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2682)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2618)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2617)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2617)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1190)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2870)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2812)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2801)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:731)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:314)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:305)
    ... 18 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
    at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
    ... 23 more
I've referred to this post, which had a similar problem, but the solution seems geared towards conda, which is not the environment I'm using (I'm on Synapse using an Azure Spark pool).
I wonder if setting a Spark configuration, something like spark.conf.set("spark.sql.caseSensitive", "true"), would help prevent the error.
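For clarity, this is roughly where I would set it, right before the read (just a sketch; I haven't confirmed that case sensitivity has anything to do with the Parquet decoding failure):

spark.conf.set("spark.sql.caseSensitive", "true")  # set on the session before reading
final_df = spark.read.parquet(parquet_path)        # parquet_path is built as in the EDIT below
final_df.count()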
EDIT: 3/1/2023
Per @CRAFTY DBA's suggestion, I'm sharing the code that I'm using to read data:
account_name = f'adlsblablah{env}'
container_name = 'bronze'
relative_path = 'MongoDB/blah-blah-O-mongo/LDocument/parquet/*/*/*/'
adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/{relative_path}'
parquet_path = f'{adls_path}/*'
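The DataFrame is then created from that wildcard path, essentially like this (simplified; the downstream transformations are omitted):

final_df = spark.read.parquet(parquet_path)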
Although the Parquet files are partitioned, the read goes through each date directory rather than individual files, so I don't think the error is caused by the possibility @CRAFTY DBA mentioned in (3).
Also, per @Sharma's catch, the error seems to stem from files in the .../2023/02/28/ directory shown in the error message: abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. When I explicitly read a different directory (.../2023/02/09), both .save() and .count() ran successfully. My guess was that the partitioned files in a directory need to share the same base name, differing only in the partition suffix (00000, 00001, ..., 00009).
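For reference, this is roughly how I checked directories individually (the loop below is just for illustration, with one known-good and one known-bad date):

base_path = 'abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet'
for day_path in [f'{base_path}/2023/02/09/', f'{base_path}/2023/02/28/']:
    try:
        print(day_path, spark.read.parquet(day_path).count())  # 2023/02/09 counts fine
    except Exception as e:
        print(day_path, 'FAILED:', e)  # 2023/02/28 raises the Py4JJavaError above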
Here is an image of the files in the directory that does NOT throw an error (top = bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/22/) and the directory that DOES throw an error (bottom = bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/). Notice that the top directory has files with the same base name, differing only in the partition index, while the bottom directory has files with completely different names that don't belong to the same partitioned write (not sure if that's the right term).
TOP: [screenshot of the file listing in .../2023/02/22/]
BOTTOM: [screenshot of the file listing in .../2023/02/28/]
EDIT: 3/2/2023
My hypothesis above (that the write only succeeds when the files in a directory share similar names) proved to be wrong, as I was able to read files with different names in the same directory and write them out. I'm still confused about what this error means: Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file.
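From what I've read so far, the underlying ClassCastException ("Expected instance of group converter but got ... ParquetStringConverter") usually points at a schema mismatch between files, e.g. a column that is a plain string in some part files but a nested struct in others, so my next step is to compare per-file schemas. A rough diagnostic sketch (paths copied from the error message and from the working directory above):

good_dir = 'abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/22/'
bad_file = 'abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet'

print(spark.read.parquet(good_dir).schema.simpleString())  # schema the healthy data reports
print(spark.read.parquet(bad_file).schema.simpleString())  # schema of the exact file named in the error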