
I'm using Spark version 3.3.1.5.2 in a Synapse notebook. I first read parquet data from an Azure storage account and apply some transformations. Finally, when I try to check the size of the PySpark DataFrame (`final_df`) by running `final_df.count()`, it throws the Py4JJavaError shown below:

----> 1 final_df.count()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py:804, in DataFrame.count(self)
    794 def count(self) -> int:
    795     """Returns the number of rows in this :class:`DataFrame`.
    796
    797     .. versionadded:: 1.3.0
    (...)
    802     2
    803     """
--> 804     return int(self._jdf.count())

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o4153.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 31.0 failed 4 times, most recent failure: Lost task 5.3 in stage 31.0 (TID 249) (vm-3ac85017 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:731)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:314)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:305)
    ... 18 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
    at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
    ... 23 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2682)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2618)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2617)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2617)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1190)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2870)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2812)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2801)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:731)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:314)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:135)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:305)
    ... 18 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
    at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
    ... 23 more

I've referred to this post, which had a similar problem, but the solution seems to be geared towards conda, which is not the environment I'm using (I'm on Synapse using an Azure Spark pool).

I wonder if setting a Spark configuration, something like `spark.conf.set("spark.sql.caseSensitive", "true")`, would help prevent the error.
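
For reference, this is roughly where such a setting would go in the notebook, i.e. on the active session before the read runs (just a sketch of the idea; whether this particular config has any effect on a Parquet decoding error is unverified):

```python
# Sketch: set the config on the Spark session that the Synapse notebook provides.
# Whether spark.sql.caseSensitive affects this decoding error is unverified.
spark.conf.set("spark.sql.caseSensitive", "true")

# The read and transformations would then be re-run so the setting applies to the scan.
```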

EDIT: 3/1/2023

Per @CRAFTY DBA's suggestion, I'm sharing the code that I'm using to read data:

account_name = f'adlsblablah{env}'
container_name = 'bronze'
relative_path = 'MongoDB/blah-blah-O-mongo/LDocument/parquet/*/*/*/'

adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/{relative_path}'

parquet_path = f'{adls_path}/*'
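
The `spark.read` call itself isn't shown above; a simplified sketch of how these paths are consumed is below (placeholder names, transformations omitted):

```python
# Sketch of the read over the wildcard path built above (simplified).
raw_df = spark.read.parquet(parquet_path)

# ... the transformations happen here in the real notebook ...
final_df = raw_df  # placeholder for the transformed DataFrame

final_df.count()   # the action that triggers the failing scan
```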

Although the parquet files are partitioned, the read goes through each date directory (via the wildcards) rather than individual files, so I don't think the error is caused by the issue @CRAFTY DBA mentioned in (3).

Also, per @Sharma's catch, the error seems to stem from files in the `.../2023/02/28/` directory, as shown in the error message: abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet. When I explicitly read a different directory (`.../2023/02/09`), both `.save()` and `.count()` ran successfully. I'm guessing that the partitioned file names in a directory need to follow the same naming pattern, differing only in the suffix that indicates each partition (00000, 00001, ..., 00009).
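
A sketch of how the failing directory can be isolated by counting each date folder separately (the date list and variable names below are just illustrative):

```python
# Sketch: count each daily folder on its own to see which one fails.
base = 'abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet'

for day in ['2023/02/09', '2023/02/22', '2023/02/28']:   # example dates only
    try:
        n = spark.read.parquet(f'{base}/{day}/').count()
        print(day, 'OK', n)
    except Exception as e:   # the Py4JJavaError surfaces as a plain Exception here
        print(day, 'FAILED:', type(e).__name__)
```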

Here is an image of the files in a directory that does NOT throw an error (top: bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/22/) and one that DOES throw an error (bottom: bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/). Notice that the top directory has files with the same name, differing only in the suffix that indicates each partition, while the bottom directory has files with completely different names that don't belong to the same partitioned write (not sure if that's the right term).

TOP (works fine): [image of the files in .../2023/02/22/]

BOTTOM (throws error): [image of the files in .../2023/02/28/]

EDIT: 3/2/2023

My hypothesis above, that the write succeeds only when the files in a directory have similar names, proved to be wrong: I was able to read files with different names in the same directory and write them out. I'm still confused about what this error means: `Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file `.
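
One thing that can still be checked is whether the schema Spark reads from the footer of the failing file matches the schema of a directory that works; schema inference only touches the footers, so `printSchema()` usually succeeds even when the full scan does not (a sketch, using the paths from the error message and the earlier edit):

```python
# Sketch: compare footer schemas of a working directory and the failing part file.
good_df = spark.read.parquet(
    'abfss://bronze@adlsblablah.dfs.core.windows.net/'
    'MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/09/'
)
bad_df = spark.read.parquet(
    'abfss://bronze@adlsblablah.dfs.core.windows.net/'
    'MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/'
    'part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet'
)

good_df.printSchema()
bad_df.printSchema()   # the ClassCastException hints at a column that is a plain string
                       # in this file but a struct/group in the expected schema
```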

user9532692
  • Just check for the keyword ```Caused by``` and you will get the error details, e.g. ```Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet``` – Sharma Mar 01 '23 at 16:27
  • check your malformed file. – Lamanus Mar 01 '23 at 23:42
  • @Lamanus Should I check how the schema of the file that threw the error differs from the other files? What else can I check about the malformed file compared to the others? – user9532692 Mar 03 '23 at 05:31

1 Answer


There is not enough information to figure this one out. Here are some suggestions.

1 - Download the file to your PC. Visual Studio Code has a plug-in for parquet. Can you load the file into the editor? If you cannot, then it is a file creation issue.

2 - Can you supply the code that is trying to perform the read?

3 - Love your folder naming convention. Are you sure this is not a partitioned file? If so, read the directory, not the file:

abfss://bronze@adlsblablah.dfs.core.windows.net/MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/part-00000-bf66db6f-a839-45d0-8119-db3724ebdf63-c000.snappy.parquet
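
For example, pointing the reader at the daily folder rather than the single part file would look roughly like this (a sketch using the path above):

```python
# Sketch: read the containing folder instead of one part file.
df = spark.read.parquet(
    'abfss://bronze@adlsblablah.dfs.core.windows.net/'
    'MongoDB/blah-blah-O-mongo/LDocument/parquet/2023/02/28/'
)
df.count()
```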

4 - One common mistake with Parquet is the use of the special characters " ,;{}()\n\t=" in column names. Usually this prevents the file from being created in the first place...
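
A quick way to check for that (a sketch; `final_df` as in the question) is to scan the column names for those characters:

```python
# Sketch: flag column names containing characters that Parquet/Spark reject.
bad_chars = set(' ,;{}()\n\t=')
suspect_cols = [c for c in final_df.columns if any(ch in bad_chars for ch in c)]
print(suspect_cols)   # should be an empty list if the column names are clean
```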

5 - Please post the code you are using and a screenshot of the parquet file from VS Code.

Please post updates so that we can continue the debugging effort.

CRAFTY DBA
  • I'm still confused why it raises an error for a certain file saying `Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file `. Should I check how the schema of that file differs from the other files? What else can I check about the malformed file compared to the others? – user9532692 Mar 03 '23 at 05:32