I am getting the following error while running Spark on Hadoop. The same error appears when I use the Scala API. I was convinced that the error is related to the Spark paths and the CLASSPATH.

The error is:

stage 0.0 failed 4 times; aborting job 17/04/25 13:36:53 WARN TaskSetManager: Lost task 11.2 in stage 0.0 (TID 20, 10.98.92.150, executor 4): TaskKilled (killed intentionally) 17/04/25 13:36:53 WARN TaskSetManager: Lost task 13.1 in stage 0.0 (TID 21, 10.98.92.150, executor 4): TaskKilled (killed intentionally) 17/04/25 13:36:53 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, 10.98.92.150, executor 5): TaskKilled (killed intentionally) 17/04/25 13:36:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.98.92.150, executor 5): TaskKilled (killed intentionally) Traceback (most recent call last): File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/examples/src/main/python/test.py", line 8, in hive_context.sql("select count(1) from src_tmp").show() File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 318, in show File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.98.92.151, executor 2): java.lang.NoClassDefFoundError: Could not initialize class org.apache.parquet.CorruptStatistics at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:346) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816) at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:417) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:351) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.parquet.CorruptStatistics at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:346) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816) at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:417) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:351) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more

xyz

1 Answer

EDIT: The original answer was actually wrong. Sure, the "classpath first" property is useful in some cases, but not in this one. Thanks for the votes, but they are not deserved. :-(

java.lang.NoSuchMethodError: org.apache.parquet.SemanticVersion.(IIILjava/lang/String;Ljava/lang/String;Ljava/lang/String;)V
at org.apache.parquet.CorruptStatistics

You are using Spark 2.1.0, which bundles Parquet V1.8.1, which defines its classes in package org.apache.parquet.


Original answer (which missed the point)

Your Hadoop distro seems to be from Cloudera, and (for example) CDH 5.10 bundles Parquet V1.5.0, which defines its classes in package parquet.

unzip -l $SPARK_HOME/jars/parquet-column-1.8.1.jar | grep CorruptStatistics.class
     3507  07-17-2015 13:56   org/apache/parquet/CorruptStatistics.class

unzip -l /opt/cloudera/parcels/CDH/jars/parquet-common-1.5.0-cdh5.10.0.jar | grep SemanticVersion.class
     5406  01-20-2017 11:57   parquet/SemanticVersion.class

So these versions are clearly incompatible.

EDIT: They are so incompatible that they do not even interfere, simply because the packages are different.

When you run Spark executors under YARN, by default, the CLASSPATH contains the JARs from both versions of Parquet in random order, with catastrophic results.

Workaround: make sure your Spark JARs have precedence in the CLASSPATH with either

  • a command-line option on each execution
    --conf spark.yarn.user.classpath.first=true
  • or a global entry in spark-defaults.conf
    spark.yarn.user.classpath.first true
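
For instance, a sketch of a submit command (the script path is reused from your traceback; the master and any other options are placeholders for your own setup):

# Per job: ask YARN to put the Spark/user JARs ahead of the cluster-provided ones
spark-submit \
  --master yarn \
  --conf spark.yarn.user.classpath.first=true \
  /opt/apache/spark/spark-2.1.0-bin-hadoop2.7/examples/src/main/python/test.py

# Or once and for all, in $SPARK_HOME/conf/spark-defaults.conf:
# spark.yarn.user.classpath.first   true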


Better analysis with no real solution at this point (sorry)

The "NoSuchMethodError" complains that it could not find, at run-time, a method that was present at compile time.
That's for class SemanticVersion and a method with no name -- which is clearly wrong, even a constructor should be be marked with .<init> or sthg similar -- so I assume the error message got truncated, maybe because of < character being swallowed by S.O. message editor when you pasted it.

The method signature: arguments (int, int, int, String, String, String) and a return type of void. See that post for reference.

OK, let's assume class CorruptStatistics was compiled with a call to new SemanticVersion(1, 2, 3, "a", "b", "c") which was valid at compile time, but for some reason, when SemanticVersion was compiled, that constructor was not present (release mismatch?!).

That's insane, because the "official" source code (cf. the Apache Git repos for "parquet-column" and "parquet-common") shows no trace of such a constructor, never, ever. Actually, CorruptStatistics is a bug fix for compatibility with some buggy Parquet formats, and SemanticVersion has just two constructors, neither of which takes any String arguments.
Some "non-official" (but easier to read) source code for V1.8.1 can be found here and here.

Bottom line: all that makes no sense, unless

  • Spark 2.1.0 ships with Parquet JARs that are somehow inconsistent (and nobody found that bug yet!?!)
  • or you built a custom JAR, embedding Parquet classes that you have customised -- or a rogue fork of Parquet invoked in a rogue POM (??)
  • or you have deployed an exotic Cloudera parcel that places custom Parquet JARs in the CLASSPATH (but the "user classpath first" trick should have fixed that - unless you have these exotic JARs explicitly in spark.executor.extraClassPath)
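
Regarding that last point, a quick check (config paths are assumptions; adjust to your install) to see whether anything is explicitly injecting extra JARs into the executor classpath:

# Look for explicit extraClassPath entries in the usual config locations
grep -H 'extraClassPath' $SPARK_HOME/conf/spark-defaults.conf /etc/spark/conf/spark-defaults.conf 2>/dev/null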

To solve that mystery, I strongly suggest that you inspect all JARs that are likely to be present at run-time in the YARN CLASSPATH -- your custom JARs + Spark JARs + Cloudera CDH JARs + Cloudera extra parcel JARs -- searching for any occurrence of CorruptStatistics.class. You have the example unzip -l | grep command for that; wrap it in a loop (sketched below), and be ready for surprises.
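
A minimal sketch of such a scan, assuming the Spark and CDH locations shown above (append the directories containing your own JARs to the list):

# Scan every candidate JAR for the classes involved in the conflict
for jar in $SPARK_HOME/jars/*.jar /opt/cloudera/parcels/CDH/jars/*.jar; do
  if unzip -l "$jar" 2>/dev/null | grep -Eq 'CorruptStatistics\.class|SemanticVersion\.class'; then
    echo "== $jar"
    unzip -l "$jar" | grep -E 'CorruptStatistics\.class|SemanticVersion\.class'
  fi
done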

Samson Scharfrichter
  • Thank you for your response, Samson. I understand the problem now. But I tried adding the global entry and it is still giving me the same error. Anything else I should try? – xyz Apr 25 '17 at 19:14
  • Would it help if I was using a different version of Spark? – xyz Apr 26 '17 at 15:03
  • @xyz, I have no clue. If you don't find the root cause -- by getting back to the original error message *(which got truncated from your post)* and by scanning all possible JARs in your system to find duplicates of the classes involved (caller + called) -- then you can try to randomly change a few things, but I'm very pessimistic about the outcome. – Samson Scharfrichter Apr 27 '17 at 20:31