
I keep getting

java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper

when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these worked. Any advice would be greatly appreciated. Below is example code:

scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
Pudge

2 Answers

The DataFrame object itself is created at the `val df = ...` line, but no data is read yet. Spark only starts reading and processing the data when you ask for some kind of output (such as `df.count()` or `df.show()`).
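The same lazy/eager split can be seen in plain Scala with iterators, which behave like Spark transformations and actions in miniature (a rough analogy only, no Spark involved):

```scala
object LazyDemo extends App {
  // Building the pipeline does no work yet -- Iterator.map is lazy,
  // just like a Spark transformation.
  val pipeline = Iterator(1, 2, 3).map { n =>
    println(s"processing $n") // only runs once the iterator is consumed
    n * 2
  }
  println("pipeline built, nothing processed yet")

  // Forcing the result is the "action" step, like df.show():
  // this is the point where processing (and any classpath failure) happens.
  val result = pipeline.toList
  println(result) // List(2, 4, 6)
}
```

This is why the schema shows up fine at `val df = ...` but the error only surfaces at `df.show`: the missing class is first needed when the data is actually read.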

So the underlying issue is that the avro-mapred package is missing from your classpath. Try launching your Spark shell like this:

spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1

The Spark Avro package marks the Avro Mapred package as provided, but for one reason or another it is not available on your system (or classpath).
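For context, this is what the "provided" pattern looks like in an sbt build definition (a sketch of the general mechanism, not spark-avro's actual build file): a `provided` dependency is on the compile classpath but is not bundled at runtime, so the runtime environment has to supply it itself, e.g. via `--packages` or `--jars`.

```scala
// sbt build sketch (assumed coordinates/versions, matching the answer above):
// "provided" deps are compiled against but NOT shipped with the application,
// so the cluster or the launch command must put them on the runtime classpath.
libraryDependencies ++= Seq(
  "org.apache.avro"  % "avro-mapred" % "1.7.7" % "provided",
  "com.databricks" %% "spark-avro"  % "2.0.1"
)
```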

Daniel Zolnai
  • It appears to just sit there and won't progress beyond: `org.apache.avro#avro-mapred added as a dependency` `com.databricks#spark-avro_2.10 added as a dependency` `:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0` `confs: [default]` – Pudge Jun 10 '16 at 23:01
  • That's weird. Can you paste the entire command you are using to launch spark-shell? – Daniel Zolnai Jun 11 '16 at 07:51
  • Apologies for the delayed response I had to switch tasks for a couple days. Here's the command I'm running `spark-shell --master yarn --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1`. Also, not sure if it makes a difference, but we are running CDH 5.6. Thanks again. – Pudge Jun 14 '16 at 17:54
  • This might seem weird, but try with the `--packages` as the first argument and `--master` as the second. Also did you give it a try using the local master? – Daniel Zolnai Jun 14 '16 at 18:30
  • After waiting a considerable amount of time I finally got an error: `Server access error at url https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7.pom (java.net.ConnectException: Connection timed out)`. I'm sure it has everything to do with our proxy. – Pudge Jun 14 '16 at 19:48
  • Some progress. Using `spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=myproxy.com -Dhttp.proxyPort=9000 -Dhttps.proxyHost=myproxy.com -Dhttps.proxyPort=9000" --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1 --master yarn` I'm able to get the dependencies, but now it's failing on `[download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip]` – Pudge Jun 14 '16 at 20:35
  • I've resolved all the dependency issues, but I still get the same `java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper`. If I run locally it works fine, but that sort of defeats the purpose as it's a large amount of data I'm trying to analyze. At this point I'm switching tactics and am going to deploy instead of using the shell. Thanks for your help. – Pudge Jun 14 '16 at 21:07

If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.

Pudge