
I have a Spark master and worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I'm trying to submit a job from PySpark in a different container (on the same network) by running:

df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")

But I'm getting this error:

java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;

It makes no difference whether I run it interactively or with spark-submit. These are the packages loaded in Spark:

com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]

spark-submit --version output:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Branch 
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision 
Url 
Type --help for more information.

The Scala version is 2.11.8.

My pyspark command:

PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1

My spark-submit command:

spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1

I've read here that this can be caused by "an older version of avro being used", so I tried explicitly using 1.8.1, but I keep getting the same error. Reading Avro files works fine. Any help?

arinarmo
  • That was a mistake of mine: `script.py` should go after the `spark-submit` parameters, but it is not the cause of the error. The app is indeed being registered in the Spark Web UI. I already found the problem and a solution and will post it soon. Basically, Hadoop includes an Avro (1.7.4) library which can get used instead of the desired one if the classpath is not set properly. – arinarmo Apr 03 '17 at 21:15
  • Can you please post how you solved the problem in the end? I encounter the same issue. – hiddenbit Apr 21 '17 at 14:58
  • Just posted my solution – arinarmo Apr 25 '17 at 21:22

2 Answers


The cause of this error is that Apache Avro version 1.7.4 is included with Hadoop by default, and if the SPARK_DIST_CLASSPATH env variable includes the Hadoop common jars ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, that older version can get used instead of the version required by spark-avro (>= 1.7.6) and installed in ivy2.
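
To confirm that Hadoop is bundling the older jar, you can look in its common lib directory (the path below assumes a standard Hadoop 2.7 layout; adjust it if your install differs). On Hadoop 2.7 this should show avro-1.7.4.jar:

ls $HADOOP_HOME/share/hadoop/common/lib/avro-*.jar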

To check if this is the case, open a spark-shell and run

sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")

This should tell you the location of the class like so:

java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
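
Since the question uses PySpark, a rough equivalent check from a pyspark shell (done through the py4j gateway, so treat it as a sketch) is:

print(spark.sparkContext._jsc.getClass().getResource("/org/apache/avro/generic/GenericData.class").toString())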

If that class is pointing to $HADOOP_HOME/share/hadoop/common/lib/, then you simply need to list your ivy2 jars before the Hadoop common jars in the SPARK_DIST_CLASSPATH env variable.

For example, in a Dockerfile

ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"

Note: /home/root/.ivy2 is the default location for ivy2 jars; you can change it by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
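
For example, a minimal spark-defaults.conf entry (the directory here is just an illustration; use whatever path suits your image):

spark.jars.ivy /opt/ivy2

Spark will then download --packages jars under /opt/ivy2/jars, and that is the directory you would list first in SPARK_DIST_CLASSPATH.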

arinarmo

I have encountered a similar problem before. Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option with spark-submit.
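
For example (the jar path is just a placeholder; point it at wherever you have the spark-avro jar):

spark-submit --master spark://master:7077 --jars /path/to/spark-avro_2.11-3.2.0.jar script.py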

shants