
I'm working on a project to bulk load data from a CSV file to HBase using Spark streaming. The code I'm using is as follows (adapted from here):

def bulk_load(rdd):
    # HBase output configuration (table name, ZooKeeper quorum, etc.)
    conf = {...}  # removed for brevity

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    # Split each batch into lines, then expand each line into
    # (key, [row, column family, qualifier, value]) tuples
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
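For reference, csv_to_key_value emits (key, [row, column family, qualifier, value]) tuples, since that is the list shape StringListToPutConverter turns into an HBase Put. A minimal sketch, assuming a made-up three-column CSV and a hypothetical column family named cf:

def csv_to_key_value(line):
    # Hypothetical CSV layout: rowkey,value1,value2
    rowkey, v1, v2 = line.split(",")
    # One tuple per cell: (key, [row, column family, qualifier, value])
    return [(rowkey, [rowkey, "cf", "c1", v1]),
            (rowkey, [rowkey, "cf", "c2", v2])]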

Everything up to and including the two flatMaps works as expected. However, when trying to execute saveAsNewAPIHadoopDataset I get the following runtime error:

java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter

I have set PYTHONPATH to point to the jar containing this class (as well as my other converter class), but this does not seem to have helped. Any advice would be greatly appreciated. Thanks in advance.


1 Answer


Took some digging, but here's the solution:

The jars did not need to be added to PYTHONPATH as I thought, but rather to the Spark config. I added the following two properties to the config (Custom spark-defaults under Ambari): spark.driver.extraClassPath and spark.executor.extraClassPath.

To each of these I added the following jars:

/usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-client-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-protocol-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/guava-12.0.1.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-server-1.1.2.2.3.2.0-2950.jar
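In spark-defaults.conf form, each property takes the jars as a single colon-separated classpath. Sketched with just the first two entries (the remaining jars from the list above are appended the same way):

spark.driver.extraClassPath   /usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar:...
spark.executor.extraClassPath /usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar:...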

Adding these jars has allowed Spark to find all the necessary classes.
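If the job is launched with spark-submit rather than through the Ambari-managed defaults, the same classpath can be supplied on the command line; a sketch (jar list truncated as above, script name hypothetical):

spark-submit \
  --driver-class-path "/usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:..." \
  --conf spark.executor.extraClassPath="/usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:..." \
  bulk_load_job.py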
