I'm working on a project that bulk loads data from a CSV file into HBase using Spark Streaming. The code I'm using is as follows (adapted from here):
    def bulk_load(rdd):
        conf = {...}  # removed for brevity
        keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
        valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
        load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                      .flatMap(csv_to_key_value)
        load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
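For context, csv_to_key_value is along these lines — a simplified sketch, not my exact code; the column family and qualifier names here are placeholders. It emits (rowkey, [rowkey, family, qualifier, value]) tuples, which is the shape StringListToPutConverter expects:

```python
# Sketch of csv_to_key_value: one (key, value-list) pair per CSV cell.
# The key is the row key; the value list is [row, family, qualifier, value],
# matching what StringListToPutConverter turns into an HBase Put.
def csv_to_key_value(line):
    fields = line.split(",")
    rowkey = fields[0]
    for i, value in enumerate(fields[1:]):
        yield (rowkey, [rowkey, "cf", "col%d" % i, value])
```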
Everything up to and including the two flatMaps works as expected. However, when trying to execute saveAsNewAPIHadoopDataset, I get the following runtime error:
java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
I have set PYTHONPATH to point to the jar containing this class (as well as my other converter class), but this does not seem to have improved the situation. Any advice would be greatly appreciated. Thanks in advance.
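Concretely, the setup I tried is roughly the following (the jar path is a placeholder, not my real path):

```shell
# Append the jar containing the converter classes to PYTHONPATH
# before submitting the streaming job.
export PYTHONPATH="$PYTHONPATH:/path/to/spark-examples.jar"
```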