
I have a PySpark script in a Zeppelin notebook that I point at a JSON file sitting in Azure Blob Storage, in order to infer the JSON schema and create an external table in Hive.

I can take the SQL command printed by the script and execute it in a separate paragraph, and the table creates just fine. However, when I try to create the table through the sqlContext.sql() method, I get the error below:

AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found);'

Googling this error just brings up pages telling me to make sure the JAR file for the SerDe is on the server, which it is, obviously, since I can create this table manually. Below is my script:

%spark2.pyspark

import os
import datetime as dt
import time
from datetime import date
from pyspark.sql.functions import monotonically_increasing_id, lit
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql.functions import split, lower, unix_timestamp, from_unixtime

hiveDbName = 'dev_phoenix'
hiveTableName = 'et_engagement_cac'
serdeName = 'org.openx.data.jsonserde.JsonSerDe'
jsonFileLocation = 'wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement'

jsonDf = sqlContext.read.json("wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement/Engagement.json")

# jsonDf.printSchema()

extTableDDL = "create external table " + hiveDbName + "." + hiveTableName + "(\n"

# backtick-quote nested field names such as _id so the Hive DDL parser
# accepts them inside struct type definitions
for col in jsonDf.dtypes:
    extTableDDL += '`' + col[0] + '` ' + col[1].replace('_id','`_id`') + ',\n'

# drop the trailing ",\n" after the last column
extTableDDL = extTableDDL[:-2]
extTableDDL += ')\nrow format serde \'' + serdeName + '\'\n'
extTableDDL += 'location \'' + jsonFileLocation + '\'\n'
extTableDDL += 'tblproperties (\'serialization.null.format\'=\'\')'

print(extTableDDL)

sqlContext.sql(extTableDDL)

I intentionally obfuscated our WASB container name, hence the blah/meh.

I have found some posts that are making me start to think there are limitations on the types of tables you can create with sqlContext.sql(), and that perhaps what I am trying to do is not possible?

I was able to create the table successfully when I took the SerDe declaration out; however, Hive then used a default SerDe which won't work for the data I have in the underlying file.
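
For reference, the no-SerDe variant I tried looks roughly like the sketch below (the _noserde table name is just for illustration, and it reuses the variables from the script above). The create succeeds, but Hive falls back to its default LazySimpleSerDe, which reads each JSON line as plain delimited text instead of parsing it into columns:

%spark2.pyspark

# sketch only: same generated DDL as above, minus the SerDe clause
cols = ',\n'.join('`%s` %s' % (name, dtype.replace('_id','`_id`'))
                  for name, dtype in jsonDf.dtypes)
ddlNoSerde = ('create external table ' + hiveDbName + '.' + hiveTableName + '_noserde (\n'
              + cols + ')\n'
              + 'location \'' + jsonFileLocation + '\'')

# succeeds, but the default SerDe cannot parse the JSON into columns
sqlContext.sql(ddlNoSerde)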


1 Answer


OK, so I think I figured out what's happening and how to fix it. I suspect the JAR file for the SerDe I am trying to use is in a directory on the server which is not on the classpath.
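
One way to see this (a sketch; LIST JARS only shows jars that have been added to the Spark session, not the full JVM classpath, and it assumes your Spark version supports the command):

%spark2.pyspark

# before any 'add jar', nothing extra is registered with the session, which
# is consistent with the ClassNotFoundException above
spark.sql('list jars').show(truncate=False)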

So I made a first call to spark.sql(...) to add the JAR, and now it is working. See the updated script below:

%spark2.pyspark

import os
import datetime as dt
import time
from datetime import date
from pyspark.sql.functions import monotonically_increasing_id, lit
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql.functions import split, lower, unix_timestamp, from_unixtime

hiveDbName = 'dev_phoenix'
hiveTableName = 'et_engagement_cac'
serdeName = 'org.openx.data.jsonserde.JsonSerDe'
jsonFileLocation = 'wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement'

jsonDf = spark.read.json("wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement/Engagement.json")

# jsonDf.printSchema()

# register the SerDe jar with this Spark session so the DDL below can load the class
spark.sql('add jar /usr/hdp/current/hive-client/lib/json-serde-1.3.8-jar-with-dependencies.jar')

extTableDDL = "create external table " + hiveDbName + "." + hiveTableName + "(\n"

# as before, backtick-quote nested field names (_id, and now _class) so the
# Hive DDL parser accepts them inside struct type definitions
for col in jsonDf.dtypes:
    extTableDDL += '`' + col[0] + '` ' + col[1].replace('_id','`_id`').replace('_class','`_class`') + ',\n'

# drop the trailing ",\n" after the last column
extTableDDL = extTableDDL[:-2]
# spell out the SerDe plus the standard text input/output formats
extTableDDL += ')\nROW FORMAT SERDE\n'
extTableDDL += '   \'' + serdeName + '\'\n'
extTableDDL += 'STORED AS INPUTFORMAT\n'
extTableDDL += '   \'org.apache.hadoop.mapred.TextInputFormat\'\n'
extTableDDL += 'OUTPUTFORMAT\n'
extTableDDL += '   \'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\'\n'

extTableDDL += 'location \'' + jsonFileLocation + '\'\n'
extTableDDL += 'tblproperties (\'serialization.null.format\'=\'\')'

print(extTableDDL)

spark.sql(extTableDDL)
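
With the jar registered, a quick sanity check in the same session confirms the SerDe parses the JSON (a sketch; the columns you get back depend on whatever schema was inferred from your file):

%spark2.pyspark

# read a few rows back through the new external table
spark.sql('select * from ' + hiveDbName + '.' + hiveTableName + ' limit 5').show()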