
I am new to working with JSON data in Hive. I am working on a Spark application that receives JSON data and stores it in Hive tables. I have a JSON like this:

[screenshot: JSON of JSONs]

which looks like this when expanded:

[screenshot: the expanded JSON hierarchy]

I am able to read the JSON into a DataFrame and save it to a location on HDFS. But getting Hive to read that data is the tough part.
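Roughly, the Spark side looks like this (a minimal sketch; the app name and input path are placeholders, and the output format is exactly the part I'm unsure about):

import org.apache.spark.sql.SparkSession

// Minimal sketch; app name and input path are placeholders.
val spark = SparkSession.builder()
  .appName("json-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Spark infers the nested struct-of-structs schema from the JSON.
val df = spark.read.json("hdfs://nameservice1/data/gis/raw/ratings")

// Save under the location the Hive table will point at.
df.write.mode("overwrite").orc("hdfs://nameservice1/data/gis/final/rr_current_analysis")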

After searching online for examples, I tried the following:

using a STRUCT for each of the JSON fields and then accessing the elements as `column.element`.

For example:

web_app_security will be the name of a column (of type STRUCT) in the table, and the JSONs nested inside it, like config_web_cms_authentication and web_threat_intel_alert_external, will also be STRUCTs (with rating and rating_numeric as their fields).

I tried creating the table with the JSON SerDe. Here is my table definition:

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float,  threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float,  data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>,  hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';
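
The idea is to then address nested elements by path, with a query along these lines (illustrative):

SELECT web_app_security.config_web_cms_authentication.rating,
       web_app_security.config_web_cms_authentication.rating_numeric
FROM jsons;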

I tried to have the JSON SerDe parse the rows. After saving some data to the table, I get the following error when I try to query it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

I am not sure if I am doing it the right way.

I am open to any other ways of storing the data into the table as well. Any help would be appreciated. Thank you.

– Hemanth Annavarapu

1 Answer


That's because you're mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'): the SerDe clause overrides ORC's default OrcSerde, but the input format (OrcInputFormat) and output format (OrcOutputFormat) stay ORC's. At query time, OrcInputFormat hands OrcStruct objects to JsonSerDe, which expects Text, hence the ClassCastException.

You either need to use ORC storage without overriding its default SerDe; in that case, make sure that your Spark application writes ORC files, not JSON.
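
A sketch of that option (the path is the one from your DDL):

// Option 1: drop the ROW FORMAT SERDE line from the DDL, keep STORED AS orc,
// and have Spark write ORC files to the table location.
df.write.mode("overwrite").orc("hdfs://nameservice1/data/gis/final/rr_current_analysis")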

Or, if you want the data to be stored as JSON, use the JsonSerDe together with plain text files as the storage (STORED AS TEXTFILE).
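
A sketch of that option, abbreviated to a single column (the remaining struct columns stay exactly as in your definition):

CREATE EXTERNAL TABLE jsons (
  web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
  -- ... other columns as in the question ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

In that case, your Spark application has to write newline-delimited JSON files (e.g. df.write.json(...)) to that location.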


The Hive Developer Guide explains how SerDes and storage formats work together: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

– Sergey Khudyakov
  • Thanks for the answer. I tried to save my dataframe as a text file using `df.rdd.saveAsTextFile("Path")` but I get this error: `org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory already exists`. I am not sure why it is trying to create a new directory for every dataframe instead of creating a new file in the given path. Is there a better way of saving the dataframe as a text file? Or is there a way I can save the data frame as CSV and give a proper table definition to read the CSV files and use the JSON SerDe? @Sergey Khudyakov – Hemanth Annavarapu Jul 17 '17 at 02:52
  • @HemanthAnnavarapu take a look at `df.write` instead, and especially `df.write.mode(SaveMode)`. I don't know why you are referring to CSV files now, but I strongly recommend that you first read the Hive Developer Guide (link in the answer) and the Spark DataFrame API documentation. The "better way" really depends on what you want to achieve, what kind of tables you want to have in Hive, etc. – Sergey Khudyakov Jul 17 '17 at 05:39
  • I am trying with CSV now because I've searched for saving a dataframe as text, and most of the examples refer to using `df.write.format("com.databricks.spark.csv").save("path")`. Since `df.saveAsTextFile` doesn't work, I'm trying CSV. Is there a way I can still use the `serde` on `csv` files? Or is it possible for me to override the `input format` by using the `orc` format? @Sergey Khudyakov – Hemanth Annavarapu Jul 17 '17 at 11:40
  • I was able to get it working by saving the `dataframe` as `json`, and not specifying `STORED AS` in the table definition (see the sketch below). – Hemanth Annavarapu Jul 17 '17 at 17:27
  • @HemanthAnnavarapu good that you solved it. Hive uses TEXTFILE storage by default. – Sergey Khudyakov Jul 17 '17 at 18:05
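
For later readers, a sketch of the combination that worked in the end: Spark writes newline-delimited JSON, and the table keeps the JsonSerDe with no STORED AS clause, so Hive falls back to its default TEXTFILE storage.

// Spark side: write newline-delimited JSON to the external table's location.
df.write.mode("overwrite").json("hdfs://nameservice1/data/gis/final/rr_current_analysis")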