I am new to working with json data on hive. I am working on a spark application that gets json data and stores it into hive tables. I have a json like this:
which looks like this when expanded:
I am able to read the json into a dataframe and save it in a location on HDFS. But getting hive to be able to read the data is the tough part.
After I've searched online for example, I've tried to do this:
using the STRUCT
for all the json fields and then access the elements using column.element
.
For example:
web_app_security
will be the name of a column(of type STRUCT
) inside the table and the other jsons in it like config_web_cms_authentication, web_threat_intel_alert_external
will also be Structs(with rating
and rating_numeric
as the fields).
I tried creating the table with json serde. here is my table definition:
CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float, threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float, data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>, hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'
I've tried to parse the rows with the json serde. After I've saved some data to the table, I get the following error when I try to query it:
Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)
I am not sure if I am doing it the right way.
I am open to any other ways of storing the data into the table as well. Any help would be appreciated. Thank you.