I've got JSON data whose structure is similar to this:
root
|-- TX: struct (nullable = true)
| |-- ARGS: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- COOKIE: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- name: string (nullable = true)
| | | |-- value: string (nullable = true)
The way the data is actually formatted, the ARGS
array contains structs (a name-value pair just like the COOKIE
array), like this:
ARGS: {"name": "url", "value": "/index.html"}
However, in some rows of the file, the ARGS
array contains only a null value, which forces Hive's interpretation of it as a string:
ARGS: null
Since I know that all of the non-null values in ARGS
are structs, I'm using a CREATE
statement in Hive like this:
CREATE EXTERNAL TABLE ${db}.${table}(
tx struct<
args:array<struct<name:string,value:string>>,
cookie:array<struct<name:string,value:string>>
>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '${uri}';
Since the ultimate goal here is to lateralize the array of key value pairs, I'm using a query like this to test things out:
SELECT array_pairs.name, array_pairs.value
FROM ${db}.${table}
LATERAL VIEW EXPLODE(tx.args) EXPLODED_TABLE AS array_pairs;
This works fine if I just view the first 100 or so rows in Hue. But if I try and export it as a CSV or use a WHERE
clause that reads the entire table, Hive pops this error due to the data type mismatch:
errorMessage='java.lang.Error: Data is not JSONObject but java.lang.String with value alpha'
I know that the best answer is to have a perfect data source, but in this case, the data is what it is, and I've got to parse it. Any tips on how to deal with null values in a JSON with Hive parsing?
EDIT 10/24/2019 07:43
Thanks to @leftjoin I realized that apart from the null values, there are one or two JSON objects where the ARGS
field is populated by a single string within the array:
ARGS: ["string value"]
Given this, is there any way that these strings can be ignored by the Hive parser, so that the key-value pairs can be extracted?