1

I've got JSON data whose structure is similar to this:

root
 |-- TX: struct (nullable = true)
 |    |-- ARGS: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- COOKIE: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)

The way the data is actually formatted, the ARGS array contains structs (a name-value pair just like the COOKIE array), like this:

ARGS: {"name": "url", "value": "/index.html"}

However, in some rows of the file, the ARGS array contains only a null value, which forces Hive's interpretation of it as a string:

ARGS: null

Since I know that all of the non-null values in ARGS are structs, I'm using a CREATE statement in Hive like this:

CREATE EXTERNAL TABLE ${db}.${table}(
    tx struct<
        args:array<struct<name:string,value:string>>,
        cookie:array<struct<name:string,value:string>>
    >
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '${uri}';

Since the ultimate goal here is to lateralize the array of key value pairs, I'm using a query like this to test things out:

SELECT array_pairs.name, array_pairs.value 
FROM ${db}.${table}
LATERAL VIEW EXPLODE(tx.args) EXPLODED_TABLE AS array_pairs;

This works fine if I just view the first 100 or so rows in Hue. But if I try and export it as a CSV or use a WHERE clause that reads the entire table, Hive pops this error due to the data type mismatch:

errorMessage='java.lang.Error: Data is not JSONObject but java.lang.String with value alpha'

I know that the best answer is to have a perfect data source, but in this case, the data is what it is, and I've got to parse it. Any tips on how to deal with null values in a JSON with Hive parsing?

EDIT 10/24/2019 07:43

Thanks to @leftjoin I realized that apart from the null values, there are one or two JSON objects where the ARGS field is populated by a single string within the array:

ARGS: ["string value"]

Given this, is there any way that these strings can be ignored by the Hive parser, so that the key-value pairs can be extracted?

nxl4
  • 714
  • 2
  • 8
  • 17

1 Answers1

0

Try to set SerDe property to ignore malformed JSON:

ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Unfortunately, that yields the same query error once the table structure is changed. – nxl4 Oct 24 '19 at 12:16