1

I am trying a very basic hive query. I am trying to extract a json field from a dataset but I always get

\N

for the json field, however some_string comes okay

Here is my query :

WITH dataset AS (
SELECT
CAST(
   '{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }' AS STRING) AS some_string
)
SELECT some_string, get_json_object(dataset.some_string, '$.traceId') FROM dataset

Question : How can I get the json field here ?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Bhavya Arora
  • 768
  • 3
  • 16
  • 35
  • It works: http://demo.gethue.com/hue/editor?editor=342256 – leftjoin May 30 '19 at 19:06
  • sorry I think I over simplified it. additionalData in the json is giving the issue. It is a valid json though. Still it fails. I have updated the query to add addtionalData to the json in query – Bhavya Arora May 30 '19 at 21:49

1 Answers1

1

The problem is in backslash. Single backslashes treated as escape character for " and removed by Hive:

hive> select '\"';
OK
"
Time taken: 0.069 seconds, Fetched: 1 row(s)

When you have two backslashes, Hive removes one:

hive> select '\\"';
OK
\"
Time taken: 0.061 seconds, Fetched: 1 row(s)

With two backslashes it works fine:

WITH dataset AS (
  SELECT
  CAST(
     '{ "traceId": "abc", "additionalData": "{\\"Star Rating\\":\\"3\\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
       }' AS STRING) AS some_string
   )
   SELECT some_string,  get_json_object(dataset.some_string, '$.traceId') FROM dataset;
OK
{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }   abc
Time taken: 0.788 seconds, Fetched: 1 row(s)

You can also easily remove double-quotes before { and after } in additionalData:

WITH dataset AS (
SELECT
regexp_replace(regexp_replace(
   '{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }' ,'\\"\\{','\\{') ,'\\}\\"','\\}' )AS some_string
)
SELECT some_string,  get_json_object(dataset.some_string, '$.traceId') FROM dataset;

Returns:

OK
{ "traceId": "abc", "additionalData": {"Star Rating":"3"},  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }   abc
Time taken: 7.035 seconds, Fetched: 1 row(s)
leftjoin
  • 36,950
  • 8
  • 57
  • 116