0

I'm working with SQL Presto in Athena and in a table I have a column named "data.input.additional_risk_data.basket" that has a json like this:

[
   {
      "data.input.additional_risk_data.basket.val.brand":null,
      "data.input.additional_risk_data.basket.val.category":null,
      "data.input.additional_risk_data.basket.val.item_reference":"26484651",
      "data.input.additional_risk_data.basket.val.name":"Nike Force 1",
      "data.input.additional_risk_data.basket.val.product_name":null,
      "data.input.additional_risk_data.basket.val.published_date":null,
      "data.input.additional_risk_data.basket.val.quantity":"1",
      "data.input.additional_risk_data.basket.val.size":null,
      "data.input.additional_risk_data.basket.val.subCategory":null,
      "data.input.additional_risk_data.basket.val.unit_price":769.0,
      "data.input.additional_risk_data.basket.val.upc":null,
      "data.input.additional_risk_data.basket.val.url":null
   }
]

I need to extract some of the data there, for example data.input.additional_risk_data.basket.val.item_reference. I'm not used to working with jsons but I tried a few things:

json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
json_extract_scalar("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference)

They all returned null. I'm wondering what is the correct way to get the values from that json Thank you!

Guru Stron
  • 102,774
  • 10
  • 95
  • 132
  • 1
    I'm not an athena expert, but it looks like `"data.input.additional_risk_data.basket.val.item_reference"` is the entire key. It's a single string, not a nested set of objects. Which is odd. I'm guessing it should be something like `json_extract(yourjsoncolumn, '$."data.input.additional_risk_data.basket.val.item_reference"')` – JNevill May 12 '22 at 16:28
  • Isn't that what I tried here? json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference') – Francisco May 12 '22 at 17:04
  • No. This: `'$.data.input.additional_risk_data.basket.val.item_reference'` is a jsonpath. It says, pull the item that is located at item_reference inside of val inside of basket inside of additional_risk_data inside of input inside of data inside the root of the json object/string. BUT what you want is the item at "data.input.additional_risk_data.basket.val.item_reference" at the root of the document. Your value you are after is one layer under root and referenced at that long string. It's a subtle difference, but important. – JNevill May 12 '22 at 19:51
  • What I'm not sure is how to specify that jsonpath and escape the period characters to say `$.` in Athena's `json_extract`. I would imagine the answer below is the right way to go about that. – JNevill May 12 '22 at 19:54
  • Ahh right, I get it now. Thanks JNevill. (These column names are terrible haha) – Francisco May 13 '22 at 12:19
  • They really are terrible. It's almost like someone had some deeply nested data and collapsed it, using the path as the `key` and then sent it to you. – JNevill May 13 '22 at 16:05

1 Answers1

0

There are multiple "problems" with your data and json path selector. Keys are not conventional (and I have not found a way to tell athena to escape them) and your json is actually an array of json objects. What you can do - cast data to an array and process it. For example:

-- sample data
WITH dataset (json_val) AS (
    VALUES  (json '[
   {
      "data.input.additional_risk_data.basket.val.brand":null,
      "data.input.additional_risk_data.basket.val.category":null,
      "data.input.additional_risk_data.basket.val.item_reference":"26484651",
      "data.input.additional_risk_data.basket.val.name":"Nike Force 1",
      "data.input.additional_risk_data.basket.val.product_name":null,
      "data.input.additional_risk_data.basket.val.published_date":null,
      "data.input.additional_risk_data.basket.val.quantity":"1",
      "data.input.additional_risk_data.basket.val.size":null,
      "data.input.additional_risk_data.basket.val.subCategory":null,
      "data.input.additional_risk_data.basket.val.unit_price":769.0,
      "data.input.additional_risk_data.basket.val.upc":null,
      "data.input.additional_risk_data.basket.val.url":null
   }
]')

) 

--query
select arr[1]['data.input.additional_risk_data.basket.val.item_reference'] item_reference -- or use unnest if there are actually more than 1 element in array expected
from(
    select cast(json_val as array(map(varchar, json))) arr
    from dataset
)

Output:

item_reference
"26484651"
Guru Stron
  • 102,774
  • 10
  • 95
  • 132