1

I want to extract from a JSON which has more JSONs nested inside, all the elements whose title is 'title2'. I have the code working on MySQL but I can not translate it into Redshift.

JSON structure: {"master-title": [{"title": "a", "title2": "b"},{"title": "c", "title2: "d", "title3": "e"}], "master-title2": [{"title": "f", "title2": "g", "title3": "h"},{"title": "i", "title2": "j", "title3": "k"}]}

MySQL query (works as desired):

select id
,json_extract(myJSON, '$**.title2')),0)
from myTable

MySQL ouput: ["b", "d","g","j"]

My problem is that on Redshift I can only specifically define the path as: JSON_EXTRACT_PATH_TEXT(myJSON, 'master-title2',0,'title') So I can only get one element instead of all of them.

Any idea how to evaluate all paths and to get all elements in a JSON array which have the same "title2" using Redshift? (same output as in MySQL)

Thank you in advance.

DonWizu
  • 23
  • 1
  • 7

2 Answers2

1

Redshift has only a very rudimentary set to JSON manipulation functions (basically JSON_EXTRACT_PATH_TEXT and JSON_EXTRACT_ARRAY_ELEMENT_TEXT). It's not enough to deal with schemaless JSON.

Python UDF

If Redshift was my only mean of processing data I would give python UDF a try. You can code a function in imperative python. Then, having a column holding your json object and just call that function on all elements to do custom extraction.

Unnesting JSON arrays

Other options would be to really try to understand the schema and implement it using the two JSON funtions mentioned before (This SO answer will give you an idea on how to explode/unnest a JSON array in Redshift). Provided your JSON is not arbitrarily nested, but follows some patterns, this could work.

Regex (better don't)

Another desperate approach would be to try to extract your data with regex - could work for simple cases, but it's an easy way to shoot yourself in the foot.

Community
  • 1
  • 1
botchniaque
  • 4,698
  • 3
  • 35
  • 63
1

Thanks for your answer.

I finally found a solution using Python. I hope it may help some others.

count=[x.count("title2") for x in df['myJSON'].tolist()]

DonWizu
  • 23
  • 1
  • 7