3

I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.

[{"key":"value"}, {"key":"value"}]

I'm using the custom Classifier:

$[*] 

When trying to query from Spectrum, however, it returns:

Top level Ion/JSON structure must be an anonymous array if and only if serde property 'strip.outer.array' is set. Mismatch occured in file...

I set that serde property manually in the Glue catalog table, but nothing changed.

Is it no possible to query an anonymous array via Spectrum?

atlas_scoffed
  • 3,542
  • 30
  • 46
  • This is still a problem until this day. Have you found anything other than what you mentioned in the answer? – Hasan Jawad Sep 08 '20 at 10:57
  • @HasanJ All implementations I have done since this post we've opted to use a single record per line, rather than an [anonymous] array... So I don't know if there is a workable solution yet or not. – atlas_scoffed Sep 08 '20 at 22:09

5 Answers5

2

Naming the array in the JSON file like this:

"values":[{"key":"value"},...}

And updating the classifier:

$.values[*]

Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.

Update: In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:

{"key":"value"}
{"key":"value"}
etc.

It does seem to be a Spectrum specific issue, as Athena would still work.

Interested to know if anyone else was able to get it to work...

atlas_scoffed
  • 3,542
  • 30
  • 46
1

I've successfully done this, but without a data classifier. My JSON file looks like:

[
    {
        "col1": "data_from_col1",
        "col2": "data_from_col2",
        "col3": [
            {
                "col4": "data_from_col4",
                ...
            {
        ]
    },
    {
        "col1": "data_from_col1",
        "col2": "data_from_col2",
        "col3": [
            {
                "col4": "data_from_col4",
                ...
            {
        ]
    },
    ...
]

I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.

I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:

anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>
partition_0 string

I manually updated my schema to:

col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string

(And also add the serde param strip.outer.array.)

Then I had to rerun my crawler, and finally I could query in Spectrum like:

select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
MargaretT
  • 11
  • 2
  • @HassanJ I wonder if my answer would work for you and for @ comfytoday – MargaretT Oct 27 '20 at 14:28
  • That's an interesting solution. I've not seen the strip.out.array parameter. What are the first 2 columns in the schema? Can you put an example of the JSON in your answer, just so it's clear what your file looks like? Thanks for the answer! – atlas_scoffed Oct 29 '20 at 07:17
  • I added an example JSON. I already have the schema above - what the original crawler came up with and how I had to manually edit it to make it work with the strip.outer.array param. Let me know if something is still unclear. – MargaretT Nov 04 '20 at 12:25
  • I did the same thing but still used a classifier `$[*]` and it worked. The key point here, after manually editing your table make sure you run the crawler again. This let me query the data using redshift spectrum. – Hasan Jawad Mar 25 '21 at 16:11
  • This solution worked for me but my json was formatted like the OP. The only thing I needed to do was update the crawler config as described here and manually add `strip.outer.array` to the table properties manually and set it to `true`. – dev_etter Apr 05 '22 at 22:07
0

If your table's structure is as follows:

CREATE EXTERNAL TABLE spectrum.testjson(struct<id:varchar(25), 
columnName<array<struct<key:varchar(20),value:varchar(20)>>>);

you can use the following query to access the array element:

SELECT c.id, o.key, o.value FROM   spectrum.testjson c, c.columnName o;

For more information you can refer the AWS Documentation:

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
Ajay Pant
  • 9
  • 1
0

You can use json_extract_path_text for extracting the element or json_extract_array_element_text('json string', pos [, null_if_invalid ] ).

for example: for 2nd index element select json_extract_array_element_text('[111,112,113]', 2);

output: 113

Ajay Pant
  • 9
  • 1
0

The documentation on creating Spectrum tables explains this issue.

"This parameter supports the following SerDe property for JsonSerDe:

'strip.outer.array'='true'

Processes Ion/JSON files containing one very large array enclosed in outer brackets ( [ … ] ) as if it contains multiple JSON records within the array."

You can fix it in the console by navigating to the table that gets created by Glue, go to the Serde parameters section (under "Advanced properties"), and adding a new key-value pair with strip.outer.array as the key and true as the value.

juliska
  • 1
  • 1