I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.

I got this working with flat JSON (no nested objects or arrays), but it fails for a nested JSON array. What I have done:

The JSON structure:

{
    "class_id": "test0001",
    "students": [{
        "student_id": "xxxx",
        "student_name": "AAAABBBCCC",
        "student_gpa": 123
    }]
}

The Glue schema:

  1. class_id : string
  2. students : array ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>

I receive this error:

The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.

Any suggestions are appreciated.

Rob
franco phong

1 Answer

I ran into this because I created schemas manually in the AWS console. The problem is that the help text shown next to the form for entering nested data capitalizes everything, but Parquet can only work with lowercase definitions.

Despite the example given by AWS, write the type in lowercase:

array<struct<student_id:string,student_name:string,student_gpa:int>>
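
If you have several uppercase definitions to fix, the conversion can be scripted. The following is only a minimal sketch: `normalize_glue_type` is a hypothetical helper name, not part of any AWS SDK, and it assumes that every word followed by a colon is a field name (left untouched) and every other word is a type keyword (lowercased).

```python
import re

# Matches a whole word that is NOT followed by ':'.
# Assumption: inside a struct, "name:type" pairs put the field name before
# the colon, so any word without a trailing colon is a type keyword.
_TYPE_KEYWORD = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*:)")


def normalize_glue_type(type_str: str) -> str:
    """Lowercase the type keywords in a Hive-style type string."""
    return _TYPE_KEYWORD.sub(lambda m: m.group(0).lower(), type_str)


if __name__ == "__main__":
    upper = "ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>"
    print(normalize_glue_type(upper))
```

The resulting string can then be pasted into the console as the column type for `students`.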
ben