I have a S3 bucket with 500,000+ json
records, eg.
{
"userId": "00000000001",
"profile": {
"created": 1539469486,
"userId": "00000000001",
"primaryApplicant": {
"totalSavings": 65000,
"incomes": [
{ "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
]
}
}
}
I created a new table in Athena
CREATE EXTERNAL TABLE profiles (
userId string,
profile struct<
created:int,
userId:string,
primaryApplicant:struct<
totalSavings:int,
incomes:array<struct<amount:int,incomeType:string,frequency:string>>,
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'
I am interested in the incomeTypes
, eg. "SALARY"
, "PENSIONS"
, "OTHER"
, etc.. and ran this query changing jsonData.incometype
each time:
SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype='SALARY'
This worked fine with CROSS JOIN UNNEST
which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was that CROSS JOIN UNNEST
made all the field names lowercase, eg. a row looked like this:
{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more "SALARY"
entries, eg.
"incomes": [
{ "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate
incomeTypes
of"SALARY"
?Do I have to iterate over the array?
What should the result look like?