5

I have a S3 bucket with 500,000+ json records, eg.

{
  "userId": "00000000001",
  "profile": {
    "created": 1539469486,
    "userId": "00000000001",
    "primaryApplicant": {
      "totalSavings": 65000,
      "incomes": [
        { "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
      ]
    }
  }
}

I created a new table in Athena

CREATE EXTERNAL TABLE profiles (  
  userId string,
  profile struct<
    created:int,
    userId:string,
    primaryApplicant:struct<
      totalSavings:int,
      incomes:array<struct<amount:int,incomeType:string,frequency:string>>,
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'

I am interested in the incomeTypes, eg. "SALARY", "PENSIONS", "OTHER", etc.. and ran this query changing jsonData.incometype each time:

SELECT jsonData
FROM "sampledb"."profiles"

CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)

WHERE jsonData.incometype='SALARY'

This worked fine with CROSS JOIN UNNEST which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase, eg. a row looked like this:

{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}

Now I have been asked how many users have two or more "SALARY" entries, eg.

      "incomes": [
        { "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
      ],

I'm not sure how to go about this.

  1. How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?

  2. Do I have to iterate over the array?

  3. What should the result look like?

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
tea
  • 568
  • 2
  • 10
  • 18

2 Answers2

5

UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straight forward:

SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1

This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.


Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.

Theo
  • 131,503
  • 21
  • 160
  • 205
  • Thanks for explaining and offering a great solution @Theo. This was a very fast and effective. – tea Mar 17 '19 at 22:32
4

You can combine filter with cardinality to filter array elements having incomeType = 'SALARY' more than once.

This can be further improve so that intermediate array is not materialized by using reduce (see examples in the docs; I'm not quoting them here, since they do not directly answer your question).

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82