5

I am trying to figure out how to query where I am checking the value of usage given the following table creation:

CREATE EXTERNAL TABLE IF NOT EXISTS foo.test (
     `id` string,
     `foo` struct< usages:array< struct< usage:string,
     method_id:int,
     start_at:string,
     end_at:string,
     location:array<string> >>> 
) PARTITIONED BY (
         timestamp date 
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
         'serialization.format' = '1' ) LOCATION 's3://foo.bar/' TBLPROPERTIES ('has_encrypted_data'='false');

I would like to have a query like:

SELECT * FROM "foo"."test" WHERE foo.usages.usage is null;

When I do that I get:

SYNTAX_ERROR: line 1:53: Expression "foo"."usages" is not of type ROW

If I make my query where I directly index the array as seen in the following it works.

SELECT * FROM "foo"."test" WHERE foo.usages[1].usage is null;

My overall goal though is to query across all items in the usages array and find any row where at least one item in the usages array has a member usage that is null.

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
NSA
  • 5,689
  • 8
  • 37
  • 48

3 Answers3

8

Athena is based on Presto. In Presto 318 you can use any_match:

SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);

I think the function is not available in Athena yet, but you can emulate it using reduce.

SELECT * FROM "foo"."test"
WHERE reduce(
  foo.usages, -- array to reducing
  false, -- initial state
  (state, element) -> state OR element.usage IS NULL, -- combining function
  state -> state); -- output function (identity in this case)
Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
2

You can achieve this by unnesting the array into rows and then check those for null values. This will result in one row per null-value entry.

select * from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null

So if you only nee the unique set, you must run this through a select distinct.

select distinct id from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null
jens walter
  • 13,269
  • 2
  • 56
  • 54
  • 1
    My intuition suggests UNNEST+DISTINCT is not going to be the most performant option. Even once Athena gets the UNNEST optimization (https://github.com/prestosql/presto/pull/901), DISTINCT may still be costly for a bigger data set. – Piotr Findeisen Sep 13 '19 at 22:10
2

Another way to emulate any_match(<array>, <function>) is with cardinality(filter(<array>, <function>)) > 0.

SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);

Becomes:

SELECT * FROM "foo"."test"
WHERE cardinality(filter(foo.usages, element -> element.usage IS NULL)) > 0
Steven
  • 2,050
  • 23
  • 20