I wish to fetch a map of query params from S3 access log using Athena.
E.g. for the following log line example:
283e.. foo [17/Jun/2017:23:00:49 +0000] 76.117.221.205 - 1D0.. REST.GET.OBJECT 1x1.gif "GET /foo.bar/1x1.gif?placement_tag_id=0&r=574&placement_hash=12345... HTTP/1.1" 200 ... "Mozilla/5.0"
I want to get a map queryParams of [k, v]:
placement_tag_id,0 r,574 placement_hash,12345
So I'll be able to run queries such as:
select * from accessLogs where queryParams.placement_tag_id=0 and X.r>=500
The query params count and content differ from one request to another so I can't use a static RegEx pattern.
I used serde2.RegexSerDe
on the following Athena create table query to make a basic split of the log, but didn't find a method to achieve what I want.
I thought of using MultiDelimitSerDe but it's not supported in Athena.
Any suggestion on how to achieve that?
CREATE EXTERNAL TABLE IF NOT EXISTS elb_db.accessLogs (
timestamp string,
request string,
http_status string,
user_agent string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '[^ ]* [^ ]* \\[(.*)\\] [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* "(.*?)" ([^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* ".*?" "(.*?)" [^ ]*'
) LOCATION 's3://output/bucket'