1

I wish to fetch a map of query params from S3 access log using Athena.

E.g. for the following log line example:

283e.. foo [17/Jun/2017:23:00:49 +0000] 76.117.221.205 - 1D0.. REST.GET.OBJECT 1x1.gif "GET /foo.bar/1x1.gif?placement_tag_id=0&r=574&placement_hash=12345... HTTP/1.1" 200 ... "Mozilla/5.0"

I want to get a map queryParams of [k, v]:

placement_tag_id,0 r,574 placement_hash,12345

So I'll be able to run queries such as:

select * from accessLogs where queryParams.placement_tag_id=0 and X.r>=500

The query params count and content differ from one request to another so I can't use a static RegEx pattern.

I used serde2.RegexSerDe on the following Athena create table query to make a basic split of the log, but didn't find a method to achieve what I want. I thought of using MultiDelimitSerDe but it's not supported in Athena.

Any suggestion on how to achieve that?

CREATE EXTERNAL TABLE IF NOT EXISTS elb_db.accessLogs ( timestamp string, request string, http_status string, user_agent string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '[^ ]* [^ ]* \\[(.*)\\] [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* "(.*?)" ([^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* ".*?" "(.*?)" [^ ]*' ) LOCATION 's3://output/bucket'

E_net4
  • 27,810
  • 13
  • 101
  • 139
daddycool
  • 11
  • 4

0 Answers0