2

I am using AWS Athena to do some queries on AWS CloudTrail data object log entries.

The first few fields in a typical log entry look like this (pretty-printed for clarity):

{
  "Records": [
    {
      "eventVersion": "1.08",
      "userIdentity": {
        "type": "AWSAccount",
        "principalId": "",
        "accountId": "ANONYMOUS_PRINCIPAL"
      },
      "eventTime": "2021-03-23T14:04:38Z",
      "eventSource": "s3.amazonaws.com",
      "eventName": "GetObject",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "12.34.45.56",
      "userAgent": "[Amazon CloudFront]",
      "requestParameters": {
        "bucketName": "mybucket",
        "Host": "mybucket.s3.amazonaws.com",
        "key": "bin/some/path/to/a/file"
      },
      "responseElements": null,
...

The AWS CloudTrail console will create a standard table to query these entries. The table is defined as this:

CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs`(
  `eventversion` string COMMENT 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  `eventtime` string COMMENT 'from deserializer', 
  `eventsource` string COMMENT 'from deserializer', 
  `eventname` string COMMENT 'from deserializer', 
  `awsregion` string COMMENT 'from deserializer', 
  `sourceipaddress` string COMMENT 'from deserializer', 
  `useragent` string COMMENT 'from deserializer', 
  `errorcode` string COMMENT 'from deserializer', 
  `errormessage` string COMMENT 'from deserializer', 
  `requestparameters` string COMMENT 'from deserializer', 
  `responseelements` string COMMENT 'from deserializer', 
  `additionaleventdata` string COMMENT 'from deserializer', 
  `requestid` string COMMENT 'from deserializer', 
  `eventid` string COMMENT 'from deserializer', 
  `resources` array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer', 
  `eventtype` string COMMENT 'from deserializer', 
  `apiversion` string COMMENT 'from deserializer', 
  `readonly` string COMMENT 'from deserializer', 
  `recipientaccountid` string COMMENT 'from deserializer', 
  `serviceeventdetails` string COMMENT 'from deserializer', 
  `sharedeventid` string COMMENT 'from deserializer', 
  `vpcendpointid` string COMMENT 'from deserializer')
COMMENT 'CloudTrail table for adafruit-circuit-python-logs bucket'
ROW FORMAT SERDE 
  'com.amazon.emr.hive.serde.CloudTrailSerde' 
STORED AS INPUTFORMAT 
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/AWSLogs/12345678901234/CloudTrail'
TBLPROPERTIES (
  'classification'='cloudtrail', 
  'transient_lastDdlTime'='1616514617')

Note that useridentity is described as a struct, but requestParameters is a string. I would like to use the struct feature to preparse requestParameters, so I tried this:

CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs2`(
  `eventversion` string COMMENT 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  `eventtime` string COMMENT 'from deserializer', 
  `eventsource` string COMMENT 'from deserializer', 
  `eventname` string COMMENT 'from deserializer', 
  `awsregion` string COMMENT 'from deserializer', 
  `sourceipaddress` string COMMENT 'from deserializer', 
  `useragent` string COMMENT 'from deserializer', 
  `errorcode` string COMMENT 'from deserializer', 
  `errormessage` string COMMENT 'from deserializer', 
  `requestparameters` struct<`bucketName`:string, `Host`:string, `key`:string> COMMENT 'THIS IS NEW', 
...[rest same as above]

The table is created, but trying to do a simple query using it ("Preview Table") gives this error:

GENERIC_INTERNAL_ERROR: parent builder is null

What's wrong with my attempt to use struct on requestparameters? It seems no different in terms of JSON as to what's going on with useridentity.

Dan Halbert
  • 2,761
  • 3
  • 25
  • 28
  • Total guess here - but maybe the cloudtrail SerDe doesn't support other views of the structure. You could try doing a CTAS that would rewrite the records as JSON using the Json SerDe but using the same table schema. Finally, create a new table schema pointing at the same JSON files that views requestParameters as a struct. Btw, if requestparameters is coming through as a JSON-encoded string, you can use Presto's JSON functions such as json_extract to check props inside of it – kylejmcintyre Mar 27 '21 at 02:53
  • I did end up using `json_extract_scalar` to get the fields I need; that was simple and it's working fine. The mystery is why the SerDe is able to parse the `useridentity` JSON but not the `requestparameters` JSON. – Dan Halbert Mar 27 '21 at 12:15

1 Answers1

0

You should use the json Serializer/Deserializer instead:

...
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='<LIST OF COLUMNS>' )
...

See the docs: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html#hive-json-serde

Nicolas Busca
  • 1,100
  • 7
  • 14