1

I am creating a Data Lake and have some tables in Glue Catalog that I need to query in Athena. As a prerequisite, Athena requires us to store the query results in a S3 bucket. I have "Temp" and "Logs" S3 buckets. But since this is client sensitive data, I just want to check should I create a new Athena bucket for this or use the existing temp/logs bucket.

Note: I dont have any future use of the Athena queries.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Gunjan Khandelwal
  • 179
  • 1
  • 2
  • 13

2 Answers2

1

That's a good point you make -- the output of the Amazon Athena queries will appear in the output files, including sensitive data.

You could create a bucket that only permits Write access -- that is, put a Deny policy on it so that nobody can GetObject from the bucket. That way, Athena is happy to write its output, but people can't see the results.

You could also apply an Amazon S3 lifecycle policy that deletes the files after one day.

An alternate method would be to trigger an AWS Lambda function as soon as the object is created and have the Lambda function delete the object.

Either way, ask people to direct their Athena output to that bucket if they don't need to access the results, or if there is sensitive data being retrieved.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
  • Seems like I can not put a Deny Policy on the bucket for the GetObject. On doing so, I am not able to see the result of my queries in Athena as well. I have applied the Lifecycle policy instead. I have also attached a KMS key to the bucket, for that added security. – Gunjan Khandelwal Feb 12 '21 at 02:24
  • Ooh, KMS is a _great_ idea! Users without access to the KMS key will not be able to access the object. – John Rotenstein Feb 12 '21 at 03:13
0

I would also add that Athena also keeps a history that might contain sensitive data such as PII, should that appear in your query.

Assuming the following data, DDL, and queries:

Data:

breed_id, breen_name, category
1,pug,toy
2,German Shepard, working,
3,Scottish Terrier, Working

DDL:

CREATE EXTERNAL TABLE default.dogs (
  `breed_id` int, 
  `breed_name` string, 
  `category` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
LOCATION
  's3://stack-exchange/48836509'
TBLPROPERTIES ('skip.header.line.count'='1')

Queries:

SELECT * FROM default.dogs WHERE breed_name = 'pug'
SELECT * FROM default.dogs WHERE breed_name = 'German Shepard'

We can see these in the console:

enter image description here

Based on these documentation, I believe this history is stored for 45 days.

Zerodf
  • 2,208
  • 18
  • 26