
My client is in need of an AWS spring cleaning!

Before we can terminate EC2 instances, we need to find out who provisioned them and ask whether they are still using the instances before we delete them. AWS doesn't seem to provide an out-of-the-box feature for reporting who the 'owner'/'provisioner' of an EC2 instance is, so, as I understand it, I need to parse through gobs of archived, gzipped CloudTrail log files residing in S3.

Problem is, their automation is making use of STS AssumeRole to provision instances. This means the RunInstances event in the logs doesn't trace back to an actual user (correct me if I'm wrong; I really hope I am wrong).

An AWS blog post tells the story of a fictional character, Alice, and her steps tracing a TerminateInstances event back to a user, which involves two log events: the TerminateInstances event itself and an AssumeRole event "somewhere around the time" of it that contains the actual user details. Is there a pragmatic approach one can take to correlate these two events?

Here's my POC that parses a CloudTrail log file from S3:

import boto3
import gzip
import json

# Download one archived CloudTrail log file from S3 and scan it for RunInstances events.
boto3.setup_default_session(profile_name=<your_profile_name>)
s3 = boto3.resource('s3')
s3.Bucket(<your_bucket_name>).download_file(<S3_path>, "test.json.gz")

with gzip.open('test.json.gz', 'rt') as fin:
    json_data = json.load(fin)
    for record in json_data['Records']:
        if record['eventName'] == "RunInstances":
            # userName is absent when the caller was an assumed-role session,
            # so fall back to a placeholder instead of raising a KeyError.
            user = record['userIdentity'].get('userName', 'n/a')
            principal_id = record['userIdentity']['principalId']
            for instance in record['responseElements']['instancesSet']['items']:
                print("instance id: " + instance['instanceId'])
                print("user name: " + user)
                print("principal id: " + principal_id)

However, the details are generic, since these roles are shared by many groups. How can I find, in a script, the details of the user before they assumed the role?

UPDATE: Did some research, and it looks like I can correlate the RunInstances event to an AssumeRole event by a shared 'accessKeyId', and that should show me the identity before it assumed the role. Tricky, though: not all RunInstances events contain this accessKeyId, for example if 'invokedBy' was an autoscaling event.
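
To illustrate what I mean, here's a rough sketch of that correlation (it assumes the matching AssumeRole event is somewhere in the same set of downloaded log files; the function names and the fallback label are just made up for this sketch):

import gzip
import json

def load_records(paths):
    # Collect CloudTrail records from a list of already-downloaded .json.gz log files.
    records = []
    for path in paths:
        with gzip.open(path, 'rt') as fin:
            records.extend(json.load(fin)['Records'])
    return records

def correlate_run_instances(records):
    # Index AssumeRole events by the temporary access key they handed out.
    assume_role_by_key = {}
    for rec in records:
        if rec['eventName'] == 'AssumeRole':
            creds = (rec.get('responseElements') or {}).get('credentials') or {}
            if 'accessKeyId' in creds:
                assume_role_by_key[creds['accessKeyId']] = rec

    # Match each RunInstances event to the AssumeRole call that issued its credentials.
    for rec in records:
        if rec['eventName'] != 'RunInstances':
            continue
        source = assume_role_by_key.get(rec['userIdentity'].get('accessKeyId'))
        # The AssumeRole event's userIdentity is the caller *before* the role was assumed.
        caller = source['userIdentity'].get('arn', 'unknown') if source else 'no match (e.g. autoscaling)'
        for item in rec['responseElements']['instancesSet']['items']:
            print(item['instanceId'], caller)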

buildmaestro
  • I think you have a gap in available information. How *exactly* does AssumeRole happen? Is Mallory (fictitious nemesis of Alice and Bob) running an automation script on her workstation, with her AWS credentials used to call AssumeRole? Or is she ssh-ing into a machine on EC2 that uses an IAM role, so it's not a *person* assuming the role at all? (With IAM instance roles, it's the EC2 infrastructure that calls AssumeRole and then makes the resulting temporary credentials available to the instance.) – Michael - sqlbot Jun 03 '17 at 04:37

1 Answer


Direct answer:

For the solution you are proposing, you are unfortunately out of luck. Take a look at http://docs.aws.amazon.com/IAM/latest/UserGuide/cloudtrail-integration.html#w28aac22b9b4b7b3b1. The fourth row of the table there says that, after AssumeRole, all subsequent calls are logged with the role identity only.

I'd contact AWS Support to confirm this, as I might very well be mistaken.

What I would do in your case:

First, wait a couple of days in case someone has a better idea, or in case I am mistaken and AWS Support answers with an out-of-the-box solution. Then:

  1. Create an AWS Config rule that deletes all instances carrying a certain tag. Then tell your developers to tag every instance they are sure should be deleted, and those will get cleaned up.
  2. Tag all the production instances, and the development instances that are still needed, with a tag of their own.
  3. Run a script that tags all of the remaining untagged instances with a separate tag (see the sketch after this list). Double and triple check these instances.
  4. Back up and turn off the instances tagged in step 3 (without deleting the instances).
  5. If someone complains about something not being on, that means they missed an instance in step 1 or 2. Tag that instance correctly and turn it on again.
  6. After a while (a week or so), delete the instances that are still stopped (but keep the backups).
  7. After a couple of months, delete the backups that were not restored.
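
As a rough illustration of steps 3 and 4, something along these lines could do the tagging, backup and stop (the tag key/value are just examples I made up, and it assumes boto3 with credentials that are allowed to tag, image and stop instances):

import boto3

ec2 = boto3.resource('ec2')  # assumes your profile/credentials are already configured

# Step 3: mark every instance that has no tags at all for review.
for instance in ec2.instances.all():
    if not instance.tags:
        instance.create_tags(Tags=[{'Key': 'cleanup-status', 'Value': 'untagged-review'}])

# Step 4: back up and stop (not terminate) the instances marked for review.
marked = ec2.instances.filter(
    Filters=[{'Name': 'tag:cleanup-status', 'Values': ['untagged-review']}])
for instance in marked:
    # Create an AMI as the backup; NoReboot=True avoids restarting a running instance.
    instance.create_image(Name='cleanup-backup-' + instance.id, NoReboot=True)
    instance.stop()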

Note that this isn't foolproof, as it carries the possibility of human error and downtime, so double and triple check everything, make a clone of the environment and test on that (if you already have a development environment with the same configuration, that's the best scenario), take it slow so you can monitor everything, and keep backups of everything.

Good luck, and please tell me what your solution ended up being.

General guidelines for the future:

Note: The following points are very opinionated; they are general rules that I abide by because they save me a load of trouble from time to time. Read them, dismiss what you find unfit for you, and take the things you find reasonable.

  1. Don't use AssumeRole that often, as it obfuscates user access. If a script runs on a developer's PC, let it run under their own username; if it runs on a server, keep it on the role the server was created with. There's less to manage that way: you cut out the middle-man (the assume-role) and no longer need to create extra roles, you just assign the permissions to the correct group/user. See below for the cases where I'd consider assume-role a necessity.
  2. Automate deletions. The first thing you should build is automation that keeps the AWS account as clean as possible, as this saves both $$$ and debugging pain. Tags, and scripts that act on those tags, are very powerful tools. So if a developer needs an instance for a day to try out something new, they can add a tag that times the instance out, and a script cleans it up when the time comes (see the sketch below). These are project-specific, and not everyone needs all of them, so assess what your project needs and act on that.
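
For example, a small scheduled script could read a hypothetical 'delete-after' tag and terminate instances whose date has passed (the tag name and date format here are just my own convention):

import boto3
from datetime import datetime, timezone

ec2 = boto3.resource('ec2')
now = datetime.now(timezone.utc)

# Terminate any instance whose 'delete-after' tag (e.g. '2017-06-10') is in the past.
for instance in ec2.instances.all():
    for tag in (instance.tags or []):
        if tag['Key'] == 'delete-after':
            expiry = datetime.fromisoformat(tag['Value']).replace(tzinfo=timezone.utc)
            if expiry < now:
                print('terminating ' + instance.id)
                instance.terminate()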

What I'd recommend is giving permissions to the users themselves in the development environment, as it makes it easier to trace things back to their root and to find the most knowledgeable person to fix them. As for the production environment, everything should be automated anyway (creation when needed, deletion when no longer needed), and no one should have any write access to that account, ever.

As for assume-role, I only use it when I want to give read-only access to production logs in another account. Another case is something that really shouldn't happen that often, if at all, but that some users still need access to. So, as an extra layer of protection against the 'I did it by mistake', I make them switch roles to do it, and I never have a script that automatically switches roles and performs the action, to keep it as deliberate as possible (think deleting a database and such). Another example is accessing sensitive information (a credit-card database, etc.). Many more scenarios can occur, and here it comes down to your judgement.

Again, good luck.

Ali Al Amine
  • just wanted to thank you for the write-up on the workflow in case the pragmatic approach doesn't work. It's best not to end up in this situation at all by tagging things in the first place, as per the points you mentioned. Thank you, I will be including this in my write-up for management. – buildmaestro Jun 03 '17 at 18:08