
I have written a Glue job which exports a DynamoDB table and stores it on S3 in CSV format. The Glue job and the table are in the same AWS account, but the S3 bucket is in a different AWS account. I have been able to access the cross-account S3 bucket from the Glue job by attaching the following bucket policy to it.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "tempS3Access",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<ROLE-PATH>"
            },
            "Action": [
                "s3:Get*",
                "s3:Put*",
                "s3:List*",
                "s3:DeleteObject*"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET-NAME>",
                "arn:aws:s3:::<BUCKET-NAME>/*"
            ]
        }
    ]
}
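
(Side note: for cross-account access, the Glue job's role in its own account also needs an identity policy allowing the same S3 actions. A minimal sketch, with the bucket name as a placeholder:)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Put*",
                "s3:List*",
                "s3:DeleteObject*"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET-NAME>",
                "arn:aws:s3:::<BUCKET-NAME>/*"
            ]
        }
    ]
}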

Now I also want to read a DynamoDB table from another AWS account. Is it possible to access a cross-account DynamoDB table using a Glue crawler? What do I need to achieve this?

Thanks

Ashy Ashcsi

1 Answer


Short answer: you can't. The crawler can only crawl DynamoDB tables in your own account.

Looong answer:
You can use my workaround.

  1. Create a role with a trust policy in account A. The one you have made will do (a trust-policy sketch is included at the end of this answer).
  2. In account B, create a Glue job. Import boto3 and assume the role from the first account. Then, using the DynamoDB resource, you can scan the table. Check out my code:
import boto3

# Assume the role in account A that trusts this Glue job's role
sts_client = boto3.client('sts', region_name='your-region')
assumed_role_object = sts_client.assume_role(
    RoleArn="arn:aws:iam::accountAid:role/the-role-you-created",
    RoleSessionName="AssumeRoleSession1"
)
credentials = assumed_role_object['Credentials']

# Build a DynamoDB resource using the temporary credentials from account A
dynamodb_resource = boto3.resource(
    'dynamodb',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
    region_name='your-region'
)

table = dynamodb_resource.Table('table-to-crawl')

# A single scan call returns up to 1 MB of items (see the comments below)
response = table.scan()

data = response['Items']

Now, with this 'data', which holds the items returned by the scan, you can do a bunch of things. You can create a DynamicFrame if you wish to manipulate the data in some way:

dataF = glueContext.create_dynamic_frame.from_rdd(spark.sparkContext.parallelize(data), 'data')

Or a DataFrame, if that's what you need.
I hope this helps. If you have any questions feel free to ask.
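
P.S. Going back to step 1: the role in account A needs a trust policy that lets the Glue job's role in account B assume it. A minimal sketch of what that would typically look like (the account ID and role name are placeholders):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<ACCOUNT-B-ID>:role/<GLUE-JOB-ROLE>"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}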

CodeDoge
  • The table.scan() only scans 1MB of the table. How do I scan the entire table? – Lisa Mathew Mar 23 '20 at 23:59
  • There are a couple of things you can do. One is using the 'LastEvaluatedKey' that a large scan returns: keep calling table.scan(ExclusiveStartKey=response['LastEvaluatedKey']) in a loop until no LastEvaluatedKey comes back (see the sketch after these comments). Or you can use the boto3 paginator. I suggest you pick one of the methods in here: https://stackoverflow.com/questions/36780856/complete-scan-of-dynamodb-with-boto3 – CodeDoge Mar 24 '20 at 12:36
  • Thank you so much. I figured out a way to scan the entire table and then convert it to a Glue DynamicFrame – Lisa Mathew Mar 24 '20 at 21:44
  • Glad I could help. – CodeDoge Mar 25 '20 at 09:45
  • I have an issue. When I try to use the list (which is a full scan of my table) I get the below error in Glue: "Command failed with exit code 1 - Yarn resource manager killed the Spark application, please refer to Spark driver logs/metrics for the diagnosis." Any idea why this happens? – Lisa Mathew Mar 25 '20 at 19:13
  • I haven't encountered this error before, but from what I can gather it has something to do with not enough memory for the operation. Maybe something in here: https://forums.aws.amazon.com/thread.jspa?threadID=295704 would be helpful? – CodeDoge Mar 26 '20 at 06:50
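
For reference, a minimal sketch of the full-scan approach discussed in the comments above, using LastEvaluatedKey pagination (the table name and region are placeholders):

import boto3

dynamodb = boto3.resource('dynamodb', region_name='your-region')
table = dynamodb.Table('table-to-crawl')

# Each scan call returns at most 1 MB of data; keep paging with
# ExclusiveStartKey until DynamoDB stops returning a LastEvaluatedKey.
items = []
response = table.scan()
items.extend(response['Items'])
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])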