
We are designing a Big Data solution for one of our dashboard applications and are seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as targets, but our downstream services and components will work better with DynamoDB. We are wondering what the best approach is to eventually move the records from Glue to DynamoDB.

Should we write to S3 first and then run Lambdas to insert the data into DynamoDB? Is that the best practice? Or should we use a third-party JDBC wrapper for DynamoDB and have Glue write to DynamoDB directly (not sure if this is possible; it sounds a bit scary)? Or should we do something else?

Any help is greatly appreciated. Thanks!

Robby

4 Answers


You can add the following lines to your Glue ETL script:

    glueContext.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(df, glueContext, "final_df"),
        connection_type="dynamodb",
        connection_options={"tableName": "pceg_ae_test"})

Here df is a Spark DataFrame; DynamicFrame.fromDF converts it to the DynamicFrame that the DynamoDB sink expects.
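
For context, a minimal end-to-end sketch of a Glue job around this call might look like the following. The sample data is a placeholder, "pceg_ae_test" is just the table name from the answer above, and the sketch uses the dynamodb.output.tableName option key that the Glue documentation lists for the DynamoDB sink:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    # Standard Glue boilerplate: one SparkContext wrapped in a GlueContext.
    sc = SparkContext()
    glueContext = GlueContext(sc)

    # df stands in for whatever Spark DataFrame your job produced earlier.
    df = glueContext.spark_session.createDataFrame(
        [("cid-1",), ("cid-2",)], ["sourceCid"])

    # Convert the DataFrame to a DynamicFrame and write it to DynamoDB.
    glueContext.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(df, glueContext, "final_df"),
        connection_type="dynamodb",
        connection_options={"dynamodb.output.tableName": "pceg_ae_test"})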

Bishal Regmi
  • "AWS Glue does not currently support writing to Amazon DynamoDB." https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb – Nicus Feb 07 '20 at 15:46
  • From reading the documentation I thought I couldn't write to DynamoDB directly, but I tried the above script and it did work – Prashanth S. Jun 24 '20 at 02:40
  • Hey, I want to update an entry if it's already there in DynamoDB; how can I achieve that? glueContext.write_dynamic_frame fails when there is already an entry present with the same primary key. Please help – Eldhose Jun 23 '21 at 11:04
  • Officially, only Glue version 1 supports writing to DynamoDB – Cristián Vargas Acevedo Oct 16 '21 at 06:12

I am able to write using boto3. It's definitely not the best approach for loading, but it is a working one. :)

    import boto3

    # Table handle; region and table name as in the original snippet.
    dynamodb = boto3.resource('dynamodb', 'us-east-1')
    table = dynamodb.Table('BULK_DELIVERY')

    print("Start testing")

    # Collect the DataFrame rows on the driver and write them one at a time.
    for row in df1.rdd.collect():
        var1 = row.sourceCid
        print(var1)
        table.put_item(Item={'SOURCECID': "{}".format(var1)})

    print("End testing")
Vinay Agarwal

Suppose your data is in tabular format (CSV/Excel) and the data source is S3. Then this is how you can move the data from Glue to DynamoDB.

The majority of the work is done in Glue itself.

Create a crawler in Glue, name the database while creating the crawler, and run the crawler once it is created. (This will create the schema for the data you are providing.) If you have any doubts about creating the crawler, go through this: https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html#:~:text=To%20create%20a%20crawler%20that,Data%20Crawler%20%2C%20and%20choose%20Next. A scripted alternative is sketched just below.
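
If you would rather script this step than click through the console, here is a minimal sketch with boto3; the crawler name, IAM role, database name, and S3 path are all placeholders you would replace with your own:

    import boto3

    glue = boto3.client('glue', 'us-east-1')

    # Create a crawler over the source S3 path (all names are placeholders).
    glue.create_crawler(
        Name='dashboard-crawler',
        Role='AWSGlueServiceRole-dashboard',
        DatabaseName='dashboard_db',
        Targets={'S3Targets': [{'Path': 's3://my-bucket/input/'}]})

    # Run it once so the inferred schema lands in the Data Catalog.
    glue.start_crawler(Name='dashboard-crawler')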

Go to the left pane of AWS Glue and, under the ETL section, click on Jobs.

Click on Create job. Once that is done, remove the Data Target - S3, because we want our data target to be DynamoDB.

Now click on the Data Source - S3 bucket, add the S3 file location, and apply the transform settings based on your needs. Make sure there are no red indications.

Now, the answer to your question comes here: go to the script, click on Edit script, and add this call to the existing code.

    glue_context.write_dynamic_frame_from_options(
        frame=<name_of_the_Dataframe>,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "<DynamoDB_Table_Name>",
            "dynamodb.throughput.write.percent": "1.0"
        })

Make sure you have changed:

frame=<name_of_the_Dataframe> - name_of_the_Dataframe is generated automatically; check the variable name in the first function of the script.

"dynamodb.output.tableName": "<DynamoDB_Table_Name>" - DynamoDB_Table_Name is the table you created in DynamoDB.

Once all the above steps are done, click Save, run the script, and refresh the DynamoDB table. This is how you can load data from Amazon S3 into DynamoDB. You can also verify the load with the sketch below.
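
If you prefer to verify the load without refreshing the console, a quick check with boto3 (the table name is the placeholder from above):

    import boto3

    dynamodb = boto3.resource('dynamodb', 'us-east-1')
    table = dynamodb.Table('<DynamoDB_Table_Name>')

    # Scan a handful of items to confirm the job actually wrote data.
    response = table.scan(Limit=5)
    print(response['Items'])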

Note: the column/feature names should not be in initial caps.


For your workloads, Amazon actually recommends using Data Pipeline.

It bypasses Glue and is mostly used to load S3 files into DynamoDB, but it may work for your case.

Rafael Larios