
Scenario:

I've got a semi-structured dataset in JSON format. I'm storing the 3 subsets (new_records, updated_records, and deleted_records) of the dataset in 3 different Amazon DynamoDB tables, scheduled to be truncated and reloaded daily.

I'm trying to create a mapping that sources data from these DynamoDB tables, appends a few metadata columns (date_created, date_modified, is_active), and consolidates the data into a master DynamoDB table.
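To make the intent concrete, here's a minimal boto3 sketch of what I'm after - the master table name (master_records) is hypothetical, and pagination and error handling are omitted:

```python
# A minimal sketch of the consolidation step, assuming the three source tables
# from above and a hypothetical master table named "master_records".
# Production code would paginate scans via LastEvaluatedKey.
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
master = dynamodb.Table("master_records")  # hypothetical target table name

now = datetime.now(timezone.utc).isoformat()

for source_name, is_active in [
    ("new_records", True),
    ("updated_records", True),
    ("deleted_records", False),
]:
    source = dynamodb.Table(source_name)
    response = source.scan()  # NOTE: paginate with LastEvaluatedKey for real data
    with master.batch_writer() as batch:
        for item in response["Items"]:
            # Append the metadata columns before writing to the master table.
            item["date_created"] = item.get("date_created", now)
            item["date_modified"] = now
            item["is_active"] = is_active
            batch.put_item(Item=item)
```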

Issues and Challenges:

I tried AWS Glue - created a Data Catalog for the source tables using a crawler. I understand AWS Glue doesn't support DynamoDB as a write target, so I changed the target to Amazon S3. However, the AWS Glue job writes the data to my Amazon S3 bucket in some sort of reduced, columnar form (Parquet objects). I've limited experience with PySpark, Pig, and Hive, so excuse me if I'm unable to explain this clearly.
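For context, my Glue job amounts to roughly the following - a sketch, assuming the crawler created catalog tables in a database called source_db (the database, table, and bucket names below are placeholders):

```python
# Rough sketch of the Glue job: read a crawled source table from the Data
# Catalog, append the metadata columns, and write Parquet to S3.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import current_timestamp, lit

sc = SparkContext()
glue_context = GlueContext(sc)

# Read one of the crawled source tables from the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",      # assumed catalog database name
    table_name="new_records",  # one of the three crawled tables
)

# Convert to a Spark DataFrame and append the metadata columns.
df = (
    dyf.toDF()
    .withColumn("date_created", current_timestamp())
    .withColumn("date_modified", current_timestamp())
    .withColumn("is_active", lit(True))
)

# Glue/Spark writes columnar Parquet here - the "reduced form" in the bucket.
df.write.mode("append").parquet("s3://my-bucket/consolidated/")  # assumed bucket
```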

Some quick research on Google suggested reading the Parquet objects on Amazon S3 using Amazon Athena or Redshift Spectrum.

I'm not sure, but this looks like overkill, doesn't it?
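For completeness, this is roughly what the Athena route would look like from boto3 - a sketch where the database, table, and output location are assumptions, and Athena would first need a catalog table defined over the S3 prefix (e.g. by another crawler):

```python
# Sketch of querying the Parquet objects in S3 with Athena via boto3.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM consolidated_records LIMIT 10;",  # assumed table
    QueryExecutionContext={"Database": "source_db"},             # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```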

I read about AWS Data Pipeline, which offers to quickly transfer data between different AWS services. However, I'm not sure whether it provides a mechanism to create mappings between source and target (in order to append additional columns), or whether it simply dumps data from one service to another.

Can anyone point me toward a lucid, minimal solution?

-- Update --

I've been able to consolidate the data from Amazon DynamoDB into Amazon Redshift using AWS Glue, which turned out to be quite simple.
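For anyone else trying this, the write step looks roughly like the following - a sketch that assumes a pre-created Glue connection to the Redshift cluster (the connection, database, and table names are placeholders):

```python
# Sketch of the DynamoDB-to-Redshift step via Glue. Requires a Glue
# connection to the Redshift cluster and an S3 staging directory.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",      # assumed catalog database
    table_name="new_records",
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # assumed Glue connection name
    connection_options={"dbtable": "master_records", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",  # staging dir Glue requires
)
```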

However, Amazon Redshift has a few characteristic issues - its relational nature, and its inability to perform a merge or upsert as a single statement, are the major things I'm weighing here.
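The common workaround appears to be a staging table plus a delete-and-insert inside one transaction - a sketch with psycopg2, where the endpoint, table, and column names are illustrative and the staging table is assumed to be loaded already (e.g. by the Glue job above):

```python
# Sketch of the staging-table upsert pattern for Redshift, which lacks a
# single-statement merge/upsert. All names here are illustrative.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # assumed endpoint
    port=5439, dbname="dev", user="admin", password="...",
)

# The connection context manager commits the whole block as one transaction.
with conn, conn.cursor() as cur:
    # Assumes master_records_staging was already loaded (e.g. via COPY or Glue).
    cur.execute("""
        DELETE FROM master_records
        USING master_records_staging
        WHERE master_records.id = master_records_staging.id;
    """)
    cur.execute("INSERT INTO master_records SELECT * FROM master_records_staging;")
    cur.execute("TRUNCATE master_records_staging;")
```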

I'm considering whether Amazon Elasticsearch Service could be used here instead, to index and consolidate the data from Amazon DynamoDB.
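If I do go that route, the usual pattern seems to be DynamoDB Streams triggering a Lambda that indexes each change into the Elasticsearch domain - a rough sketch, where the domain endpoint, index name, and key attribute are all assumptions:

```python
# Rough sketch of a stream-triggered Lambda indexing changes into Amazon ES.
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"  # assumed region
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

# Assumed domain endpoint and index name.
ES_URL = "https://my-domain.us-east-1.es.amazonaws.com/master_records/_doc"


def handler(event, context):
    # Each stream record carries the new image of the changed item.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            doc_id = image["id"]["S"]  # assumed key attribute
            # Indexes the raw DynamoDB-typed image; a real implementation
            # would first unmarshal the typed attributes.
            requests.put(f"{ES_URL}/{doc_id}", auth=awsauth, json=image)
```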

Ronnie

1 Answer


I'm not sure about your needs and assumptions, but let me post some thoughts that may help:

  1. Why are you planning to do this migration? Think about this carefully.
  2. Moving from 3 tables to 1 table, table size should not be an issue with DynamoDB, but think about read/write capacity units.
  3. Athena is a good option: you write SQL to query your data and pay based on the data scanned by each query. But note that Athena has a 30-minute query timeout. (I think you can request an increase for that, but I'm not sure!)
  4. I think it is worth trying AWS Data Pipeline. Yes, you can process the data while moving it.
  • I consider SQL a plus for Athena; however, some people prefer not to use SQL when the data structure is simple and decide to use Kibana, for example. So again, I see it as depending on your needs. – Mohamed Saleh Mar 09 '20 at 12:55
  • The consolidation process is done merely to connect and merge operational databases into an OLAP system - a warehouse. So it's not really a migration; I'm laying the foundations for data integration here. – Ronnie Mar 13 '20 at 04:05
  • I have configured read/write provisioned capacity units appropriately for the DynamoDB tables, so that shouldn't be a problem. – Ronnie Mar 13 '20 at 04:07
  • Please correct me if I'm wrong, but I believe Amazon Athena is a query service that helps analyze data in Amazon S3 using standard SQL. Using different AWS services for storage and querying adds unnecessary complications. I would rather use Amazon Redshift, which provides both data warehouse storage and querying abilities. – Ronnie Mar 13 '20 at 04:15
  • I'm currently exploring the AWS Data Pipeline documentation to develop a deeper understanding. – Ronnie Mar 13 '20 at 04:17
  • I do not see any complication with "using different AWS services for storage and querying". Double-check the pricing model as well: in Redshift you may load data that is infrequently accessed and still pay for it, whereas that doesn't apply to Athena because you just pay for S3 storage. – Mohamed Saleh Mar 14 '20 at 08:43
  • Designing software architectures always follows the principle of KISS - Keep It Simple, Silly! Connecting 10 different AWS services is not a challenging task; the aim is to accomplish the same with the minimum number of services. – Ronnie Mar 14 '20 at 09:49
  • I think I'm misreading your solution. Are you suggesting storing the three tables in S3, running Athena queries on top of them, and storing the query results in another S3 bucket? – Ronnie Mar 14 '20 at 09:59