3

Firstly, I'm very new to DynamoDB and AWS services in general, so I'm finding it hard when bombarded with all the details.

My problem is that I have an Excel file with my data in CSV format, and I'm looking to add that data to a DynamoDB table for easy access from the Alexa skill I'm looking to build. The format of the table is as follows:

ID, Name, Email, Number, Room

1534234, Dr Neesh Patel, Patel.Neesh@work.com, +44 (0)3424 111111, HW101

Some of the rows have empty fields.

But everywhere I look online, there doesn't appear to be an easy way to actually achieve this, and I can't find any official means either. So, with my limited knowledge of this area, I'm questioning whether I'm going about this entirely the wrong way. Firstly, am I thinking about this wrong? Should I be looking at a completely different solution for a backend database? I would have thought this would be a common task, but given the lack of support or easy solutions, am I wrong?

Secondly, if my approach is fine, how can it be done? I understand that DynamoDB requires a specific JSON format, and again there doesn't appear to be a straightforward way to convert my CSV into that format.

Thanks, guys.

j panton

2 Answers

4

I had the same problem when I started using DynamoDB. When you move to distributed, big-data systems, you really do need to think about how to move data between them. This is where to start.

It is clearly documented here:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html

Here are some more details to help you understand the process.

Step 1: Convert your CSV to a JSON file.

If you have a small amount of data, you can use an online tool such as:

http://www.convertcsv.com/csv-to-json.htm

 {
   "ID": 1534234,
   "Name": "Dr Neesh Patel",
   "Email": "Patel.Neesh@work.com",
   "Number": "+44 (0)3424 111111",
   "Room": "HW101"
 }

You can see how nicely it is formatted: extra spaces removed, etc. Choose the right options and perform your conversion.

If your data is huge, then you will need big-data tools to process and convert it in parallel.
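
If the online tool isn't an option, a few lines of Python can do the conversion as well. One thing worth noting for Step 2 below: batch-write-item does not take the plain JSON shown above directly; it expects the items grouped under the table name, with the attribute types spelled out. A minimal sketch that produces that format, assuming the file is called staff.csv and the table is called Staff (both names are placeholders):

    import csv
    import json

    # Convert staff.csv (assumed filename) into the request format that
    # `aws dynamodb batch-write-item` expects, for an assumed table named "Staff".
    requests = []
    with open("staff.csv", newline="") as f:
        for row in csv.DictReader(f):
            item = {}
            for key, value in row.items():
                value = value.strip()
                if not value:              # the question mentions some fields are empty
                    continue
                # ID is numeric ("N"); everything else is stored as a string ("S").
                item[key] = {"N": value} if key == "ID" else {"S": value}
            requests.append({"PutRequest": {"Item": item}})

    with open("data.json", "w") as f:
        json.dump({"Staff": requests}, f, indent=2)

Also note that batch-write-item accepts at most 25 put requests per call, so a longer file has to be split into chunks of 25 and uploaded over several calls.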

Step 2: Upload using the CLI, for small, one-time uploads.

aws dynamodb batch-write-item --request-items file://data.json

If you want to regularly upload the file, you need to create a data pipeline or a different process.
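
Data Pipeline is the managed way to do that, but for a table this small a scheduled script (or a Lambda on a timer) using boto3 would also work; a sketch under the same assumed file and table names. boto3's batch_writer takes care of the 25-item batching and retries:

    import csv
    import boto3

    table = boto3.resource("dynamodb").Table("Staff")   # assumed table name

    with open("staff.csv", newline="") as f, table.batch_writer() as batch:
        for row in csv.DictReader(f):
            item = {k: v.strip() for k, v in row.items() if v.strip()}
            item["ID"] = int(item["ID"])   # keep the hash key numeric
            batch.put_item(Item=item)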

Hope it helps.

Kannaiyan
  • 'If you want to regularly upload the file, you need to create a data pipeline or a different process', why is this? Can you give a simple breakdown of what the architecture of something like that might look like? – j panton Apr 18 '18 at 19:35
  • Check out AWS Data Pipeline: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html – Kannaiyan Apr 18 '18 at 22:00
2

DynamoDb is cool. However, before you use it you have to know your data usage patterns. In your case, if you're only ever going to query the DynamoDb table by ID, then it is great. If you need to query by any one column or combination of columns, then there are solutions for that:

  • Elasticsearch in conjunction with DynamoDb (which can be expensive),
  • secondary indexes on the DynamoDb table (understand that each secondary index creates a full copy of your DynamoDb table with the columns you choose to store in the index; see the query sketch after this list),
  • ElastiCache in conjunction with DynamoDb (for tying searches back to the ID column),
  • RDS instead of DynamoDb ('cause a SQL-ish db is better when you don't know your data usage patterns and you just don't want to think about it),
  • etc.
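
For example, with a global secondary index whose partition key is Name, the Alexa skill could look people up by name directly. A rough sketch with boto3, assuming a table named Staff and an index named NameIndex (both are placeholders, not existing resources):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Staff")   # assumed table name

    # Query the assumed "NameIndex" GSI, whose partition key is Name.
    resp = table.query(
        IndexName="NameIndex",
        KeyConditionExpression=Key("Name").eq("Dr Neesh Patel"),
    )
    for item in resp["Items"]:
        print(item["Room"])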

How much data you have and how you'll query it should define your architecture. For me, it would come down to weighing the cost and performance of each of the options available.

In terms of getting the data into your DynamoDb or RDS table:

  • AWS Glue may be able to work for you
  • AWS Lambda to programmatically get the data into your data store(s) (a sketch follows this list)
  • perhaps others
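
As an illustration of the Lambda option above, a function triggered whenever a new CSV is uploaded to an S3 bucket could re-import the data automatically; a sketch only, with the event wiring, table name and CSV layout all assumed:

    import csv
    import boto3

    table = boto3.resource("dynamodb").Table("Staff")   # assumed table name
    s3 = boto3.client("s3")

    def handler(event, context):
        # Fired by an S3 put event: read the uploaded CSV and rewrite every row.
        info = event["Records"][0]["s3"]
        body = s3.get_object(Bucket=info["bucket"]["name"], Key=info["object"]["key"])["Body"]
        rows = csv.DictReader(body.read().decode("utf-8").splitlines())
        with table.batch_writer() as batch:
            for row in rows:
                item = {k: v.strip() for k, v in row.items() if v.strip()}
                item["ID"] = int(item["ID"])
                batch.put_item(Item=item)
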
Zaxxon
  • Well, for example, data will be queried based on the voice input of the user. A user is unlikely to say "which room is ID 1534234 in?", but would rather ask "Which room is Neesh Patel in?"... By my interpretation, that means that the 'Name' column will be the one used most commonly for queries, am I correct? My problem is that I'm looking for something that can have seamless integration with my Alexa application, so trying to keep it in-house with AWS... – j panton Apr 18 '18 at 19:31
  • Yep, that makes sense, the question could be anything. If it is anything and you have a lot of data and the query is against an unindexed column, it'll be very slow, which you'd have to deal with. Take a look at this link, as it goes into a little more detail on what is possible: https://stackoverflow.com/questions/37875050/how-to-store-the-data-for-the-alexa-skill-i-am-developing .. One last question: how much data will you be storing? – Zaxxon Apr 18 '18 at 20:18
  • Right now I'm a little foggy on how much data I'm storing. With this particular table, there are only 69 lines in the CSV file, but there may be another I am adding in the future which may have more columns and a couple hundred rows... I don't think that classifies as 'huge data'? But I'm looking for it to possibly automatically change its records constantly to the most up-to-date data. – j panton Apr 18 '18 at 20:36
  • Less than 300 records, no worries on using DynamoDb then. And no need for Elasticsearch, ElastiCache, RDS, etc. Table scans on that small a set of data will be negligible. And it'll be dirt cheap. You could lower the DynamoDb autoscaling to 1 write and 3 read with an upper end of 5 and 100. Probably won't cost more than $1 a month. But that will depend on how often you read the data. – Zaxxon Apr 18 '18 at 20:43
  • Thanks, you've been really helpful. Out of interest, what kind of data WOULD require workaround solutions? – j panton Apr 18 '18 at 20:54
  • You'd have to think of it in a "per query per second" context. A DynamoDb "read unit per second" consists of 4 KB of data. If your ~264 rows of data contain less than 4 KB in total, then each Alexa query will consume a single read unit (oversimplified math; see this for more details: http://techtraits.com/cloud/nosql/2012/06/28/Amazon-DynamoDB-Understanding-Query-and-Scan-operations.html). You'll need to worry about the alternatives when the total size of your DynamoDb table grows to a point where it is more expensive in time and cost to do a table scan than an alternative. – Zaxxon Apr 19 '18 at 16:42
  • I guess the last question I have is: how often will this DynamoDb table be called? More info: Amazon charges per hour for configured read and write capacity, so 1 WCU supports 3,600 writes per hour. For example, 1,000 writes per minute * 60 minutes = 60,000 writes in one hour, and 60,000 / 3,600 = 16.67 WCU. Source: aws.amazon.com/dynamodb/pricing, search for "hour". – Zaxxon Apr 19 '18 at 16:42
  • Is my Primary Key/HASH key useless? If it's the ID column as shown in the question, I can't even think of a time when it will be queried because of the use cases for my application. As I mentioned above, a user would never ask 'What room is user 1534234 in?"... yet there are no other suitable primary keys to use. – j panton Apr 22 '18 at 09:41
  • Yes and no. You need a unique hash key regardless of whether you'll use it or not. In the future, if your data grows enough you'll be using the hash key as a means of selecting the row of data you're looking for. Then you'd have other mechanisms to search against, like elasticache, etc. Also, it may be that you want to associate data to a user that would never be searched on. You could put that data in another table to keep your primary table lean. Then you could use the hash key to get the data from your secondary table. – Zaxxon Apr 26 '18 at 16:40