
This question has already been posted on the AWS forums but remains unanswered: https://forums.aws.amazon.com/thread.jspa?threadID=94589

I'm trying to perform an initial upload of a long list of short items (about 120 million of them), to retrieve them later by unique key, and it seems like a perfect case for DynamoDB.

However, my current write speed is very slow (roughly 8-9 seconds per 100 writes), which makes the initial upload almost impossible (it would take about 3 months at the current pace).

I have read the AWS forums looking for an answer and have already tried the following:

  1. I switched from single `put_item` calls to batch writes of 25 items (the recommended maximum batch size), and each of my items is smaller than 1 KB (also as recommended). It is typical for even a whole batch of 25 items to stay under 1 KB, but that is not guaranteed (and shouldn't matter anyway, since as I understand it only the individual item size counts for DynamoDB). A simplified sketch of one batch call is shown after this list.

  2. I use the recently introduced EU region (I'm in the UK), specifying its endpoint directly by calling `set_region('dynamodb.eu-west-1.amazonaws.com')`, as there is apparently no other way to do that in the PHP API. The AWS console shows that the table is in the proper region, so that works.

  3. I have disabled SSL by calling `disable_ssl()` (gaining about 1 second per 100 records).
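For reference, here is a simplified sketch of what each batch call looks like (SDK 1.x, written from memory; the include path, table name and attribute names are placeholders, and the real code chunks the source data into groups of 25 and re-submits anything returned in `UnprocessedItems`):

```php
<?php
require_once 'AWSSDKforPHP/sdk.class.php'; // placeholder path to the SDK

$dynamodb = new AmazonDynamoDB();
$dynamodb->set_region('dynamodb.eu-west-1.amazonaws.com'); // EU (Ireland) endpoint
$dynamodb->disable_ssl();                                  // saves ~1 s per 100 records

// One batch of up to 25 short items (placeholder table/attribute names).
$response = $dynamodb->batch_write_item(array(
    'RequestItems' => array(
        'MyTable' => array(
            array('PutRequest' => array('Item' => array(
                'id'    => array(AmazonDynamoDB::TYPE_STRING => 'key-000001'),
                'value' => array(AmazonDynamoDB::TYPE_STRING => 'short payload'),
            ))),
            // ... up to 24 more PutRequest entries ...
        ),
    ),
));
```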

Still, a test set of 100 items (4 batch write calls of 25 items each) never takes less than 8 seconds to index. Every batch write request takes about 2 seconds, so it's not as if the first one is instant and subsequent requests are then slow.

My table's provisioned throughput is 100 write and 100 read units, which should be enough for now (I tried higher limits as well just in case, with no effect).
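To put rough numbers on the "3 months" estimate (assuming each item is at most 1 KB, i.e. roughly one write unit per item):

```
Observed:    100 items / 8 s  ≈ 12.5 writes/s
             120,000,000 / 12.5 ≈ 9,600,000 s ≈ 110 days (≈ 3.5 months)

Theoretical: 100 provisioned write units ≈ 100 writes/s
             120,000,000 / 100  ≈ 1,200,000 s ≈ 14 days
```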

I also know that there is some overhead in request serialisation, so I could probably use a queue to "accumulate" my requests, but does that really matter much for batch writes? I don't think that is the problem anyway, because even a single request takes too long.

I found that some people modify the cURL headers (the "Expect:" header in particular) in the API to speed requests up, but I don't think that is the proper way, and the API has also been updated since that advice was posted.
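For context, the tweak those posts describe is suppressing cURL's default `Expect: 100-continue` handshake, something along these lines (illustration only, not something I'm doing):

```php
<?php
// Hypothetical illustration: sending an empty "Expect:" header stops cURL
// from pausing for the server's interim 100-continue reply on each POST.
$ch = curl_init('https://dynamodb.eu-west-1.amazonaws.com/');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));
```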

The server my application is running on is fine as well - I've read that sometimes the CPU load goes through the roof, but in my case everything is fine; it's just the network request that takes too long.

I'm stuck now - is there anything else I can try? Please feel free to ask for more information if I haven't provided enough.

There are other recent threads, apparently on the same problem, here (no answer so far though).

This service is supposed to be ultra-fast, so I'm really puzzled by that problem in the very beginning.

Yuriy
  • Sounds like you need a relational database like SQL Server. Just `SqlBulkCopy` the data in. SQL Server is web scale, if you're asking. – ta.speot.is May 21 '12 at 11:23
  • I don't need a relational DB here (it's a flat index with no actual relations), but yes, I'm thinking of retreating to MySQL or Solr if I have no other options. Yet for now I'm still keen to understand what's wrong with this approach. – Yuriy May 21 '12 at 11:29
  • Your forum post has been replied to: https://forums.aws.amazon.com/thread.jspa?messageID=365597#365597 – Jeremy Lindblom Jul 18 '12 at 15:48
  • Thanks, will give it another try should the need arise again. – Yuriy Jul 18 '12 at 18:38

3 Answers


If you're uploading from your local machine, the speed will be impacted by all sorts of traffic, firewalls, etc. between you and the servers. When I call DynamoDB, each request takes 0.3 of a second simply because of the time to travel to/from Australia.

My suggestion would be to create an EC2 instance (server) with PHP, upload the script and all files to the EC2 server as a block, and then do the dump from there. The EC2 server should have blistering speed to the DynamoDB servers.

If you're not confident about setting up EC2 with LAMP yourself, they have a new service, "Elastic Beanstalk", that can do it all for you. When you've completed the upload, simply burn the server - and hopefully you can do all that within their "free tier" pricing structure :)

It doesn't solve the long-term connectivity issues, but it will cut down the three-month upload!

Robbie
  • Thanks for your answer. I didn't try Beanstalk, but was trying to use Elastic MapReduce instead - there is still a problem there, which I have created another question for: http://stackoverflow.com/questions/10683136/amazon-elastic-mapreduce-mass-insert-from-s3-to-dynamodb-is-incredibly-slow – Yuriy May 21 '12 at 11:23
  • Like you mentioned, even from Australia it's still under 0.5 sec for you, so it can't be 2 seconds for me from London to Ireland. Our connection is very good, so for now I rule that out. – Yuriy May 21 '12 at 11:25
  • 2 seconds is insanely slow, but it might be as simple as a firewall on the server doing some "checks", or a firewall on the router doing other "checks". (Or, being cynical, a way for Amazon to push you towards EC2, possibly?!) As I said - it's not a long-term solution, just something to get the upload done. If you want to keep it local, why not look at Cassandra or Mongo? But if you're using Amazon and paying, just shift the server there - it'll keep them happy :) – Robbie May 21 '12 at 11:41

I would try a multithreaded upload to increase throughput. Maybe add threads one at a time and see if the throughput increases linearly. As a test you can just run two of your current loaders at the same time and see if they both go at the speed you are observing now.
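A rough sketch of the idea using the pcntl extension (assuming it is available; `load_chunk()` is a hypothetical wrapper around your existing batch-write loop):

```php
<?php
// Rough sketch: fork N worker processes, each loading a different slice of
// the source data. load_chunk($offset, $step) is a hypothetical helper that
// runs the existing batch-write loop over every $step-th batch from $offset.
$workers = 4;

for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: create its own DynamoDB client inside load_chunk() so the
        // workers don't share a connection.
        load_chunk($i, $workers);
        exit(0);
    }
}

// Parent: wait for all workers to finish.
while (pcntl_waitpid(-1, $status) > 0);
```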

Chris Seline

I had good success with the PHP SDK by using the batch method on the AmazonDynamoDB class. I was able to run about 50 items per second from an EC2 instance. The method works by queuing up requests until you call the send method, at which point it executes multiple simultaneous requests using cURL. Here are some good references:

http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LoadData_PHP.html

http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LowLevelPHPItemOperationsExample.html
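If I remember the pattern from those docs correctly, it looks roughly like this (table/attribute names and the include path are placeholders):

```php
<?php
require_once 'AWSSDKforPHP/sdk.class.php'; // placeholder path to the SDK

$dynamodb = new AmazonDynamoDB();

// Placeholder data set.
$items = array('key-000001' => 'short payload' /* , ... */);

// Queue up put_item requests without sending them yet.
foreach ($items as $id => $value) {
    $dynamodb->batch()->put_item(array(
        'TableName' => 'MyTable',
        'Item' => array(
            'id'    => array(AmazonDynamoDB::TYPE_STRING => (string) $id),
            'value' => array(AmazonDynamoDB::TYPE_STRING => $value),
        ),
    ));
}

// Fire the whole queue as parallel cURL requests.
$responses = $dynamodb->batch()->send();
```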

I think you can also use Hive SQL on Elastic MapReduce to bulk load data from a CSV file. EMR can use multiple machines to spread the workload and achieve high concurrency.

  • Thanks, Jonathan, but I have rewritten the functionality to use a local index. Regarding Hive, there is also a problem with it when it comes to DynamoDB, which has been confirmed by Amazon (see my other question and my self-posted answer): http://stackoverflow.com/questions/10683136/amazon-elastic-mapreduce-mass-insert-from-s3-to-dynamodb-is-incredibly-slow – Yuriy Jul 02 '12 at 15:28