
I have written a program that writes items into a DynamoDB table. Now I would like to read all items from that table using PySpark. Are there any libraries available to do this in Spark?

sms_1190

2 Answers


You can use the parallel scans available in the DynamoDB API through boto3, combined with a scheme like the parallel S3 file processing application for PySpark described here. Basically, instead of reading all the keys a priori, create a list of segment numbers and hard-code the maximum number of scan segments in the map_func function for Spark.
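A minimal sketch of that approach, assuming boto3 is installed on the Spark workers; the table name, region, and segment count below are placeholders you would replace with your own values:

    import boto3
    from pyspark.sql import SparkSession

    # Placeholder values; adjust table name, region, and segment count for your data.
    TABLE_NAME = "my-table"
    REGION = "us-east-1"
    TOTAL_SEGMENTS = 16  # hard-coded maximum number of parallel scan segments

    def scan_segment(segment):
        """Scan one DynamoDB parallel-scan segment and return its items."""
        dynamodb = boto3.resource("dynamodb", region_name=REGION)
        table = dynamodb.Table(TABLE_NAME)
        items = []
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        while True:
            response = table.scan(**kwargs)
            items.extend(response.get("Items", []))
            last_key = response.get("LastEvaluatedKey")
            if not last_key:
                break
            kwargs["ExclusiveStartKey"] = last_key
        return items

    spark = SparkSession.builder.appName("dynamodb-parallel-scan").getOrCreate()

    # One Spark task per segment number; each task runs its own scan of that segment.
    segments = spark.sparkContext.parallelize(range(TOTAL_SEGMENTS), TOTAL_SEGMENTS)
    items_rdd = segments.flatMap(scan_segment)
    print(items_rdd.count())

Each task scans only its own segment, so the table is read in parallel across the cluster without enumerating keys first.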

Alexander Patrikalakis

Another option is to export the DynamoDB table to S3. You can use an S3 trigger to kick off a Lambda function, or process the resulting files manually. For me the export was a great option because it ran so quickly: it took about an hour to export roughly 1 TB, which was about one billion rows. From there you have complete flexibility in how you analyze the data.
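As a rough sketch, assuming the export was done in the DynamoDB JSON format (the bucket and prefix below are placeholders), the resulting gzipped JSON files can be read directly with PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-dynamodb-export").getOrCreate()

    # Placeholder path; an export in DynamoDB JSON format writes gzipped
    # JSON-lines files where each line looks like {"Item": {...attribute map...}}.
    export_path = "s3://my-bucket/my-export-prefix/data/*.json.gz"

    df = spark.read.json(export_path)
    df.select("Item.*").show(truncate=False)

Spark decompresses the gzipped files and infers the nested "Item" structure automatically, so the exported rows are immediately queryable as a DataFrame.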

Garet Jax