I have written a program to write items into a DynamoDB table. Now I would like to read all of the items from that table using PySpark. Are there any libraries available to do this in Spark?


sms_1190
- Were you able to do this? – rabz100 May 09 '16 at 20:54
- No, I just used what the boto documentation provided, without Spark. – sms_1190 May 09 '16 at 20:58
- I would try to tweak this code: https://github.com/bchew/dynamodump – Tom Ron Aug 24 '16 at 12:17
- Any sample on how you got this to work, sms_1190? – ZZzzZZzz Sep 22 '17 at 17:28
2 Answers
1
You can use the parallel scans available in the DynamoDB API through boto3, together with a scheme like the parallel S3 file-processing application for PySpark described here. Basically, instead of reading all the keys a priori, just create a list of segment numbers and hard-code the maximum number of scan segments in the map_func function for Spark.
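
For reference, a minimal sketch of that approach (not the answerer's exact code): the table name, region, and segment count below are placeholders, and the boto3 resource is created inside the worker function so nothing unpicklable is shipped from the driver.

```python
import boto3
from pyspark.sql import SparkSession

TABLE_NAME = "my-table"   # placeholder table name
REGION = "us-east-1"      # placeholder region
TOTAL_SEGMENTS = 16       # hard-coded maximum number of parallel Scan segments


def scan_segment(segment):
    """Scan one segment of the table and yield its items."""
    # Create the DynamoDB resource inside the worker, not on the driver.
    table = boto3.resource("dynamodb", region_name=REGION).Table(TABLE_NAME)
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        response = table.scan(**kwargs)
        for item in response.get("Items", []):
            yield item
        # Keep paginating until DynamoDB stops returning a continuation key.
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]


spark = SparkSession.builder.appName("dynamodb-parallel-scan").getOrCreate()

# One Spark partition per DynamoDB Scan segment; each partition runs one scan.
segments = spark.sparkContext.parallelize(range(TOTAL_SEGMENTS), TOTAL_SEGMENTS)
items = segments.flatMap(scan_segment)

print(items.count())
```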

Alexander Patrikalakis
0
Another option is to export the DynamoDB rows to S3. You can use an S3 trigger to kick off a Lambda function, or even process the resulting files manually. For me the export was a great option because it happened so quickly: it took about an hour to export 1 TB, which was about 1 billion rows. From there you have complete flexibility in how you analyze it.
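
A minimal sketch of reading such an export back into PySpark, assuming the native DynamoDB "Export to S3" layout of gzipped DynamoDB-JSON files; the bucket name and export ID below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-dynamodb-export").getOrCreate()

# Placeholder path; the native export writes gzipped DynamoDB-JSON files
# under AWSDynamoDB/<export-id>/data/. Spark decompresses .gz transparently.
export_path = "s3a://my-export-bucket/AWSDynamoDB/EXPORT_ID/data/*.json.gz"
df = spark.read.json(export_path)

# Each line holds an "Item" struct whose attributes carry DynamoDB type tags
# (S, N, B, ...); unwrap them as needed.
df.select("Item.*").printSchema()
df.show(5, truncate=False)
```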

Garet Jax