
I have written a program that writes items into a DynamoDB table. Now I would like to read all items from that table using PySpark. Are there any libraries available to do this in Spark?

sms_1190

2 Answers


You can use the parallel scans available in the DynamoDB API through boto3, combined with a scheme like the parallel S3 file processing application for PySpark described here. Basically, instead of reading all the keys a priori, create a list of segment numbers and hard-code the maximum number of scan segments in the map_func function for Spark.
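A minimal sketch of that approach, assuming boto3 is installed on the Spark workers; the table name, region, and segment count below are placeholders you would replace with your own values:

    import boto3
    from pyspark.sql import SparkSession

    # Placeholder values; adjust table name, region, and segment count for your data.
    TABLE_NAME = "my-table"
    REGION = "us-east-1"
    TOTAL_SEGMENTS = 16  # hard-coded maximum number of parallel scan segments

    def scan_segment(segment):
        """Scan one DynamoDB parallel-scan segment and return its items."""
        dynamodb = boto3.resource("dynamodb", region_name=REGION)
        table = dynamodb.Table(TABLE_NAME)
        items = []
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        while True:
            response = table.scan(**kwargs)
            items.extend(response.get("Items", []))
            last_key = response.get("LastEvaluatedKey")
            if not last_key:
                break
            kwargs["ExclusiveStartKey"] = last_key
        return items

    spark = SparkSession.builder.appName("dynamodb-parallel-scan").getOrCreate()

    # One Spark task per segment number; each task runs its own scan of that segment.
    segments = spark.sparkContext.parallelize(range(TOTAL_SEGMENTS), TOTAL_SEGMENTS)
    items_rdd = segments.flatMap(scan_segment)
    print(items_rdd.count())

Each task scans only its own segment, so the table is read in parallel across the cluster without enumerating keys first.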

Alexander Patrikalakis

Another option is to export the DynamoDB table to S3. You can use an S3 trigger to kick off a Lambda function, or process the resulting files manually. For me the export was a great option because it ran so quickly: it took about an hour to export roughly 1 TB, which was about one billion rows. From there you have complete flexibility in how you analyze the data.
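As a rough sketch, assuming the export was done in the DynamoDB JSON format (the bucket and prefix below are placeholders), the resulting gzipped JSON files can be read directly with PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-dynamodb-export").getOrCreate()

    # Placeholder path; an export in DynamoDB JSON format writes gzipped
    # JSON-lines files where each line looks like {"Item": {...attribute map...}}.
    export_path = "s3://my-bucket/my-export-prefix/data/*.json.gz"

    df = spark.read.json(export_path)
    df.select("Item.*").show(truncate=False)

Spark decompresses the gzipped files and infers the nested "Item" structure automatically, so the exported rows are immediately queryable as a DataFrame.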

Garet Jax