
I have been looking at options to load (basically empty and restore) a Parquet file from S3 into DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. A few things to keep in mind:

  1. I cannot use AWS Data Pipeline.
  2. The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I suspect the boto API (even with batch writes) would not be efficient enough; a sketch of what I mean follows this list.
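For reference, this is roughly the boto-based approach I have in mind (a minimal sketch using boto3's `batch_writer`; the table name and file path are hypothetical):

```python
import boto3
import pandas as pd  # assumes the Parquet part file has been pulled down locally

# Hypothetical table and file names, just to illustrate the batch-write path.
table = boto3.resource("dynamodb").Table("my-table")
df = pd.read_parquet("part-00000.parquet")

# batch_writer buffers items and sends them as 25-item BatchWriteItem calls,
# but it is still a single process writing sequentially.
# (Numeric values would also need converting to Decimal for DynamoDB; omitted here.)
with table.batch_writer() as writer:
    for record in df.to_dict(orient="records"):
        writer.put_item(Item=record)
```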

Are there any other alternatives?

ranjith

2 Answers


Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Setting aside the challenge of caching the DynamoDB client in each worker for reuse across rows, a small bit of Scala that takes a row, builds an item for DynamoDB, and PUTs it should be enough.
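A rough sketch of that pattern, written here in PySpark for consistency with the other answer (the table name, S3 path, and column handling are all hypothetical; the same idea works in Scala with the AWS SDK):

```python
import boto3
from decimal import Decimal

def write_partition(rows):
    # One client per partition, created on the executor and reused for every row.
    table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table
    with table.batch_writer() as writer:
        for row in rows:
            # Build a DynamoDB item from the Spark Row; floats must become Decimals.
            item = {k: (Decimal(str(v)) if isinstance(v, float) else v)
                    for k, v in row.asDict().items()}
            writer.put_item(Item=item)

df = spark.read.parquet("s3://my-bucket/path/")  # hypothetical path

# foreachPartition runs on the executors, so each worker writes its own slice.
df.foreachPartition(write_partition)
```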

BTW: use DynamoDB on-demand capacity here, as it absorbs peak loads well without you having to commit to a provisioned throughput level up front.
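If the table already exists in provisioned mode, switching it over is a single call (a sketch with boto3; the table name is hypothetical):

```python
import boto3

# Switch an existing table to on-demand (pay-per-request) billing.
dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="my-table",          # hypothetical table name
    BillingMode="PAY_PER_REQUEST",
)
```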

stevel
  • I don't think there is a way to put from an RDD to DynamoDB. At least I couldn't find any reference. Any pointers? – ranjith Apr 23 '19 at 16:46
  • You'll have to implement it yourself, I'm afraid. But it would be "the Spark way". – stevel Apr 24 '19 at 11:08

Look at this answer: https://stackoverflow.com/a/59519234/4253760

To explain the process:

  1. Create the desired dataframe.
  2. Use .withColumn to create a new column, and use psf.collect_list to build the desired collection/JSON format in that new column of the same dataframe.
  3. Drop all unnecessary (tabular) columns and keep only the JSON-formatted column(s) in the Spark dataframe; a rough sketch of steps 1-3 follows this list.
  4. Load the JSON data into DynamoDB as explained in the linked answer.
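A rough PySpark sketch of steps 1-3, assuming `psf` is `pyspark.sql.functions`; the S3 path, grouping key, and column names are hypothetical, and the exact collection shape should follow the linked answer:

```python
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Create the desired dataframe (hypothetical S3 path).
df = spark.read.parquet("s3://my-bucket/path/")

# 2. Collect rows into a list of structs and render it as a JSON column.
#    "pk" is a hypothetical grouping key; adjust to your data model.
json_df = (
    df.groupBy("pk")
      .agg(psf.collect_list(psf.struct(*df.columns)).alias("items"))
      .withColumn("json_payload", psf.to_json(psf.col("items")))
)

# 3. Keep only the JSON-formatted column(s).
json_df = json_df.select("pk", "json_payload")
```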

My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the DataFrame API in any language. The DataFrame API's performance is language agnostic, as long as you don't use UDFs.

sumon c