
I am using AWS Glue jobs to back up DynamoDB tables to S3 in Parquet format so the data can be queried with Athena.

To restore a table in DynamoDB from these Parquet files in S3, this is what I am thinking: read each Parquet file, convert it to JSON, and then insert the JSON-formatted data into DynamoDB (using PySpark along the lines below).

# create the Spark entry point and convert the Parquet backup to JSON
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
parquetFile = spark.read.parquet(input_file)   # read one Parquet file from the backup
parquetFile.write.json(output_path)            # write it back out as JSON lines

Then convert the plain JSON into the DynamoDB-formatted JSON using https://github.com/Alonreznik/dynamodb-json.
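
For that step, here is a minimal sketch of the conversion plus the write into DynamoDB, assuming the dynamodb-json package is installed; the table name my_restored_table is just a placeholder:

import json as std_json

import boto3
from dynamodb_json import json_util as ddb_json

dynamodb = boto3.client("dynamodb")

# one item as it comes out of the plain-JSON output of the Spark job
item = {"id": "123", "name": "example", "score": 42}

# dynamodb_json.dumps() returns a DynamoDB-formatted JSON string,
# e.g. {"id": {"S": "123"}, "name": {"S": "example"}, "score": {"N": "42"}}
ddb_item = std_json.loads(ddb_json.dumps(item))

dynamodb.put_item(TableName="my_restored_table", Item=ddb_item)

Note that boto3's higher-level Table resource (boto3.resource("dynamodb").Table(...)) accepts plain Python dicts and handles the type marshalling itself, so the explicit conversion is mainly needed with the low-level client.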

Does this approach sound right? Are there any other alternatives to this approach?

nmakb
  • You can write directly to DynamoDB from Spark using [emr-dynamodb-connector](https://github.com/awslabs/emr-dynamodb-connector). That way, there's no need to convert to JSON. – blackbishop Dec 29 '19 at 11:52
  • Thanks, this helped. I was able to import into DynamoDB using Hive. – nmakb Jan 01 '20 at 02:18

2 Answers


You can use AWS Glue to convert the Parquet files directly into JSON, then create a Lambda function that triggers on the S3 put and loads the data into DynamoDB.

https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
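
A minimal sketch of what such a Lambda handler could look like, assuming the Glue job writes newline-delimited JSON; the table name my_restored_table is a placeholder, not something from the answer:

import json
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my_restored_table")  # placeholder table name


def lambda_handler(event, context):
    # one record per object created in the bucket (s3:ObjectCreated:* trigger)
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Spark's df.write.json() emits one JSON document per line;
        # parse_float=Decimal because boto3 rejects Python floats
        with table.batch_writer() as batch:
            for line in body.splitlines():
                if line.strip():
                    batch.put_item(Item=json.loads(line, parse_float=Decimal))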

AWS PS

Your approach will work, but you can also write directly to DynamoDB. You just need to import a few JARs when you run PySpark. Have a look at this: https://github.com/audienceproject/spark-dynamodb

Hope this helps.
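
For reference, a rough sketch of that direct-write approach in PySpark; the format name and package coordinates come from the project's README and may vary by version, and the bucket/table names are placeholders:

from pyspark.sql import SparkSession

# the spark-dynamodb JARs have to be on the classpath, e.g. something like:
#   spark-submit --packages com.audienceproject:spark-dynamodb_2.12:<version> restore.py
spark = SparkSession.builder.getOrCreate()

# read the Parquet backup straight from S3 (placeholder path)
df = spark.read.parquet("s3://my-backup-bucket/my_table/")

# write the DataFrame directly into an existing DynamoDB table,
# skipping the intermediate JSON conversion entirely
df.write.format("dynamodb") \
    .option("tableName", "my_restored_table") \
    .save()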

Napoleon Borntoparty