
I have an S3 folder with 40,000+ JSON files, where each file has the following format:

[{"AAA": "XXXX", "BBB": "XXXX", "CCC": "XXXX"}]

My goal is to read these JSON files (all in one S3 folder), combine them into one structured table, perhaps perform some transformation on the data, and then load the result into a MySQL table. This process will probably need to run on a weekly basis.

Is there a quicker way to do ETL on this kind of data source? I would appreciate any feasible recommendation. Thanks a lot!

I tried reading each JSON file through boto3 with something like `obj.get()['Body'].read()` (in Python); however, iterating over all the files took a few hours to run.
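
Roughly, that approach was a sequential loop like the sketch below (the bucket name and prefix are placeholders):

    import json
    import boto3

    s3 = boto3.resource("s3")
    bucket = s3.Bucket("my-bucket")  # placeholder bucket name
    records = []

    # One GET request per object; with 40,000+ small files this
    # sequential loop is what takes hours.
    for obj in bucket.objects.filter(Prefix="my/json/folder/"):  # placeholder prefix
        body = obj.get()["Body"].read()
        records.extend(json.loads(body))  # each file holds a list of dicts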

asked by ahboy

2 Answers


You could use Amazon Athena to read the JSON files, perform the ETL, and output the data in CSV (or another) format. It can read multiple files, understand JSON, and perform normal SQL manipulation.
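
A minimal sketch of driving an Athena query from Python with boto3, assuming a table (here the hypothetical my_json_table in database my_db) has already been defined over the S3 prefix, for example via a Glue crawler or a CREATE EXTERNAL TABLE statement; the query results are written to S3 as CSV:

    import time
    import boto3

    athena = boto3.client("athena")

    # Start an asynchronous query; Athena writes the result set as CSV
    # to the given S3 output location.
    response = athena.start_query_execution(
        QueryString="SELECT aaa, bbb, ccc FROM my_json_table",  # hypothetical table
        QueryExecutionContext={"Database": "my_db"},            # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query finishes (SUCCEEDED / FAILED / CANCELLED).
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)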

If you are running Amazon Aurora, you could Load Data into an Amazon Aurora MySQL DB Cluster from Text Files in an Amazon S3 Bucket.
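
A rough sketch of that approach, assuming the CSV data has already been exported to an S3 prefix and the Aurora cluster has been granted read access to the bucket via an IAM role (connection details, table, and prefix below are placeholders):

    import pymysql

    # Hypothetical connection details.
    conn = pymysql.connect(host="my-aurora-endpoint", user="admin",
                           password="secret", database="my_db")

    # Aurora MySQL extension that reads CSV objects under an S3 prefix
    # directly into a table.
    load_sql = """
        LOAD DATA FROM S3 PREFIX 's3://my-bucket/exported-csv/'
        INTO TABLE my_table
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        (aaa, bbb, ccc)
    """

    with conn.cursor() as cur:
        cur.execute(load_sql)
    conn.commit()
    conn.close()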

If it is normal MySQL, you could Load S3 Data into Amazon RDS MySQL Table with AWS Data Pipeline.

answered by John Rotenstein

If (1) you are interested in getting the data into a tabular format, and (2) you are comfortable with Python, then once you read the JSON file(s) (using boto3 or aws-data-wrangler) you could do something like:

    import awswrangler as wr

    # Reads every JSON file under the prefix into one pandas DataFrame.
    path = "s3://" + your_bucket_str + "/" + your_path_str
    df = wr.s3.read_json(path=path, dataset=True)

and then put the data in your desired location (e.g. MySQL) using mysql.connector or pymysql. More details on how to accomplish this step are available here.
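
A minimal sketch of that last step with pymysql, assuming a hypothetical target table my_table with columns aaa, bbb, ccc and placeholder connection details:

    import pymysql

    # Hypothetical connection details.
    conn = pymysql.connect(host="my-mysql-host", user="admin",
                           password="secret", database="my_db")

    # Convert the DataFrame columns into plain tuples for executemany.
    rows = list(df[["AAA", "BBB", "CCC"]].itertuples(index=False, name=None))

    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO my_table (aaa, bbb, ccc) VALUES (%s, %s, %s)",
            rows,
        )
    conn.commit()
    conn.close()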

answered by xxyjoel