
I am trying to find the most efficient way to process files in AWS.

  1. Read a JSON, XML, or CSV file from an S3 bucket
  2. Map it to another JSON, XML, or CSV format
  3. Save it back to an S3 bucket

Right now we are using Java with AWS Lambdas, but we end up writing a lot of code. AWS Glue looks good, but my experience with MS BizTalk was even better.
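For illustration, here is a minimal sketch of that read-map-write pattern in a plain Lambda handler (shown in Python with boto3 for brevity; the bucket names, keys, and field mapping are placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # 1. Read the source object from S3 (placeholder bucket/key)
    obj = s3.get_object(Bucket="source-bucket", Key="input/data.json")
    records = json.loads(obj["Body"].read())

    # 2. Map each record to the target shape (placeholder field mapping)
    mapped = [{"id": r.get("customerId"), "name": r.get("fullName")} for r in records]

    # 3. Write the result back to S3 (placeholder bucket/key)
    s3.put_object(
        Bucket="target-bucket",
        Key="output/data.json",
        Body=json.dumps(mapped).encode("utf-8"),
    )
```

Every new mapping means more of this hand-written code, which is what I am trying to avoid.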

Is there any service that can help me with this?


1 Answer


There are many options within AWS for reading a file in one format and writing it out in another format to an S3 bucket. Below are some of them:

A) AWS SDK for pandas (awswrangler, formerly AWS Data Wrangler), an open-source Python library from AWS ProServe. It provides several out-of-the-box connectors for reading and writing data across various sources and sinks, and it can be run from a Lambda or from any other server where the library can be installed. This option is a good fit when data volumes are low.
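As a rough sketch of this option (the S3 paths and column mapping below are placeholders), a CSV-to-JSON conversion with awswrangler could look like this:

```python
import awswrangler as wr

# Read a CSV object from S3 into a pandas DataFrame (placeholder path)
df = wr.s3.read_csv("s3://source-bucket/input/data.csv")

# Rename/select columns to match the target schema (placeholder mapping)
df = df.rename(columns={"customerId": "id", "fullName": "name"})[["id", "name"]]

# Write the reshaped data back to S3 as JSON (placeholder path)
wr.s3.to_json(df, "s3://target-bucket/output/data.json", orient="records")
```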

B) AWS Glue, using either Spark or Python, which is a serverless data integration service. Glue Studio also provides a drag-and-drop option to generate data pipelines from many out-of-the-box transformations. One can control the processing window by choosing the number of Data Processing Units (DPUs), and Glue Workflows are available for orchestration.
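A sketch of a Glue PySpark job doing the same kind of conversion (the job argument, S3 paths, and field mappings are placeholders) might look like this:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read JSON from the source bucket (placeholder path)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://source-bucket/input/"]},
    format="json",
)

# Map source fields to the target schema (placeholder mappings)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("customerId", "string", "id", "string"),
        ("fullName", "string", "name", "string"),
    ],
)

# Write the mapped data as CSV to the target bucket (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/output/"},
    format="csv",
)
```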

C) Amazon EMR, a petabyte-scale AWS service for high-volume distributed data processing, machine learning, and interactive analytics using open-source frameworks such as Apache Spark.
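On EMR, a plain PySpark job of the same shape could be submitted with spark-submit; again, the S3 paths and column mapping below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-json").getOrCreate()

# Read CSV from the source bucket (placeholder path)
df = spark.read.csv("s3://source-bucket/input/", header=True)

# Rename columns to the target schema (placeholder mapping)
mapped = df.selectExpr("customerId as id", "fullName as name")

# Write the result to the target bucket as JSON (placeholder path)
mapped.write.mode("overwrite").json("s3://target-bucket/output/")
```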

Which option to choose depends on the use cases you are trying to solve and your requirements. Factors such as the volume of data, the processing window, low-code/no-code options, and cost will help decide which option to leverage.
