
I have a data analytics requirement on AWS. I have limited knowledge of Big Data processing, but based on my analysis I have figured out some options.

The requirement is to collect data by calling a Provider API every 30 minutes (data ingestion). The data is mainly structured. This data needs to be stored in a storage layer (S3 data lake or Redshift, not sure), and various aggregations/dimensions of this data are to be provided through a REST API. There is a future requirement to run ML algorithms on the original data, so the storage needs to be chosen accordingly. Based on this, can you suggest:

  1. How to ingest the data (a Lambda run at a scheduled interval that pulls the data and stores it, or is there a better way to pull data in AWS)?
  2. How to store it (S3 or Redshift)?
  3. Data analytics (currently some monthly and weekly aggregations): what tools can be used? What tools should I use if I am storing the data in S3?
  4. How to expose the analytics results through an API (I hope I can use Lambda to query the analytics engine from the previous step)?
Sudheer Kumar

1 Answer


Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
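A minimal sketch of this pattern, assuming an EventBridge schedule rule (e.g. `rate(30 minutes)`) that invokes the function; the provider endpoint, bucket name, and key layout below are placeholders, not values from the question:

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder values -- replace with the real provider endpoint and bucket.
PROVIDER_URL = "https://api.example.com/data"
BUCKET = "my-ingest-bucket"

def build_object_key(now: datetime) -> str:
    """Date-partitioned S3 key so Athena/Glue can later prune by date."""
    return (
        f"raw/year={now.year:04d}/month={now.month:02d}/"
        f"day={now.day:02d}/{now.strftime('%H%M%S')}.json"
    )

def lambda_handler(event, context):
    # Invoked every 30 minutes by an EventBridge (CloudWatch Events) rule.
    with urllib.request.urlopen(PROVIDER_URL, timeout=30) as resp:
        payload = resp.read()

    # boto3 is available in the Lambda runtime; imported inside the handler
    # so the key-building helper can be tested without AWS installed.
    import boto3
    s3 = boto3.client("s3")
    key = build_object_key(datetime.now(timezone.utc))
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload,
                  ContentType="application/json")
    return {"stored": key}
```

Storing each pull as its own dated object keeps the raw data immutable, which also suits the future ML requirement.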

However, the answers to your other questions all depend upon how you are going to use the data; work backwards from that.

For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.

If you are going to provide an API, then you will need to consider how the API code (eg using AWS API Gateway with Lambda) will retrieve the data. For example, is it identical to the blob of data originally retrieved, or does it require complex transformations, or perhaps combining data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
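As a sketch of the API side, here is a Lambda handler in the shape API Gateway's Lambda proxy integration expects; the `period`/`metric` parameter names are illustrative, and the lookup is a stub standing in for whatever analytics backend is chosen:

```python
import json

def query_aggregates(period: str, metric: str):
    """Stub standing in for the real analytics backend (Athena, Redshift,
    or a relational store). Replace with a real query in practice."""
    sample = {
        ("monthly", "orders"): [{"month": "2021-06", "orders": 1234}],
        ("weekly", "orders"): [{"week": "2021-W24", "orders": 310}],
    }
    return sample.get((period, metric), [])

def lambda_handler(event, context):
    # API Gateway (Lambda proxy integration) delivers query-string
    # parameters under this key, or None when there are none.
    params = event.get("queryStringParameters") or {}
    period = params.get("period", "monthly")
    metric = params.get("metric", "orders")
    rows = query_aggregates(period, metric)
    if not rows:
        return {"statusCode": 404,
                "body": json.dumps({"error": "no data for request"})}
    return {"statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(rows)}
```

If the stubbed lookup is backed by Athena, note that Athena queries take seconds, not milliseconds, so a fast API usually needs pre-computed aggregates in a low-latency store.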

Data analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
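To make the Athena option concrete, here is a sketch of running a monthly aggregation over data in S3 via boto3; the table, database, column names, and output location are all hypothetical, while `start_query_execution` is the real Athena API call:

```python
def monthly_aggregation_sql(table: str) -> str:
    """Monthly roll-up over raw events; column names are illustrative."""
    return (
        f"SELECT date_trunc('month', event_time) AS month, "
        f"count(*) AS events, sum(amount) AS total_amount "
        f"FROM {table} GROUP BY 1 ORDER BY 1"
    )

def run_athena_query(sql: str, database: str, output_s3: str) -> str:
    # boto3 imported lazily so the SQL builder can be tested offline.
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    # Athena is asynchronous: poll get_query_execution with this id,
    # then read results with get_query_results or from output_s3.
    return resp["QueryExecutionId"]
```

Storing the raw objects in a columnar format such as Parquet (eg via a Glue job) makes these scans considerably cheaper and faster than raw JSON.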

Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.

John Rotenstein
  • Thanks John for the idea validation and suggestions for an AWS-native solution. I am looking into other options too, like Databricks for analytics, which can serve the APIs too. Will update in future on how it goes... – Sudheer Kumar Jun 17 '21 at 05:00
  • Even Databricks doesn't provide good options to expose the data tables through a REST API. I found some good native solutions using (Glue Catalog + Athena) and (Parquet + Redshift Spectrum), but even these are not fast, scalable solutions. I guess I have to go with an OLTP solution (MySQL etc.) and, for ML, bring the data back to S3 or so. If anybody is aware of any FAST, SCALABLE Data API solutions, please enlighten. – Sudheer Kumar Jun 26 '21 at 05:42