1

I have looked into this post on S3 vs. a database, but I have a different use case and want to know whether S3 alone is enough. The primary reason for using S3 instead of a cloud database is cost.

I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert the data into MongoDB. I then run analyses by querying for a specific date, specific fields, or records that match certain criteria. After querying, I usually load the results into a dataframe and do what is needed.

The data will not be updated. They need to be stored and ready for retrieval according to some criteria. I am aware of S3 Select which may be able to do the retrieval task.

Any recommendations?

JOHN
  • Sorry, but what are you asking that is different to the other [question](https://stackoverflow.com/q/56108144/174777) you linked? – John Rotenstein Jan 07 '20 at 06:24
  • Indeed, you may use S3 Select or Athena to process data stored in S3, or DynamoDB (feasibility depends on what you store and how you process the data). But what is the question? – gusto2 Jan 07 '20 at 06:53

2 Answers

1

Given the use cases you have described, it seems that you are not making much use of MongoDB's capabilities (or any database's capabilities, for that matter).

I think S3 suits your use case well. In fact, to be cost efficient, you should go for S3 Infrequent Access with a lifecycle policy that archives the data and eventually purges it.
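As a rough sketch of what such a lifecycle policy could look like, the helper below builds a configuration dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The function name, the rule ID, the `raw/` prefix, the day thresholds, and the bucket name in the comment are all assumptions for illustration, not values from the question.

```python
import json


def build_lifecycle_configuration(prefix="", ia_after_days=30,
                                  archive_after_days=90, expire_after_days=365):
    """Build a lifecycle configuration dict (hypothetical helper).

    Objects under `prefix` move to Infrequent Access, then to Glacier,
    and are finally deleted. All thresholds are illustrative defaults.
    """
    if not ia_after_days < archive_after_days < expire_after_days:
        raise ValueError("thresholds must be strictly increasing")
    return {
        "Rules": [
            {
                "ID": "scraper-data-tiering",  # assumed rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": archive_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_configuration(prefix="raw/"), indent=2))
    # Applying it requires boto3 and AWS credentials; the bucket name is made up:
    # import boto3
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="my-scraper-bucket",
    #     LifecycleConfiguration=build_lifecycle_configuration(prefix="raw/"),
    # )
```

Whether to archive or purge at all depends on how far back your analyses reach, so treat the thresholds as placeholders.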

I hope this helps!

Red Boy
  • Just another add-on question. Should I put everything in the same bucket or separate contents from different sources (e.g. website x, website y) into separate buckets? – JOHN Jan 09 '20 at 00:04
  • @JOHN Fundamentally, `S3` uses a flat structure, so `Folders` are more of a facade that lets users think in terms of the typical directory concept found on other storage devices. If it suits your use case, you could try partitions in `Athena`, which are based on the `folder` concept. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html http://docs.aws.amazon.com/athena/latest/ug/partitions.html – Red Boy Jan 09 '20 at 06:27
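To illustrate the folder and partition idea from the comment above, here is a minimal sketch of a key naming scheme that keeps every source in one bucket and separates sources and dates by prefix. The `source=.../dt=.../` layout follows the Hive-style `key=value` convention that Athena can use for partitions; the function names, the `source` and `dt` partition names, and the `data.json` filename are assumptions for illustration.

```python
from datetime import date


def partitioned_key(source, day, filename="data.json"):
    """Return an object key like source=website_x/dt=2020-01-09/data.json."""
    if "/" in source or "=" in source:
        raise ValueError("source must not contain '/' or '='")
    return f"source={source}/dt={day.isoformat()}/{filename}"


def prefix_for(source, day=None):
    """Return the key prefix to list for one source, optionally one day."""
    prefix = f"source={source}/"
    if day is not None:
        prefix += f"dt={day.isoformat()}/"
    return prefix


if __name__ == "__main__":
    key = partitioned_key("website_x", date(2020, 1, 9))
    print(key)
    print(prefix_for("website_x", date(2020, 1, 9)))
```

With keys shaped like this, fetching one source for one date becomes a prefix listing rather than a scan of the whole bucket, so separate buckets per source are not needed just to keep the data apart.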
0

I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 every time and iterate through it. With DynamoDB, you can easily query and filter for the data you need. In the end, S3 is file storage and DynamoDB is a database.
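For a sense of what "retrieve the file and iterate through it" means in practice, here is a minimal sketch of client-side filtering. It assumes the scraped data is stored as JSON Lines (one JSON object per line), which the question does not state; the function name and the sample records are made up. The function takes any iterable of byte lines, so it runs without AWS access.

```python
import json


def iter_matching_records(lines, predicate):
    """Yield parsed JSON records from byte lines that satisfy `predicate`."""
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if not text:
            continue  # skip blank lines
        record = json.loads(text)
        if predicate(record):
            yield record


if __name__ == "__main__":
    # Stand-in for a downloaded object. With boto3, the lines could come from
    # get_object(...)["Body"].iter_lines() instead (bucket and key not shown).
    sample = [
        b'{"source": "website_x", "price": 5}',
        b"",
        b'{"source": "website_y", "price": 20}',
    ]
    for rec in iter_matching_records(sample, lambda r: r["price"] > 10):
        print(rec)
```

The extra code is small, but every query reads the whole object, so the cost grows with file size rather than with the number of matching records.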