
This is kind of a "big data with Amazon Web Services" question: consider a massive set of txt files (all with the same content format inside: [title;body;author]). I want to store them in AWS and be able to search for a substring across the whole set. What would be a good approach for doing so? I would also appreciate advice on how to store all this data in something other than txt files (we are talking about articles with a title, body, and author).

Thank you.

L RodMrez
  • I am not a big data expert, but you might be looking for EMR (Elastic Map Reduce) in AWS. Checkout its documentation [here](https://aws.amazon.com/emr/) – Rhythem Aggarwal Aug 01 '18 at 11:29
  • If it is text only, then why not store and search with AWS Elasticsearch https://aws.amazon.com/elasticsearch-service/ – Kush Vyas Aug 01 '18 at 11:37
  • How big is your data size (GB, TB, PB)? How fast are you expecting the results (sub-second, seconds, minutes)? How many concurrent queries (one, thousands)? What are your technical skill bases (SQL, languages, Hadoop)? – John Rotenstein Aug 01 '18 at 11:53
  • I would suggest at least converting said file into Parquet or ORC format, then Athena queries will be fast. Otherwise, stick the data in RDS or Redshift – OneCricketeer Aug 06 '18 at 12:37
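Whichever backend is chosen (Elasticsearch for full-text search, or Parquet on S3 queried via Athena), the txt files first need to be parsed into structured records. A minimal Python sketch, assuming one `title;body;author` record per line and that fields contain no embedded semicolons (both assumptions are illustrative, not from the question):

```python
import csv
import io

def parse_articles(text):
    """Parse lines of the form 'title;body;author' into dicts.

    Assumes one record per line with exactly three ';'-separated
    fields; real data may need quoting rules or a stricter schema.
    """
    reader = csv.reader(io.StringIO(text), delimiter=';')
    return [
        {"title": title, "body": body, "author": author}
        for title, body, author in reader
    ]

# Hypothetical sample input mimicking the described format
sample = "First Article;Some body text;Jane Doe\nSecond;More text;John"
records = parse_articles(sample)
# Each dict in `records` can then be bulk-indexed into Elasticsearch
# or written out as Parquet (e.g. with pyarrow) for Athena to query.
```

The same parsing step feeds either path: Elasticsearch indexes the `body` field for substring/full-text search, while the Parquet/Athena route stores the columns for SQL-style queries.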

0 Answers