
I have an Oracle database with around 30 tables. I want to dump the data from these tables for a specific time period into an EMR cluster and run a Hive query that I have on that data. I would like to use Spark and AWS EMR for this. This will be a scheduled job that needs to run every 4 hours. The amount of data fetched will be on the order of a few hundred records (every 4 hours). How can I fetch data from Oracle and run a Hive query on the data?
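Something like the PySpark sketch below is what I had in mind; the JDBC URL, credentials, table/column names, and S3 path are all placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oracle-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Push the 4-hour window down to Oracle so only the new rows are
# transferred; table and column names are placeholders.
src = "(SELECT * FROM my_table WHERE updated_at >= SYSDATE - 4/24) t"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")  # placeholder
      .option("dbtable", src)
      .option("user", "scott")        # placeholder credentials
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

# Expose the rows to Hive SQL and run the existing query against them.
df.createOrReplaceTempView("my_table_recent")
result = spark.sql("SELECT col_a, COUNT(*) FROM my_table_recent GROUP BY col_a")
result.write.mode("append").csv("s3://my-bucket/exports/")  # placeholder path
```

The Oracle JDBC driver jar would have to be on the cluster's classpath, e.g. via `spark-submit --jars ojdbc8.jar`.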

Punter Vicky

1 Answer


I would add a comment, but I don't have enough points, so I'll write it here.

If I understood you correctly, you want to fetch +/- 100 rows from Oracle every 4 hours, right? If so, why do you need Spark or Hive for that? Couldn't you simply create a view directly in Oracle over these 100 rows every 4 hours and query it directly? The point is that if the data fits on a single machine and is not expected to grow quickly, you don't need any distributed solution.

Bartosz Konieczny
  • Thanks for your response. It is about 400-500 rows each in 15 tables and about 0-100 in the other 15. I need to fetch the data, do some transformation (tokenizing NPI data), create a CSV file, and push those files to an S3 bucket. – Punter Vicky May 17 '18 at 18:54
  • Then I think you should be able to do that standalone (see the sketch after these comments). Unless you expect to have to deal with 1000x or more rows in the future, building a cluster just to fetch at worst 9000 rows is, IMO, a little bit overkill. You can write a simple Python/Scala program, use the built-in functions (map, filter...) to transform, one library to generate a CSV, and another to push to S3 (the AWS SDK fits well for that). – Bartosz Konieczny May 18 '18 at 03:58
  • Thanks @bartosz25 – Punter Vicky May 18 '18 at 03:59
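
For what it's worth, here is a minimal sketch of the standalone approach suggested above, assuming the python-oracledb and boto3 packages; the connection details, table/column names, and bucket are hypothetical, and SHA-256 hashing stands in for whatever NPI tokenization scheme is actually required:

```python
import csv
import hashlib
import io

import boto3
import oracledb

# Hypothetical connection details.
conn = oracledb.connect(user="scott", password="tiger",
                        dsn="db-host:1521/ORCLPDB")
cur = conn.cursor()
# Hypothetical table/columns; SYSDATE - 4/24 covers the last 4 hours.
cur.execute("""SELECT npi, col_a, col_b
               FROM my_table
               WHERE updated_at >= SYSDATE - 4/24""")

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([d[0] for d in cur.description])  # header row
for npi, col_a, col_b in cur:
    # Stand-in tokenization: replace the NPI with a SHA-256 digest.
    token = hashlib.sha256(str(npi).encode()).hexdigest()
    writer.writerow([token, col_a, col_b])

# Hypothetical bucket and key.
boto3.client("s3").put_object(Bucket="my-bucket",
                              Key="exports/batch.csv",
                              Body=buf.getvalue())
cur.close()
conn.close()
```

Running it every 4 hours can then be a plain cron entry rather than a cluster job.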