
I have an Oracle database with around 30 tables. I want to dump the data from these tables for a specific time period into an EMR cluster and run a Hive query that I have on that data. I would like to use Spark and AWS EMR for this. This will be a scheduled job that needs to run every 4 hours. The amount of data fetched will be on the order of a few hundred records (every 4 hours). How can I fetch data from Oracle and run a Hive query on the data?
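Something like the PySpark sketch below is what I had in mind; the JDBC URL, credentials, table/column names, and S3 path are all placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oracle-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Push the 4-hour window down to Oracle so only the new rows are
# transferred; table and column names are placeholders.
src = "(SELECT * FROM my_table WHERE updated_at >= SYSDATE - 4/24) t"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")  # placeholder
      .option("dbtable", src)
      .option("user", "scott")        # placeholder credentials
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

# Expose the rows to Hive SQL and run the existing query against them.
df.createOrReplaceTempView("my_table_recent")
result = spark.sql("SELECT col_a, COUNT(*) FROM my_table_recent GROUP BY col_a")
result.write.mode("append").csv("s3://my-bucket/exports/")  # placeholder path
```

The Oracle JDBC driver jar would have to be on the cluster's classpath, e.g. via `spark-submit --jars ojdbc8.jar`.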

Punter Vicky

1 Answer


I would add a comment, but I don't have enough points, so I'll write it here.

If I understood you correctly, you want to fetch +/- 100 rows from Oracle every 4 hours, right? If so, why do you need Spark or Hive for that? Couldn't you simply create a view directly in Oracle over these 100 rows every 4 hours and query it directly? The point is that if the data fits on a single machine and is not expected to grow quickly, you don't need any distributed solution.

Bartosz Konieczny
  • Thanks for your response. It is about 400-500 rows each in 15 tables and about 0-100 in the other 15. I need to fetch the data, do some transformation (tokenizing NPI data), create a CSV file, and push those files to an S3 bucket. – Punter Vicky May 17 '18 at 18:54
  • Then I think you should be able to do that standalone (see the sketch after these comments). Unless you expect to have to deal with 1000x or more rows in the future, building a cluster just to fetch at worst 9000 rows is, IMO, a little bit overkill. You can write a simple Python/Scala program, use the built-in functions (map, filter...) to transform, one library to generate a CSV, and another to push to S3 (the AWS SDK fits well for that). – Bartosz Konieczny May 18 '18 at 03:58
  • Thanks @bartosz25 – Punter Vicky May 18 '18 at 03:59
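
For what it's worth, here is a minimal sketch of the standalone approach suggested above, assuming the python-oracledb and boto3 packages; the connection details, table/column names, and bucket are hypothetical, and SHA-256 hashing stands in for whatever NPI tokenization scheme is actually required:

```python
import csv
import hashlib
import io

import boto3
import oracledb

# Hypothetical connection details.
conn = oracledb.connect(user="scott", password="tiger",
                        dsn="db-host:1521/ORCLPDB")
cur = conn.cursor()
# Hypothetical table/columns; SYSDATE - 4/24 covers the last 4 hours.
cur.execute("""SELECT npi, col_a, col_b
               FROM my_table
               WHERE updated_at >= SYSDATE - 4/24""")

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([d[0] for d in cur.description])  # header row
for npi, col_a, col_b in cur:
    # Stand-in tokenization: replace the NPI with a SHA-256 digest.
    token = hashlib.sha256(str(npi).encode()).hexdigest()
    writer.writerow([token, col_a, col_b])

# Hypothetical bucket and key.
boto3.client("s3").put_object(Bucket="my-bucket",
                              Key="exports/batch.csv",
                              Body=buf.getvalue())
cur.close()
conn.close()
```

Running it every 4 hours can then be a plain cron entry rather than a cluster job.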