9

What are the options to index large data from Oracle DB to elastic search cluster? Requirement is to index 300Million records one time into multiple indexes and also incremental updates having around approximate 1 Million changes every day.

I have tried JDBC plugin for elasticsearch river/feeder, both seems to be running inside or require locally running elastic search instance. Please let me know if there is any better option for running elastic search indexer as a standalone job (probably java based). Any suggestions will be very helpful. Thanks.

Alfabravo
  • 7,493
  • 6
  • 46
  • 82

2 Answers2

10

We use ES as a reporting db and when new records are written to SQL we take the following action to get them into ES:

  1. Write the primary key into a queue (we use rabbitMQ)
  2. Rabbit picks up the primary key (when it has time) and queries the relation DB to get the info it needs and then writes the data into ES

This process works great because it handles both new data and old data. For old data just write a quick script to write 300M primary keys into rabbit and you're done!

jhilden
  • 12,207
  • 5
  • 53
  • 76
  • 1
    This is much better than JDBC river. I had to make some changes to DB data before submitting it , and tried using bulk http api but River is much more powerful. I finally settled on https://github.com/elasticsearch/elasticsearch-river-rabbitmq/ – Shrey Dec 17 '14 at 05:44
  • after getting primary key from queue you wrote a program query DB and then use ES Rest indexing API to put in to ES? OR Rabbitmq can directly do all these?
    I am trying to avoid Rivers as it will be deprecated soon as per ES team. Trying for a standalone indexer with minimal infra requirements and failure points.
    –  Dec 17 '14 at 17:28
  • If you have more than ~500K documents, do not use Bulk rest of ES. I have tested it using nginx as well, but ES handles bulk requests very poorly on concurrent HTTP connections. RabbitMQ river interprets bulk requests and puts them in ES without any hiccups.My nodejs prog picks data from DB and puts in RmQ using amqplib. – Shrey Dec 18 '14 at 06:30
5

there are many integration options - I've listed out a few to give you some ideas, the solution is really going to depend on your specific resources and requirements though.

  1. Oracle Golden Gate will look at the Oracle DB transaction logs and feed them in real-time to ES.
  2. ETL for example Oracle Data Integrator could run on a schedule and pull data from your DB, transform it and send to ES.
  3. Create triggers in the Oracle DB so that data updates can be written to ES using a stored procedure. Or use the trigger to write flags to a "changes" table that some external process (e.g. a Java application) monitors and uses to extract data from the Oracle DB.
  4. Get the application that writes to the Oracle DB to also feed ES. Ideally your application and Oracle DB should be loosely coupled - do you have an integration platform that can feed the messages to both ES and Oracle?
Olly Cruickshank
  • 6,120
  • 3
  • 33
  • 30
  • thank you very much for the new ideas. But in my case i do have very little chance to work on the GG or ETL tool, as i am not DB guy and not much customisation is allowed here.Triggers are not allowed because of some reasons. can you please guide on the java/FE based solution? –  Dec 15 '14 at 16:39
  • If you can't change the database and won't invest in a tool then maybe you can write a Java app that looks for DB changes via a timestamp (to tell new records from old) - but this would be dependent on this info being captured. Otherwise you need to look at how data gets into the Oracle DB and see if you can leverage this. – Olly Cruickshank Dec 15 '14 at 17:25
  • Okay! Noticed one more plugin spring-data-elasticsearch plugin in ES site. But need to validate how it works, the total time it takes and performance stuff. –  Dec 16 '14 at 10:02
  • Still not ended up in good plugin/option, writing a my own java app can be done. But i am looking for available tools/plugins, so that i can avoid coding& then optimizing the indexer and focus only on transformation and BL. –  Dec 16 '14 at 10:11