
Here is what we came up with, using a three-value status column.

0 = Not indexed
1 = Indexed
2 = Updated

There will be 2 jobs...

Job 1 will select the top X records where status = 0 and pop them onto a queue like RabbitMQ. A consumer will then bulk insert those records into ES and update the status of the DB records to 1.

For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and pop them onto RabbitMQ. A consumer will then bulk insert those records into ES and update the status of the DB records back to 1.

Of course we may need an intermediate "queued" status so that neither job picks up the same record again, and the same job should not run again if its previous run hasn't completed. The chances of a queued record being updated are slim to none, since updates usually only happen at end of day or the next day.
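
Here is a rough sketch in Python of what those two jobs and the consumer could look like. The table name, column names, queue name, connection string, and the numeric value for the "queued" status are all made up for illustration; it assumes pyodbc, pika, and the official elasticsearch client.

```python
# Rough sketch of the status-driven jobs described above. Products table,
# column names, queue name, connection string, and QUEUED value are
# illustrative assumptions, not part of the original design.
import json

import pika
import pyodbc
from elasticsearch import Elasticsearch, helpers

NOT_INDEXED, INDEXED, UPDATED, QUEUED = 0, 1, 2, 3  # 3 = intermediate "queued" status

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes"


def run_job(status_to_pick, batch_size=500):
    """Job 1 picks status = 0 (new rows); Job 2 picks status = 2 (updated rows)."""
    db = pyodbc.connect(CONN_STR)
    cur = db.cursor()
    cur.execute("SELECT TOP (?) Id, Payload FROM Products WHERE Status = ?",
                batch_size, status_to_pick)
    rows = cur.fetchall()
    if not rows:
        return

    mq = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = mq.channel()
    channel.queue_declare(queue="es-index", durable=True)
    for row_id, payload in rows:
        # publish the record and immediately mark it as queued so it is not picked up twice
        channel.basic_publish(exchange="", routing_key="es-index",
                              body=json.dumps({"id": row_id, "doc": json.loads(payload)}))
        cur.execute("UPDATE Products SET Status = ? WHERE Id = ?", QUEUED, row_id)
    db.commit()
    mq.close()


def consume_batch(messages):
    """Consumer side: bulk index a batch of queued messages, then mark the rows as indexed."""
    es = Elasticsearch("http://localhost:9200")
    helpers.bulk(es, ({"_index": "products", "_id": m["id"], "_source": m["doc"]}
                      for m in messages))

    db = pyodbc.connect(CONN_STR)
    cur = db.cursor()
    for m in messages:
        cur.execute("UPDATE Products SET Status = ? WHERE Id = ?", INDEXED, m["id"])
    db.commit()
```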

So I know there are rivers (but they are being deprecated and probably aren't as flexible as ETL).

I would like to bulk insert records from my SQL Server into Elasticsearch.

I would write a scheduled batch job of some sort, with ETL or any other tool; it doesn't matter which.

Something like SELECT * FROM table WHERE id > lastIdInsertedToElasticSearch would allow the latest records to be loaded into Elasticsearch at a scheduled interval.
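
A minimal sketch of that scheduled job, assuming a hypothetical Products table and products index, using pyodbc and the elasticsearch bulk helper:

```python
# Minimal sketch of the "id > last indexed id" batch job. Table, column, and
# index names are assumptions, not from the original question.
import pyodbc
from elasticsearch import Elasticsearch, helpers


def sync_new_rows(last_id):
    db = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes")
    cur = db.cursor()
    cur.execute("SELECT Id, Name, Price FROM Products WHERE Id > ? ORDER BY Id", last_id)
    rows = cur.fetchall()
    if not rows:
        return last_id

    es = Elasticsearch("http://localhost:9200")
    helpers.bulk(es, ({"_index": "products", "_id": r.Id,
                       "_source": {"name": r.Name, "price": float(r.Price)}} for r in rows))
    return rows[-1].Id  # persist this watermark somewhere durable between runs
```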

But what if a record is updated in SQL Server? What would be a good pattern to track updated records in SQL Server and then push the updated records to ES? I know ES has document versions when putting the same Id, but I can't seem to visualize a pattern.
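
On the document-versions point: indexing with an existing Id overwrites the previous document and increments _version. A tiny illustration with a made-up index and id, using the Python client:

```python
# Indexing twice with the same id: the second call replaces the document
# and increments _version (index name and id here are made up).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
r1 = es.index(index="products", id=42, document={"name": "widget", "price": 9.99})
r2 = es.index(index="products", id=42, document={"name": "widget", "price": 8.99})
print(r1["_version"], r2["_version"])  # 1, then 2
```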

user432024
  • Doesn't this seem too complex to you? What advantage are you particularly getting out of this solution over just using a queue for writing/updating and letting the application push into the queue? I ask because it seems you will have to hack together some sort of lock to achieve this, as you have already mentioned that you will have to maintain a status for "queued". And honestly, the only performance gain that your application will see in this solution is not writing to RabbitMQ, which will be negligible if your application and Rabbit are in the same datacenter. – vaidik Jul 05 '14 at 07:43
  • So I suggest that you pick simplicity over this minute little performance gain because it will just make your life difficult when something goes wrong and make it tough to debug your application and consumer logic. – vaidik Jul 05 '14 at 07:45
  • The status column is not for performance. It's to know which records from the DB have already been inserted into Elasticsearch and which records have been updated in the DB, so that the batch job can update Elasticsearch. – user432024 Jul 07 '14 at 13:30

2 Answers

2

So IMHO, batch inserts are good for building or re-building the index. For the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.

If the entries in your SQL data store are created by you (i.e. by some codebase in your control), it would be better for that same codebase to update the documents in Elasticsearch as well, maybe not directly but by notifying some other service or with the help of queues, so as not to waste time while responding to requests (if that's the kind of setup you have).

We have a pretty similar use case for Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data into RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. The corresponding rows in our SQL data store and documents in Elasticsearch share the same ID, which enables us to update the document.
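
A minimal sketch of what such a consumer could look like (queue name, message shape, and index name here are just illustrative, not details of our actual setup), using pika and the Elasticsearch Python client:

```python
# Sketch of a RabbitMQ consumer that replaces whole documents in Elasticsearch.
# Queue name, message shape ({"id": ..., "doc": {...}}), and index name are
# illustrative assumptions.
import json

import pika
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def on_message(channel, method, properties, body):
    msg = json.loads(body)
    # index() with an existing id overwrites the previous version of the document
    es.index(index="app-search", id=msg["id"], document=msg["doc"])
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="search-updates", durable=True)
channel.basic_consume(queue="search-updates", on_message_callback=on_message)
channel.start_consuming()
```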

Another case is where a few of the data types we search over come from a 3rd-party service that exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of it. We have managed to tune the cron's schedule because we also have a limited quota of API queries. But in this case, our data is not really updated that much per day, so this kind of system works for us.

vaidik
  • I'll up-vote your answer because no two systems are alike, but there are always similarities! It's good to know how different systems behave. I like the RabbitMQ idea. And I will update my question with what I came up with. – user432024 Jul 04 '14 at 16:39
1

Disclaimer: I co-developed this solution.

I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.

Here are a few of the features:

  • It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec)
  • It can join many tables to create complex nested arrays of documents without creating duplicate child documents
  • It follows a lot of the same conventions as the jdbc-river.
  • It also supports reading from files.
  • It's written in C#
  • It uses Quartz.Net and supports cron expressions for scheduling.

This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use RabbitMQ with the river for better performance and stability when indexing into Elasticsearch.

To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
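
As a hedged illustration only (the stored procedure name, its reset parameter, and the column names below are hypothetical and not taken from river-net), a delta sync driven by such a procedure might look like this in Python:

```python
# Illustrative only: dbo.GetProductDeltas, its @Reset parameter, and the column
# names are hypothetical. The proc is assumed to keep its own high-water mark
# (e.g. a rowversion/timestamp) and return only rows that changed since the last run.
import pyodbc
from elasticsearch import Elasticsearch, helpers


def sync_deltas(reset=False):
    db = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes")
    cur = db.cursor()
    cur.execute("EXEC dbo.GetProductDeltas @Reset = ?", int(reset))
    rows = cur.fetchall()

    es = Elasticsearch("http://localhost:9200")
    helpers.bulk(es, ({"_index": "products", "_id": r.Id,
                       "_source": {"name": r.Name, "price": float(r.Price)}} for r in rows))
```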

The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

Josh C.
  • But the thing with the river is that it selects all records from the table and then produces the deltas? What happens if you have millions of rows? – user432024 Jul 04 '14 at 17:48
  • @user432024 we aren't hitting tables directly. Instead we are hitting stored procedures, and yes, we are only grabbing deltas. We also have an option on the sp to reset the delta. – Josh C. Jul 04 '14 at 18:23
  • I mean that the ES river does something like select * from table. To me this means it returns every single record, so if you have a large amount of records, like 5 million, it ain't the most efficient thing to do. – user432024 Jul 07 '14 at 13:32
  • @user432024 we are maintaining a billion records in elasticsearch with river-net. We are using stored procedures instead of directly selecting from the table, and in each stored procedure, we are making sure we only select what has changed. – Josh C. Jul 07 '14 at 14:14
  • Ah, that makes more sense... Ok, so how do you track new records in SQL Server and only push those to ES, and how do you track updated records in the SQL DB and push those to ES? Basically, how do you check for deltas? – user432024 Jul 07 '14 at 14:27
  • @user432024 depending on the version of SQL Server that answer will vary. You probably should ask that in a separate SO question. – Josh C. Jul 07 '14 at 14:29
  • Humor me though! Just curious! :) Maybe will give me some ideas of what to look for... – user432024 Jul 07 '14 at 15:26
  • timestamps, triggers, CDC (http://msdn.microsoft.com/en-us/library/bb933994.aspx)... – Josh C. Jul 07 '14 at 15:37