1

I am using the DataImportHandler to index data in Solr. I used full-import to index all the data in my database, which is around 10,000 products. Now I am confused about how delta-import is used. Does it index new data added to the database at intervals (say, the roughly 10 rows newly added to my table), or does it only update changes to data that is already indexed?

Can anyone please explain it to me with a simple example, as soon as you can?

Winston Chen
Ram

3 Answers

4

The DataImportHandler (DIH) can be a little daunting. Your initial query loaded 10,000 unique products; this happens when you call /dataimport?command=full-import. When the import is done, the DIH stores a variable (${dataimporter.last_index_time}) that holds the date and time of this last import.

To do an update, you specify a deltaQuery. The deltaQuery identifies the records that have changed in your database since the last import, so you specify a query like this: SELECT product_id FROM sometable WHERE [date_update] >= '${dataimporter.last_index_time}'. This retrieves the product_ids of all rows updated since your last import. The next query you need to specify, the deltaImportQuery, retrieves the full record for each product_id returned by the previous step.

Assuming product_id is your unique key, Solr will figure out whether it needs to update an existing record or add a new one if the product_id doesn't exist yet.
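To make the two-step logic concrete, here is a minimal sketch in plain Python. It does not use Solr at all: an in-memory SQLite table stands in for the product database and a dict stands in for the index, and the column `name` and the sample rows are made up for illustration. It assumes `date_update` is set both when a row is inserted and when it is changed.

```python
import sqlite3

# Hypothetical stand-ins: an in-memory SQLite table for the product database
# and a plain dict for the Solr index, keyed by the unique key (product_id).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sometable (product_id INTEGER PRIMARY KEY, name TEXT, date_update TEXT)"
)
db.executemany(
    "INSERT INTO sometable VALUES (?, ?, ?)",
    [(1, "lamp", "2012-01-01 10:00:00"), (2, "desk", "2012-01-01 10:00:00")],
)

index = {}


def full_import():
    """Load every row, roughly what command=full-import does."""
    index.clear()
    for pid, name in db.execute("SELECT product_id, name FROM sometable"):
        index[pid] = name


def delta_import(last_index_time):
    """Re-index only rows changed since last_index_time."""
    # Step 1 (the deltaQuery): collect the keys of new or changed rows.
    changed = [
        row[0]
        for row in db.execute(
            "SELECT product_id FROM sometable WHERE date_update >= ? ORDER BY product_id",
            (last_index_time,),
        )
    ]
    # Step 2 (the deltaImportQuery): fetch the full record per key and upsert it.
    for pid in changed:
        (name,) = db.execute(
            "SELECT name FROM sometable WHERE product_id = ?", (pid,)
        ).fetchone()
        index[pid] = name  # overwrites an existing entry or adds a new one
    return changed


full_import()
last_index_time = "2012-01-01 12:00:00"  # remembered after the full import

# Later: one existing row is edited and one brand-new row is added.
db.execute(
    "UPDATE sometable SET name = 'oak desk', date_update = '2012-01-02 09:00:00' "
    "WHERE product_id = 2"
)
db.execute("INSERT INTO sometable VALUES (3, 'chair', '2012-01-02 09:30:00')")

changed_ids = delta_import(last_index_time)
```

In this sketch the delta step picks up both the edited row and the newly added row, while the untouched row is not re-read.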

To execute the deltaQuery and the deltaImportQuery, you call /dataimport?command=delta-import.
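As a small illustration, the request URL could be built like this. The host, port, and core name (`products`) are assumptions for the example; only the `/dataimport` path and the `command` parameter come from the answer.

```python
from urllib.parse import urlencode


def dih_url(base, command):
    """Build a DataImportHandler request URL for the given command."""
    return f"{base}/dataimport?{urlencode({'command': command})}"


# Hypothetical Solr location; adjust to your own installation.
delta_url = dih_url("http://localhost:8983/solr/products", "delta-import")

# To actually send the request you could use, for example:
#   import urllib.request
#   urllib.request.urlopen(delta_url)
```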

This is a great simplification of all the possibilities. Check the Solr wiki page on the DataImportHandler; it is a VERY powerful tool!

ReDeVries
  • Is [date_update] a timestamp stored in the database? If so, can't this create issues when the clock of the database server is not exactly in sync with the server on which SOLR is installed? – mrd3650 Dec 27 '11 at 09:57
  • date_update is indeed a database timestamp. What happens, is that this exact date is stored on your solr server, and is used for a subsequent call. No problems with the sync, the database timestamp drives the process. – ReDeVries Feb 14 '12 at 23:04
  • Ok, but then '${dataimporter.last_index_time}' must be set to the database timestamp no? However from my understanding, it is SOLR itself which sets the '${dataimporter.last_index_time}' variable when indexing finishes. So is there a way to set '${dataimporter.last_index_time}' manually to reflect database time? – mrd3650 Feb 15 '12 at 08:18
  • The actual data is contained in a file that lives in your solr/config directory. It is a text file, called dataimport.properties – ReDeVries Feb 15 '12 at 13:22
  • Yes, in fact I am aware of the file itself. However, isn't this timestamp written by SOLR itself (i.e. when SOLR finishes the import, SOLR takes the current timestamp of its own server and puts it in this file, no?) – mrd3650 Feb 15 '12 at 16:44
3

On another note:

When you run delta imports within a small time window (for example, a couple of times within a few seconds) and the database server is on a different machine than the Solr index service, make sure the system time of both machines matches, since the [date_update] timestamp is generated on the database server and dataimporter.last_index_time is generated on the other.

Otherwise, depending on the time difference, you will either miss updates to the index or re-index too much.
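Both failure modes can be shown with a small sketch. The 30-second clock offset, the sample times, and the helper names are made up for illustration; the comparison mirrors the `date_update >= last_index_time` condition of a typical deltaQuery, and the Solr machine's clock is assumed to be exact.

```python
from datetime import datetime, timedelta


def selected_by_delta(date_update, last_index_time):
    """Mirror of the deltaQuery condition: date_update >= last_index_time."""
    return date_update >= last_index_time


def db_stamp(true_time, db_clock_offset):
    """Timestamp the database would write if its clock is off by db_clock_offset."""
    return true_time + db_clock_offset


# Time recorded on the Solr machine for the last import.
last_index_time = datetime(2012, 2, 15, 12, 0, 0)

# Case 1: database clock runs 30 s behind. A row changed 5 s AFTER the import
# gets a stamp that looks older than last_index_time, so it is skipped.
changed_after = last_index_time + timedelta(seconds=5)
missed = not selected_by_delta(
    db_stamp(changed_after, timedelta(seconds=-30)), last_index_time
)

# Case 2: database clock runs 30 s ahead. A row changed 5 s BEFORE the import
# (already indexed) gets a stamp that looks newer, so it is indexed again.
changed_before = last_index_time - timedelta(seconds=5)
reindexed = selected_by_delta(
    db_stamp(changed_before, timedelta(seconds=30)), last_index_time
)
```

The second case only costs extra work, because the record is overwritten by its unique key. The first case silently loses an update.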

ronalchn
JiDo
0

I agree that the Data Import Handler can handle this situation. One important limitation of the DIH is that it does not queue requests: if the DIH is "busy" indexing, it ignores all further DIH requests until it is "idle" again. The skipped requests are lost and never executed.
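One way a client might cope with this is to ask for the handler's status first and only trigger an import when it reports idle. The sketch below assumes the status response is requested as JSON and carries a top-level "status" field with the value "idle" or "busy"; `fetch` is a hypothetical transport function you would supply yourself, and `stub_fetch` is a stand-in used only to make the example runnable.

```python
import json


def trigger_if_idle(fetch, command="delta-import"):
    """Send `command` only if the handler reports idle; return True if it was sent.

    `fetch(command)` is a hypothetical callable that performs the HTTP request
    for the given DIH command and returns the response body as JSON text.
    """
    status = json.loads(fetch("status")).get("status")
    if status != "idle":
        return False  # busy: the request would be dropped, so do not send it
    fetch(command)
    return True


def stub_fetch(command):
    """Stand-in transport that always reports an idle handler."""
    return '{"status": "idle"}'


sent = trigger_if_idle(stub_fetch)
```

Note that checking and then triggering is not atomic: another client could start an import in between, so a caller that must not lose requests would still need to retry or verify afterwards.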