
I'm trying to use the DataImportHandler to keep my index in sync with a SQL database (which I would have thought is a pretty vanilla thing to do). Since my database will be pretty large, I want to use incremental imports via the method described at http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport, so the calls take the form http://localhost:8983/solr/Items/dataimport?command=full-import&clean=false. This works perfectly well for adding items.

I have a separate DeletedItems table in my database which contains the primary keys of the items that have been deleted from the Items table, along with when they were deleted. As part of the DataImport call, I had hoped to be able to delete the relevant items from my index based on a query along the lines of

SELECT Id FROM DeletedItems WHERE WasDeletedOn > '${dataimporter.last_index_time}'

but I can't figure out how to do this. The link above alludes to it with the cryptic

In this case it means obviously that in case you also want to use deletedPkQuery then when running the delta-import command is still necessary.

but setting deletedPkQuery to the above SQL query doesn't seem to work. I then read that deletedPkQuery only works with delta-imports, so am I forced to make two requests to my Solr server as part of the sync process? That doesn't seem right, since both operations are parameterized by the dataimporter.last_index_time property, which changes between calls. Surely both steps would need to be done in one "atomic" action? Any ideas?
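For reference, the relevant part of my data-config.xml looks roughly like this (the LastModifiedOn column and the dataSource attributes are placeholders; the deletedPkQuery line is the part that doesn't seem to take effect):

```xml
<!-- Sketch of the data-config.xml in question. Table/column names follow the
     schema described above; LastModifiedOn is an illustrative placeholder. -->
<dataConfig>
  <dataSource driver="..." url="..." />
  <document>
    <entity name="Items"
            pk="Id"
            query="SELECT * FROM Items WHERE LastModifiedOn > '${dataimporter.last_index_time}'"
            deletedPkQuery="SELECT Id FROM DeletedItems WHERE WasDeletedOn > '${dataimporter.last_index_time}'" />
  </document>
</dataConfig>
```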

2 Answers


You must use the DataImportHandler's special commands:

https://wiki.apache.org/solr/DataImportHandler#Special_Commands

With these commands you can alter the boost of, or delete, a document coming from the recordset of the full-import query. Be aware that you must set the $skipDoc field to prevent the document from being indexed again, and that you must repeat the id in the $deleteDocById field.

You can use a union query:

select
  id,
  text,
  'false' as [$deleteDocById],
  'false' as [$skipDoc]
from [rows to update or add]
union select
  id,
  '' as text,
  id as [$deleteDocById],
  'true' as [$skipDoc]
from [rows to delete]

or a CASE WHEN:

select
  id,
  text,
  CASE
    when deleted = 1 then id
    else 'false'
  END as [$deleteDocById],
  CASE
    when deleted = 1 then 'true'
    else 'false'
  END as [$skipDoc]
from [rows to update, add or delete]
where updated > '${dih.last_index_time}'
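Wired into the DIH config, a query like the CASE version above becomes the entity's main query, and the $deleteDocById/$skipDoc columns are picked up automatically; a sketch, with illustrative table and field names:

```xml
<!-- Illustrative entity definition: the CASE query above used as the main
     query of a full-import with clean=false, as in the question. -->
<entity name="item"
        pk="id"
        query="select id, text,
                 CASE when deleted = 1 then id else 'false' END as [$deleteDocById],
                 CASE when deleted = 1 then 'true' else 'false' END as [$skipDoc]
               from [rows to update, add or delete]
               where updated > '${dih.last_index_time}'" />
```

This is then triggered with the same command=full-import&clean=false call described in the question.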
Jokin

The deletedPkQuery is run as part of the regular call to delta-import, so you don't have to run anything twice (and when doing a full-import, there's no need to run deletedPkQuery, since the whole index is cleared before importing anyway).

The deletedPkQuery should be configured on the same entity element as your main query. Be sure the field names match exactly, and that the id produced by your deletedPkQuery matches the one produced by the main query.

There's a minimal example on solr.pl for importing and deleting documents, using the same deleted_entries table structure as you have here:

<entity 
  name="album" 
  query="SELECT * from albums"
  deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at > '${dataimporter.last_index_time}'"
/>

Also make sure that the format of the deleted_at-field is comparable against the value produced by last_index_time. The default is yyyy-MM-dd HH:mm:ss.
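If the database stores its timestamps in a different format, the format that DIH writes for last_index_time (and later substitutes into your queries) can be changed with a propertyWriter element; a sketch, assuming Solr 4.1 or later where this element is available:

```xml
<!-- Sketch: overriding the last_index_time format (assumes Solr 4.1+). -->
<dataConfig>
  <propertyWriter type="SimplePropertiesWriter" dateFormat="yyyy-MM-dd HH:mm:ss" />
  <!-- dataSource and document elements as before -->
</dataConfig>
```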

... and lastly, remember that the last_index_time property isn't available until the second time the task runs, since there is no "previous index time" the first time an index is being populated (but the deletedPkQuery shouldn't need to run before then anyway).

MatsLindh
  • Thanks for the reply. I'm trying to avoid delta-imports though, as they seem to issue a call to the database for each PK returned by the deltaQuery query (so n+1 in all), when a single one would do. From the Solr documentation: "Note: there is an alternative approach for updating documents in Solr, which is in many cases more efficient and also requires less configuration explained on http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ". The method they explain there works very nicely for incremental imports, but I can't understand how they intend deletes to be configured. – Chris Kerridge Jul 23 '16 at 20:39
  • Note, it's a full-import that's used but it *doesn't* clear the index, just adds/updates documents – Chris Kerridge Jul 23 '16 at 20:40
  • Ah, good point. I missed that detail. I guess you'll have to issue two commands - one for the full import, then one delta-import to trigger the deletion afterwards. Deletions with full-imports are usually handled by using `clean=true`. To make deletions work with the technique described on the DeltaQueryViaFullImport, you'll have to handle them manually (through a deleted-field, which you then purge with deleteByQuery afterwards or through a delta-import query that triggers the deletions). How many documents are you deleting between imports? – MatsLindh Jul 24 '16 at 10:32
  • Users would upload potentially very many date-related items, which they could delete (I'm also considering purging elapsed items to keep the index small). I would like these to be made searchable ASAP, so I'm planning to schedule regular calls to update from the DB. The application itself will be making routine DB calls, so ideally I'd like to avoid delta-imports hogging connections with their n+1 calls. A 2nd 'post-import' call to delta-import makes me uneasy due to race conditions. It seems I may need to tweak some Java in DIH, but syncing to a DB seems a routine use case... – Chris Kerridge Jul 24 '16 at 16:08