1

I'm working on a Solr dataimport from an Oracle database. The database system has a set of tables dedicated to storing references to changes in other tables. For example, I might have a table named PERSON, and when records are added to this table, their IDs are added to the PERSON_CHANGED table. I'd like to use this PERSON_CHANGED table when defining my deltaQuery so that Solr only indexes the changed records in subsequent indexes. As part of this process, I need to remove records that I've read from the PERSON_CHANGED table after Solr finishes its import (either delta or full), so that I don't process them again later.

What's the best way to run this kind of "cleanup" SQL query after a dataimport?

I've tried combining both of the queries like this (simplified for brevity):

<dataConfig>
    <dataSource ... >
    <document>
        <entity name="person"
                query="
                    SELECT ID, FIRST_NAME, LAST_NAME
                    FROM PERSON
                    WHERE '${dataimporter.request.clean}' != 'false'
                        OR PERSON_ID IN (
                            SELECT ID FROM CHANGED_PERSON
                        );

                    DELETE * (
                        SELECT * FROM CHANGED_PERSON
                    );
        " />
    </document>
</dataConfig>

But this results in a SQL command not properly ended error. Does Solr provide a way to do this kind of cleanup?

Nathan Friend
  • 12,155
  • 10
  • 75
  • 125
  • hi, what's your SOLR version ? – jeorfevre Jun 08 '17 at 14:46
  • @jeorfevre - I'm using Solr 6.5.1. – Nathan Friend Jun 08 '17 at 14:48
  • Your Delete statement isn't specifying which table is to be used for deletion, `DELETE person where person_id in ( SELECT ID FROM CHANGED_PERSON );` – Jair Hernandez Jun 08 '17 at 14:53
  • Also I've just noticed an alias reference in your select statement which is not defined `p.PERSON_ID`. Be sure to append 'p' as the PERSON's table alias. – Jair Hernandez Jun 08 '17 at 15:02
  • Thanks @JairHernandez. I simplified the query for this post and forgot to remove the alias. Both queries (the SELECT and the DELETE) are both working correctly when I run them in my SQL client application. And it seems like the table name isn't necessary in the DELETE statement - I'm following the pattern from here: https://stackoverflow.com/a/33262653/1063392. – Nathan Friend Jun 08 '17 at 15:07
  • Glad to help @NathanFriend , and thanks to you for pointing out this DELETE syntax that I wasn't aware of. – Jair Hernandez Jun 08 '17 at 15:21

2 Answers2

0

Once you're using delta import in SOLR, solr won't process twice your record, since you will keep track of this records every time you will run

Ref doc:

When delta-import command is executed, it reads the start time stored in conf/dataimport.properties.

link: https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example

As part of your question, I can imagine that you're trying to perform full import every time that you run the deltaimport (full import runs cleanup in solr indexes ... etc). This is not the proper way to do deltaimport.

What I would recommand you is : 1) perform delta import (and not full import) 2) once every X days, X month, if your need to, perform a clean import Better to do it in another core, so that your service continues running and you will only replace the cores.

jeorfevre
  • 2,286
  • 1
  • 17
  • 27
  • Thanks for the response! I think the start time is only relevant if you build it into the query, like they do in their example (when they use `'${dih.last_index_time}'`). I'm not using the time stamp to determine what my deltas are - instead I'm using my `CHANGED_PERSON` table. My plan in to run a full import once, and then subsequent imports will use this `CHANGED_PERSON` table to determine what hasn't yet been indexed. This is why I need to be able to delete from this table after an import. – Nathan Friend Jun 08 '17 at 15:27
  • why not to use a timestamp in your PERSON table, and delete TABLE CHANGED_PERSON, so that you will use somr delta mecanics in a simple way. Do you need CHANGED_PERSON table ? – jeorfevre Jun 08 '17 at 15:32
  • I agree with you that it would be simpler to use a timestamp on the `PERSON` table like you suggest, but there's numerous reasons why we're using this `CHANGED_PERSON` system - mostly due to legacy concerns and other systems that interact with these tables. – Nathan Friend Jun 08 '17 at 15:36
0

I found a way to accomplish this cleanup task, but I'm not super happy with it. I can define a separate entity whose query runs a DELETE:

<dataConfig>
    <dataSource ... >
    <document>
        <entity name="person"
                query="
                    SELECT ID, FIRST_NAME, LAST_NAME
                    FROM PERSON
                    WHERE '${dataimporter.request.clean}' != 'false'
                        OR PERSON_ID IN (
                            SELECT ID FROM CHANGED_PERSON
                        )" />

        <entity name="deleteChangedPersonRecords"
                query="DELETE FROM CHANGED_PERSON" />
    </document>
</dataConfig>

This seems to work, but it's a bit of a hack, and it relies on the assumption that Solr executes its entity queries in the same order that they are specified in the file. If anyone has a better solution, please feel free to add your answer to this question.

Nathan Friend
  • 12,155
  • 10
  • 75
  • 125