
Our SQL Server database table contains approximately 2 million records. The Azure Search index that imports from it has grown to 9.5 million documents. Our indexer is set to run every 24 hours, and we have Change Tracking turned ON in our database. We need to know how to configure the indexer so that it imports only new records every 24 hours.
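
For context, the 24-hour cadence is just a schedule on the indexer definition. A rough sketch against the REST API (names in brackets are placeholders, and the 2016-09-01 api-version is an assumption matching the docs of the time):

    PUT https://[service].search.windows.net/indexers/[indexer]?api-version=2016-09-01
    Content-Type: application/json
    api-key: [admin key]

    {
      "name": "[indexer]",
      "dataSourceName": "[datasource]",
      "targetIndexName": "[index]",
      "schedule": { "interval": "P1D" }
    }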

I am wondering if the issue stems from how our database writes its records: every 24 hours, all 2 million records are gathered from the web and written to a staging table. A stored procedure then compares the data in the staging table to the master data table. If the records for a given Source ID are matched between the staging table and the master table, the staging-table records replace those records in the master table, even when they are identical. Records in the master table whose Source ID doesn't exist in the staging table are left alone. So even though most of the records written to the master data table merely replace existing, identical data, the Azure Search indexer must be seeing them as new records?

At any given time, if our database has only 2 million records, then the Azure Search index should have only 2 million documents. But it continues to add documents, seemingly because of the process described above, and this is what we need to fix.

EDIT: I noticed an option for a "Soft Delete Policy." Would this help us, since it would also delete documents in the index that correspond to database records that have been replaced?
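
For reference, a soft-delete policy is declared on the data source definition. A minimal sketch, assuming a hypothetical IsDeleted flag column (our table would need such a column for this to apply):

    {
      "dataDeletionDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": "IsDeleted",
        "softDeleteMarkerValue": "true"
      }
    }

Note this only fires when a row still exists and is marked as deleted; rows that are physically deleted and re-inserted under new keys are a different situation.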

EDIT #2: I've uploaded an Excel spreadsheet here which diagrams the flow I'm trying to explain, in case it helps to visualize it: https://dl.dropboxusercontent.com/u/8477791/v4AzureSearchIndexFlow.xlsx

Stpete111
  • In your index definition in Azure Search, have you marked the unique ID field from your SQL database as a Key? Assuming the ID remains constant, that should make Azure Search just update the documents rather than inserting new ones. The "soft delete" option is more about when you actually delete records. – flytzen May 30 '17 at 17:43
  • Yes, the key is a field with a unique id. The problem is that the replacement records get a new unique id (even though all the other data is the same as the record being replaced), so the Azure Search index sees them as new records. But my understanding is that with Change Tracking turned on for the database and the watermark column indicated, the indexer should be able to determine what's new and what's not. – Stpete111 May 30 '17 at 18:25
  • Thanks for clarifying. You need to tell Azure Search to use the database change tracking; have you done that? See https://learn.microsoft.com/en-us/azure/search/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers#capturing-new-changed-and-deleted-rows. Of course, that will only pick up new changes, so you'll have to reindex everything once. – flytzen May 30 '17 at 18:44
  • Frans, thank you for your posts so far. I have the "Track Changes" checkbox checked already inside the Azure portal under my data source. Does this not accomplish the same thing as posting the dataChangeDetectionPolicy as shown in your link? – Stpete111 May 30 '17 at 19:50
  • I'd imagine so. I'm stumped :) Sorry I couldn't help. – flytzen May 30 '17 at 20:04
  • Actually, it looks like you may be right. When I do a Get call to see the parameters of my datasource, it shows that my change detection policy is High Watermark, instead of Sql Integrated. So I'm hoping this will be the answer. However, I'm now trying to make the Put call to change the detection policy and I'm getting an error: "Change Detection policy cannot be changed for data source {my datasource} because the indexer {my indexer} references this data source and has a non-empty change tracking state, or it is currently in progress." Any ideas on that? It's definitely not in progress. – Stpete111 May 30 '17 at 20:24
  • I suspect you may need to drop the index and recreate it tbh... – flytzen May 30 '17 at 20:40
  • Ok, I've successfully updated the change tracking policy to SQL integrated. Now I'm re-indexing. Then tonight will be the new index update. So by this time tomorrow I'll know if this solves my issue. Thanks for your help, Frans! – Stpete111 May 30 '17 at 20:49
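
Note: one way to clear the "non-empty change tracking state" error above is to reset the indexer, which clears its change tracking state, and then update the data source. A sketch with placeholder names, assuming the 2016-09-01 api-version:

    POST https://[service].search.windows.net/indexers/[indexer]/reset?api-version=2016-09-01
    api-key: [admin key]

    PUT https://[service].search.windows.net/datasources/[datasource]?api-version=2016-09-01
    Content-Type: application/json
    api-key: [admin key]

    {
      "name": "[datasource]",
      "type": "azuresql",
      "credentials": { "connectionString": "..." },
      "container": { "name": "[table]" },
      "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"
      }
    }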

1 Answer


The fact that new documents continue to accumulate has nothing to do with change detection or deletion detection. The issue is that your source rows get new IDs when they change.

Azure Search bases document identity solely on the value of the key field of your index. Change tracking only governs what is read from the data source; it has no bearing on how that data is indexed. The only factor there is the ID of each document: documents with matching IDs are updated, while IDs that aren't yet in the index result in new documents being uploaded.
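
To illustrate, the key is whichever field is flagged with "key": true in the index definition; indexing a document whose key value already exists updates that document, and any other key value creates a new one. A hypothetical fragment (field names are made up):

    {
      "name": "[index]",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "sourceId", "type": "Edm.String", "filterable": true }
      ]
    }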

One possible workaround is to use a different column of your SQL table as the document key (see the sketch after this list), but this only works if:

  1. Such a column exists that is actually unique across rows, and
  2. The value of the column will not change as the data in the row changes.
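
If such a column exists, you don't even have to change the index schema; the indexer can map it onto the key field with a field mapping. A sketch, where StableId is a hypothetical column name and "id" stands in for whatever your index's key field is called:

    {
      "fieldMappings": [
        { "sourceFieldName": "StableId", "targetFieldName": "id" }
      ]
    }
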
Bruce Johnston
  • To add to Bruce's answer: can you make your search index's key field the same as the SQL table's primary key field? That will ensure that when you update a row in the table, the corresponding search document is updated instead of duplicated. – Eugene Shvets May 31 '17 at 00:11
  • Sorry, maybe my explanation above of our process isn't clear. There is no "updating" of records. Every single piece of data that gets written to our database is a new record. Every record that gets added to our database gets a new key. The index's key field is already that primary key field. I believe the problem is that the Azure Search index is not deleting documents to correspond with the records that get deleted (and replaced) every 24 hours. So 2 questions: 1. what exactly does the SQL integrated change detection do if this has nothing to do w/ it? 2. Barring any other solutions, – Stpete111 May 31 '17 at 17:50
  • ... would it be possible to automate a daily rebuild of the index? Any scenario in this regard would work as long as it could be done without the index name changing. – Stpete111 May 31 '17 at 17:52
  • Gentlemen, please see the Excel diagram I added to the main post - it may help to visualize the flow I'm trying to articulate. – Stpete111 May 31 '17 at 18:46
  • @Stpete111 We understand the flow as you described it. Unfortunately, it is fundamentally at odds with how indexers work. If you can't maintain the values of your keys, you will always end up with new documents in your index, because the value of a key is how Azure Search decides whether you're indexing a new document or updating an existing document. Change detection is about what's changed in the DB since the last indexer run, and it is based on either timestamps (for high watermark policy) or rowversions (for SQL integrated change tracking). Similarly, deletion policies won't help... – Bruce Johnston May 31 '17 at 19:19
  • @Stpete111 ...because they answer the question "for document with key X, is the corresponding row with key X marked as deleted?" As you can see, it's all based around key values. The easiest solution for you would be to introduce a surrogate key into your SQL table that does not change between updates. Otherwise, you'll have to rebuild the index, and since Azure Search doesn't support index name aliasing, you'll have to add logic to your application to handle swapping index names to avoid downtime. – Bruce Johnston May 31 '17 at 19:22
  • @BruceJohnston thanks a lot for your insight thus far. Just to make sure I understand - let's take a simple scenario - say Azure Search has an index on a database that has 10 records. We then delete 5 records from the database, and no other activity. On the next indexing from Azure Search, it won't delete those 5 documents from the index? – Stpete111 May 31 '17 at 19:56
  • @Stpete111 -- Deletion isn't the crux of your problem, but I'll walk through this scenario anyway, because it's sort of related. Whether those 5 documents get deleted or not depends on a few things. If you're using SQL ICT, AND the PKs of the 5 rows that were deleted match the document keys of 5 documents in the index, then those documents will be deleted from the index. This second requirement is what's important for your scenario. If you can't guarantee a correspondence between the PK of rows in the source table and the document keys in the index, none of this works. – Bruce Johnston May 31 '17 at 20:42
  • @Stpete111 If you want to discuss this further, feel free to email me and Eugene directly (I believe you have our contact info) – Bruce Johnston May 31 '17 at 20:42
  • @BruceJohnston I'll reach out over email to you both shortly. Many thanks again. – Stpete111 May 31 '17 at 20:44
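
For the daily-rebuild route discussed above: the drop-and-recreate sequence is short, but it does delete the index contents, so without index name aliasing the application has to tolerate a window of missing results or swap between two index names itself. A sketch with placeholder names, assuming the 2016-09-01 api-version:

    DELETE https://[service].search.windows.net/indexes/[index]?api-version=2016-09-01
    api-key: [admin key]

    PUT https://[service].search.windows.net/indexes/[index]?api-version=2016-09-01
    Content-Type: application/json
    api-key: [admin key]

    { ...the same index definition as before... }

    POST https://[service].search.windows.net/indexers/[indexer]/reset?api-version=2016-09-01
    POST https://[service].search.windows.net/indexers/[indexer]/run?api-version=2016-09-01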