
I have an Azure Search index composed of documents that can "occur" in multiple regions any number of times. For example, Document1 has 5 occurrences in Region1 and 20 occurrences in Region2. Document2 has 54 occurrences in Region1 and 10 occurrences in Region3. Document3 has 10 occurrences in Region3. We want to use Azure Search for searching and suggestions, but base the order on the number of occurrences in a region. For example, a search for "Document" from a user in Region1 should return results in the order Document2, Document1, Document3, because Document2 has 54 occurrences in that region, while Document1 has 5 occurrences and Document3 has none.

[
  { 'name': 'Document1', 'regions': ['Region1|5', 'Region2|20'] },
  { 'name': 'Document2', 'regions': ['Region1|54', 'Region3|10'] },
  { 'name': 'Document3', 'regions': ['Region3|10'] }
]

I'm having a hard time figuring out how to structure the index, or whether this is even possible with Azure Search. Please note that the number of regions is potentially in the hundreds of thousands. I am OK with replacing regions with center points and using geospatial functions instead, but I still don't see how to lay out the data or query it.

What is the best way to structure the index and how would one make the query possible?

Jonas Stawski

1 Answer


tl;dr - There might be a solution for you based on some assumptions I have. Please read on, and if possible try to provide some validations around my assumptions for me to give a better answer (if such an answer exists).

Unfortunately, Azure Search doesn't have an out-of-the-box approach for your scenario. There might be a workaround, however: instead of the regions collection being something like ['Region1|5', 'Region2|20'], you could try to structure the document so that it is ['Region1', 'Region1', ..., 'Region2', 'Region2', ...] (that is, make the collection contain n elements of Region1 and m elements of Region2, where in your Document1 example n = 5 and m = 20).

Then you should simply be able to search using the Region that the user originates from and I believe the results should be ordered based on which document's collection column (regions) contains more occurrences of the particular queried region.
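As a minimal sketch of that restructuring, assuming the source data uses the 'Region|count' encoding from the question: the helper name expand_regions is hypothetical, and this only prepares the documents before they are uploaded; it does not call Azure Search.

```python
from typing import List


def expand_regions(encoded: List[str]) -> List[str]:
    """Turn ['Region1|5', 'Region2|20'] into 5 'Region1' and 20 'Region2' entries."""
    expanded: List[str] = []
    for item in encoded:
        # Split on the last '|' so a region name that contains '|' still parses.
        name, count = item.rsplit("|", 1)
        expanded.extend([name] * int(count))
    return expanded


# The three documents from the question.
docs = [
    {"name": "Document1", "regions": ["Region1|5", "Region2|20"]},
    {"name": "Document2", "regions": ["Region1|54", "Region3|10"]},
    {"name": "Document3", "regions": ["Region3|10"]},
]

# Documents in the shape proposed above, ready to be sent to the index.
index_docs = [
    {"name": d["name"], "regions": expand_regions(d["regions"])} for d in docs
]
```

Whether the service then ranks Document2 above Document1 for Region1 still depends on how it scores repeated terms, so this is worth verifying on real data.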

There are two things to weigh here:

  1. You could try adding each region as a column in the search index and use queries against those columns to get the kind of result you want. However, since you mention there might be hundreds of thousands of such regions, that might not work well with our service limits. If that's not the case, I highly recommend adding each region as a column, so that you can query and order by the column value.
  2. With the string-replication approach, you can have arbitrarily large collections, as I believe Azure Search does not limit the number of elements in a collection. The nice thing here is that if your documents have a sparse set of regions (i.e., you may have hundreds of thousands of regions overall, but any given document only enumerates a few), you should be able to achieve what you want. If that's not the case, however, this approach might not be efficient and might even be painful for you to manage.

Also, just FYI I'd recommend taking a look at the scoring profiles feature and especially the tag function to see if that might in any way be useful to you.
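For reference, here is a rough sketch of what a scoring profile using the tag function could look like, written as Python dictionaries that mirror the JSON of the index definition and of a search request body. The profile name "regionBoost", the parameter name "userRegion", and the boost value are made up for illustration. As far as I understand it, the tag function boosts documents whose field shares a value with the parameter; it does not weigh by how many times the value occurs, so it may not produce the per-region ordering on its own.

```python
import json

# Hypothetical names: "regionBoost" (profile) and "userRegion" (parameter).
scoring_profile = {
    "name": "regionBoost",
    "functions": [
        {
            "type": "tag",
            "fieldName": "regions",  # the collection field from the question
            "boost": 5,              # arbitrary illustrative value
            "tag": {"tagsParameter": "userRegion"},
        }
    ],
}


def search_body(search_text: str, region: str) -> dict:
    """Build a search request body that passes the user's region to the profile."""
    return {
        "search": search_text,
        "scoringProfile": scoring_profile["name"],
        # Scoring parameters are passed as "<name>-<value>" strings.
        "scoringParameters": [f"userRegion-{region}"],
    }


body = search_body("Document", "Region1")
print(json.dumps({"scoringProfiles": [scoring_profile]}, indent=2))
print(json.dumps(body, indent=2))
```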

Arvind - MSFT
  • yes, hundreds of thousands of regions so one column per region is not feasible. Will having the region X times result in a better scoring over another document that contains it X - N times? Also those are simplified cases a region count might be very big in the thousands. Would that affect performance? – Jonas Stawski Nov 14 '17 at 23:59
  • Unfortunately, here are some gotchas: 1. You might run into other limitations we have set for the service. Namely around payload size. If you visit: https://learn.microsoft.com/en-us/rest/api/searchservice/supported-data-types#edm-data-types-used-in-azure-search-indexes-and-documents you'll notice that we limit payload size to be 16MB even though there isn't a theoretical limit to the number of elements in the collection. 2. If the document you are indexing comes from blob store, we might truncate it and only get the first few MB (we emit a warning in this case) – Arvind - MSFT Nov 15 '17 at 02:21
  • I'd say give the approach a shot to see if it even plays nice with our service payload limitations. If it does, it might take a longer time to index the data, but search/querying performance should in theory not be affected and you should get the desired scoring that you indicated. This is a pretty atypical use case for Azure search/full-text search in general, which is why we don't have a better solution at the moment. – Arvind - MSFT Nov 15 '17 at 02:26
  • In my POC, approach 2 works well with Search, but doesn't work with Suggestions, as it seems we can't have complex terms. It was suggested to simply use Search instead of Suggestions to accomplish the same thing, but am I losing something by doing so? – Jonas Stawski Nov 15 '17 at 18:23
  • I don't exactly understand your scenario for Suggestions? I assumed we were only talking about Search. Could you describe in more detail what this scenario is? – Arvind - MSFT Nov 15 '17 at 19:37
  • The scenario for suggestions is exactly the same as search. I want suggestions to be weighted by region, just like search. – Jonas Stawski Nov 15 '17 at 19:55
  • I am not sure what you mean by 'complex terms' - Did you take a look at suggesters? (Ref: https://learn.microsoft.com/en-us/rest/api/searchservice/suggesters) You can add the 'regions' field as the sourceField and then make use of the suggestions API (Ref: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions) -- the suggestions API should return the results in order of their TF-IDF score, which should be what you want. Any reason this doesn't work for you? Did you try this and not see the expected order of suggestions? – Arvind - MSFT Nov 15 '17 at 23:15
  • I tried passing ‘term and regions:1234’ in the suggestions api, but it doesn’t work – Jonas Stawski Nov 17 '17 at 19:06
  • I see what you mean by complex terms now -- so unfortunately, yeah, in that case the suggestions might not work. I am not sure if there is an alternative -- one potential thing to try would be to see if fuzzy matching helps. On the suggesters link in my earlier comment, you will find details about the fuzzy parameter. If that doesn't work, I am not sure a solution exists that serves both your scenarios. – Arvind - MSFT Nov 18 '17 at 01:12
  • we chose not to weight the autocomplete with regions as it doesn't truly make sense in the context of auto complete. Thanks for your help – Jonas Stawski Nov 21 '17 at 14:49
  • So the approach of duplicating the region/tags does not work as expected. Just because the region is there multiple times, the results are not ordered correctly, i.e. with the record that contains the region the most times first and the one that contains it the fewest times last. – Jonas Stawski Jan 04 '18 at 16:45
  • Interesting, it should work in theory. Are you sure you don't have other fields which might change the weight of the score? Also, you need to try it after you have a statistically large number of documents (trying it with 5-10 docs might not be useful). Azure search is currently working on support for complex types - when that releases you should be able to potentially make use of that feature for your data model. You can view the status of that feature here: https://feedback.azure.com/forums/263029-azure-search/suggestions/6670910-modelling-complex-types-in-indexes – Arvind - MSFT Jan 06 '18 at 00:46
  • Regarding scoring @JonasStawski, make sure you're not running into this problem when the number of documents is small: https://stackoverflow.com/questions/29814079/azure-search-scoring – Yahnoosh Jan 10 '18 at 16:18
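
The 16 MB payload limit mentioned in the comments can be checked roughly before indexing. The sketch below only measures the JSON size of a single replicated document; the helper name payload_bytes is hypothetical, the "big" document is invented to show the failure case, and a real upload request also carries a batch wrapper and other fields, so treat the result as an estimate.

```python
import json

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # the 16 MB limit quoted in the comments


def payload_bytes(doc: dict) -> int:
    """Approximate the size of a document by its UTF-8 encoded JSON length."""
    return len(json.dumps(doc).encode("utf-8"))


# Document2 from the question, in replicated form: comfortably small.
small = {"name": "Document2", "regions": ["Region1"] * 54 + ["Region3"] * 10}
small_fits = payload_bytes(small) <= MAX_PAYLOAD_BYTES

# An invented worst case: 1,000 regions with 2,000 occurrences each.
big = {
    "name": "BigDocument",
    "regions": [f"Region{i}" for i in range(1000) for _ in range(2000)],
}
big_fits = payload_bytes(big) <= MAX_PAYLOAD_BYTES

print(small_fits, big_fits)
```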