1

Searching for entries in a keyword-field stored in an Azure Cognitive Search index not returning expected results when searching for them in a long text. Multi word tokens as 'microsoft azure' are not returned as a match when looking in the text "This text contains microsoft azure"

Working with Azure Cognitive Search with the Python SDK. Say I'm building a search index where each document in the index has a "name" field. The name field (which can consist of multiple words) only makes sense if I tokenize the entire name is one token, so I use the "keyword_v2" tokenizer when building the analyzer for this field.

from azure.search.documents.indexes.models import CustomAnalyzer

# Define the custom analyzer for the Name field
name_analyzer = CustomAnalyzer(name="name_analyzer",tokenizer_name="keyword_v2",
    token_filters=["lowercase"])`

# Specify the index schema

fields = [
        SimpleField(name="key", type=SearchFieldDataType.String, key=True),       
        SearchableField(name="name", type=SearchFieldDataType.String, analyzer_name="name_analyzer", searchable=True)
    ]

This works as expected when I test the analyzer using the REST API. As an example I have the following indexed entries in the name-field: ['microsoft azure', 'amazon aws', 'google cloud']. The custom analyzer I set up correctly tokenizes each entry as one token and not as multiple tokens (ex. 'microsoft' and 'azure').

The problem occurs when I search for the stored names in a text.

text_example = "This is a text containing microsoft azure."

results = search_client.search(search_text=text_example, include_total_count=True, select= ['name'], search_fields= ['name'], highlight_fields= 'name', query_type= "full")

print ('Total Documents Matching Query:', results.get_count())
for result in results:
    print(result)

I expect when I search for a name in the text_example it will return a hit on 'microsoft azure', but it doesn't. It returns empty. I suspect because I use the same custom analyzer as both the index analyzer and search analyzer, it will tokenize the entire text_example as one token, which is not in the index. So it returns nothing.

Can I resolve this problem of searching for multiple word tokens in a long text in an efficient way using Azure Cognitive Search ?

Basically, I want the equivalent of the following Python snippet but with Azure Cognitive Search Index:

name_list = ['microsoft azure', 'amazon aws', 'google cloud']
text_example = "This is a text containing microsoft azure."

for name in name_list:
    if name in text_example:
        print(f"Match found: {name}) #Ideally it would return the field you are looking for and just the matched term.
lex-mw-lab
  • 11
  • 2

2 Answers2

0

For Index Analyzer its fine to use keyword_v2 if you are using filtering on that field. But if you want to satisfy your use case, would rather suggest to use a standard analyzer for both index and search and then use a phrase search that requires term. So you text_example would be "This is a text containing "microsoft azure"."

  • Thanks for the answer, but this would only work if I know the search text in advance, which makes the search index kind of redundant. Maybe I'm misinterpreting your answer. – lex-mw-lab Aug 07 '23 at 14:19
0

To resolve this issue, you can try using a different analyzer for the search query.

You can create a custom analyzer that uses the standard_v2 tokenizer for the search query and apply it to the search text.

Below is the update Analyzer and Schema code snippet:


name_analyzer = CustomAnalyzer(name="name_analyzer", tokenizer_name="keyword_v2", token_filters=["lowercase"])


fields = [
    SimpleField(name="key", type=SearchFieldDataType.String, key=True),
    SearchableField(name="name", type=SearchFieldDataType.String, indexAnalyzer="name_analyzer", searchAnalyzer="standard_v2", searchable=True),
]

With the above schema I created an index and uploaded sample data:

documents = [
    {"key": "1", "name": "microsoft azure"},
    {"key": "2", "name": "amazon aws"},
    {"key": "3", "name": "google cloud"},
]

With above setup I was able to get the required results.

Search Query Code:

text_example = "This is a text containing microsoft azure."

results = search_client.search(search_expression=f"name: '{text_example}'", include_total_count=True, select=['name'], search_fields=['name'], highlight_fields='name', query_type="full")

print('Total Documents Matching Query:', results.get_count())
for result in results:
    print(result)

Result: enter image description here

RishabhM
  • 525
  • 1
  • 5
  • Thanks for the answer, but the search would return every hit on 'microsoft' or 'azure' which is not ideal. If possible the only hit would be if 'microsoft azure' was found. – lex-mw-lab Aug 07 '23 at 14:16