I have a complex identifier field that contains letters, numbers, white space, and special characters. I have been using the Keyword analyzer on this field, but I am having problems filtering results. Here is an example of the data the field would contain:

O-2011-006953 /4

With the Keyword analyzer in place I'm able to do a contains filter on the index field using numbers, but not letters. The following filter works:

search.ismatch('/.*2011.*/', 'complex_identifier_field', 'full', 'all')

But if I try to do a contains search with a letter, I get 0 results:

search.ismatch('/.*O.*/', 'complex_identifier_field', 'full', 'all')

I believe I need a different or custom analyzer. I recently tried the NGram analyzers, and I also tried writing a custom analyzer using the keyword tokenizer, but I am still unable to do a contains search on the field. How can I create a field that is a single token; accepts alphanumeric characters, white space, and special characters; and allows a contains filter that finds any part of the identifier?

UPDATE

Here is the definition of the field:

new Field("accession_number", DataType.String){ IsSearchable= true, IsFilterable = true, Analyzer = AnalyzerName.Keyword },

And here is the exact search I'm using:

var result = indexClient.Documents.Search(query, searchParameters: parameters);

where query = "print" and parameters =

{
Facets = null,
Filter = search.ismatch('/.*O.*/', 'accession_number', 'full', 'all'),
HighlightFields = null,
HighlightPostTag = null,
HighlightPreTag = null,
IncludeTotalResultsCount = true,
MinimumCoverage = null,
OrderBy = null,
QueryType = Full,
ScoringParameters = null,
ScoringProfile = null,
SearchFields = null,
SearchMode = All,
Select = (9 fields),
Skip = 0,
Top = 50
}
Ryan
  • Thanks for updating with extra details. I don't see anything wrong. To look further into this, it would help to see: 1) the JSON for the index definition (you can get this from the Azure Portal, go to the index and there's a tab for JSON), 2) the JSON for the document that should match but doesn't (you can use the query explorer in the Portal for this), and 3) the query that's failing, done directly in the Portal query explorer instead of through the API. Trying to remove layers while troubleshooting. – Pablo Castro Jun 25 '21 at 16:52

1 Answer

In your example, the value O-2011-006953 /4 doesn't match the regex /.O./, because that regex requires exactly one character before the "O" and exactly one after it ("." means "exactly 1 character in that position"), and the pattern has to match the whole token. If you want to match a substring anywhere within a token, you can use /.*O.*/, where "O" is the substring, "." means "any character", and "*" means "zero or more of the previous element", which in this case is the ".".
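The difference between the two patterns can be checked outside the search service. The snippet below is only a sketch: it uses Python's `re` module, which is not the Lucene regex dialect the service uses, although for these simple patterns "." and "*" mean the same thing. The helper `matches_whole_token` is a hypothetical name introduced here, and `re.fullmatch` stands in for the assumption that a regex must match the entire token produced by the Keyword analyzer.

```python
import re

# Sample value from the question. With the Keyword analyzer the whole
# string is assumed to be indexed as one single token.
TOKEN = "O-2011-006953 /4"


def matches_whole_token(pattern: str, token: str = TOKEN) -> bool:
    """Return True only if the pattern matches the entire token."""
    return re.fullmatch(pattern, token) is not None


# ".O." needs exactly one character on each side of "O", so it fails here.
print(matches_whole_token(r".O."))

# ".*O.*" allows zero or more characters on each side, so it behaves
# like a "contains" check.
print(matches_whole_token(r".*O.*"))
print(matches_whole_token(r".*2011.*"))
```

This only illustrates the regex semantics; it does not reproduce analyzers, query parsing, or any other behaviour of the search service itself.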

Note that this type of regex search can be slow and doesn't guarantee full recall: we limit how many terms we expand from the regex, so we may not return every document that matches it.

Pablo Castro
  • Thanks for your response. I actually am using `/.*O.*/` as the regular expression. I think Stack Overflow may have removed the stars in my original post (perhaps because I didn't enclose them in a code block). So I believe the regular expression I'm using should have found the document. I'm not sure if this makes a difference, but I'm not using the regular expression on the search, only the filter. What is the recommended way to filter search results with a contains search? (Users are entering text to filter results after they search) – Ryan Jun 18 '21 at 13:27
  • I just tried the scenario (keyword analyzer, regex expanding against the start of the string, using search.ismatch()) and it worked. I wonder if we're missing another difference. Can you update the question with the index definition (at least the definition for this field) and the exact full request you're issuing? – Pablo Castro Jun 18 '21 at 22:48