Can't find strings that aren't words in Django Haystack/Elasticsearch

Question

I'm using Django Haystack with Elasticsearch as the backend for a real-time flight mapping service.

My search indexes are set up, but I'm having trouble returning results for queries that aren't full words. Aviation callsigns are a good example: some take the form N346IF, while others include full words, as in Speedbird 500. Queries in the N346IF style yield no results, whereas queries like the latter return results without trouble.

I make my query as below:

from haystack.query import SearchQuerySet
queryResults = SearchQuerySet().filter(content=q)  # q is the raw query string

(Note that I previously used AutoQuery, but according to the documentation it only tracks words, so I'm passing a raw string now.)

My search index fields are set up as EdgeNgramField with search templates.

I have a custom backend with the following index settings (I have tried both the snowball analyzer and the pattern analyzer as the default):

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 4,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 4,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 4,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 4,
                    "max_gram": 15
                }
            }
        }
    }
}

ELASTICSEARCH_DEFAULT_ANALYZER = "pattern"

My backend is configured as:

from django.conf import settings
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend, ElasticsearchSearchEngine)


class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_ANALYZER = "pattern"

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
            connection_alias, **connection_options)

        # Fall back to Haystack's defaults when a setting is not defined.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)

        if user_settings:
            self.DEFAULT_SETTINGS = user_settings
        if user_analyzer:
            self.DEFAULT_ANALYZER = user_analyzer

    def build_schema(self, fields):
        content_field_name, mapping = super(
            ConfigurableElasticBackend, self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            # Apply the default analyzer to indexed string fields, but leave
            # facet fields and (edge) n-gram fields with their own analyzers.
            if field_mapping['type'] == 'string' and field_class.indexed:
                if (not hasattr(field_class, 'facet_for') and
                        field_class.field_type not in ('ngram', 'edge_ngram')):
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
    backend = ConfigurableElasticBackend

What would be the correct setup to return results for both word-based callsigns and N346IF-style strings?

I'd appreciate any input. Apologies if this duplicates another question; I could not find anything related to it.

Edit: as requested by solarissmoke, here is the index definition for this model:

from haystack import indexes

class FlightIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)
    flight = indexes.CharField(model_attr='flightID')
    callsign = indexes.CharField(model_attr='callsign')
    displayName = indexes.CharField(model_attr='displayName')
    session = indexes.CharField(model_attr='session')

    def prepare_session(self, obj):
        return obj.session.serverId

    def get_model(self):
        return Flight

The text field is built from this template:

flight___{{ object.callsign }}___{{ object.displayName }}

I think we need to see the schema for the fields that you are indexing. Please post the index definition. — solarissmoke, Jul 10 '16 at 02:38
@solarissmoke - just edited it. Please let me know if you think anything else is needed. — Cameron, Jul 10 '16 at 09:10
I am reasonably sure that I know what the issue is, but to help me confirm can you provide sample `Flight` data (`callsign`, `displayName`) that you can successfully search, and some that you can't, and the associated search query? — solarissmoke, Jul 10 '16 at 09:44
Sure thing! A query that works: `callsign` is `United 55`, `displayName` is `Tsuyoshi Hiroi`. With the query `United` or `United 55`, the results are returned. One that doesn't work: `callsign` is `N133TC`, `displayName` is `Shahrul Nizam`. Querying by the callsign (query content `N133TC`) returns nothing. However, querying by the display name works (`Shahrul` yields results). — Cameron, Jul 10 '16 at 09:51
Hmm, that is not what I expected. Can you confirm that in your `text` document the `___` are underscores and not spaces? — solarissmoke, Jul 10 '16 at 10:33
They are underscores yep. Wasn't sure if I was formatting this template correctly... — Cameron, Jul 10 '16 at 10:38

score 1 · Answer 1 · answered Jul 10 '16 at 10:42

It doesn't fully explain the behaviour you are seeing, but I think the problem is with how you are indexing your data - specifically the text field (which is what gets searched when you filter on content).

Take the example data you provided: callsign N133TC, display name Shahrul Nizam. The text document for this data becomes:

flight___N133TC___Shahrul Nizam

You have set this field as an EdgeNgramField (min 4 chars, max 15). Here are the ngrams that are generated when this document is indexed (I've ignored the lowercase filter for simplicity):

flig
fligh
flight
flight_
flight__
flight___
flight___N
flight___N1
flight___N13
flight___N133
flight___N133T
flight___N133TC
Niza
Nizam

Note that the tokenizer does not split on underscores. Now, if you search for N133TC, none of the above tokens will match. (I can't explain why Shahrul works... it shouldn't, unless I've missed something, or there are spaces at the start of that field).

If you changed your text document to:

flight N133TC Shahrul Nizam

Then the indexed tokens would be:

flig
fligh
flight
N133
N133T
N133TC
Shah
Shahr
Shahru
Shahrul
Niza
Nizam

Now, a search for N133TC should match.
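
To check the two token lists above by hand, here is a minimal Python sketch of the edge n-gram step. It assumes, as the lists do, that the text is split on whitespace only and that lowercasing is ignored. `edge_ngrams` and `index_tokens` are illustrative helper names, not part of Haystack or Elasticsearch, and the real analyzer may tokenize differently.

```python
def edge_ngrams(token, min_gram=4, max_gram=15):
    """Return the front-anchored n-grams of one token."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_tokens(text, min_gram=4, max_gram=15):
    """Split on whitespace (a simplification), then edge n-gram each token."""
    tokens = []
    for word in text.split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens


with_underscores = index_tokens("flight___N133TC___Shahrul Nizam")
with_spaces = index_tokens("flight N133TC Shahrul Nizam")

print("N133TC" in with_underscores)  # False
print("N133TC" in with_spaces)       # True
```

Under these assumptions the callsign only becomes its own token once it is separated from the surrounding text by spaces.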

Note also that the flight___ string in your document generates a whole load of (most likely) useless tokens - unless this is deliberate you may be better off without it.

This makes a lot more sense now, thanks for answering. I was unaware of how the tokeniser split. However, this doesn't do the trick: the `N133TC` callsign patterns are still not matching. Not sure if this has anything to do with the filters set... By the way, for reference, I had set the `flight___` prefix because the search handles different models, and I wanted to differentiate between them on the front end more easily. I'll switch and use another key to define these. — Cameron, Jul 10 '16 at 12:04
I suspect the issue is with how your data is being indexed but cannot see what it could be. You may need to use the Analysis API to see exactly what your analyzed data looks like. — solarissmoke, Jul 11 '16 at 04:31
Thanks for your help - much appreciated. Just a quick update: I'm experimenting with this tool to find the right pattern: https://github.com/polyfractal/elasticsearch-inquisitor — Cameron, Jul 12 '16 at 16:42
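
The Analysis API check suggested in the comments could be sketched roughly as follows. This only builds the HTTP request and does not send it. The host `http://localhost:9200` and the index name `haystack` are assumptions, not values from the question, and the JSON request body assumes an Elasticsearch version that accepts one on `_analyze` (older releases took query-string parameters instead). `build_analyze_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request


def build_analyze_request(text, analyzer, index="haystack",
                          host="http://localhost:9200"):
    """Build (but do not send) a POST to the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        url="{}/{}/_analyze".format(host, index),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("N133TC", "edgengram_analyzer")
print(request.full_url)  # http://localhost:9200/haystack/_analyze

# To send it against a running cluster (not executed here):
#   from urllib.request import urlopen
#   print(json.load(urlopen(request))["tokens"])
```

The response lists the tokens the named analyzer produces for the text, which shows directly whether `N133TC` survives analysis.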

score 0 · Accepted Answer · edited May 23 '17 at 10:28

Solving my own question - appreciate the input by solarissmoke as it has helped me track down what was causing this.

My answer is based on Greg Baker's answer on the question ElasticSearch: EdgeNgrams and Numbers

The issue appears to be related to the use of numeric values within the search text (in my case, the N133TC pattern). Note that I was using the snowball analyzer at first, before switching to pattern. Neither worked.

I adjusted my analyzer setting in settings.py:

"edgengram_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["haystack_edgengram"]
}

This changes the tokenizer from the original lowercase tokenizer to standard.

I then set the default analyzer to be used in my backend to the edgengram_analyzer (also in settings.py):
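
A rough way to see why the tokenizer change helps: Elasticsearch's `lowercase` tokenizer splits text at every non-letter character, so the digits in `N133TC` break it into fragments shorter than `min_gram`, and the edge n-gram filter then emits nothing for it. The sketch below imitates both tokenizers with regular expressions. It is only an approximation: the regexes are simplified, ASCII-only stand-ins rather than Elasticsearch's actual implementation, and the helper names are hypothetical.

```python
import re


def lowercase_tokenizer(text):
    """Approximate the `lowercase` tokenizer: split at non-letters, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def standard_tokenizer(text):
    """Very rough stand-in for the `standard` tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def edge_ngram_filter(tokens, min_gram=4, max_gram=15):
    """Emit front-anchored n-grams; tokens shorter than min_gram yield nothing."""
    grams = []
    for token in tokens:
        longest = min(len(token), max_gram)
        grams.extend(token[:n] for n in range(min_gram, longest + 1))
    return grams


callsign = "N133TC"
lowercase_grams = edge_ngram_filter(lowercase_tokenizer(callsign))
standard_grams = edge_ngram_filter(standard_tokenizer(callsign))

print(lowercase_tokenizer(callsign))  # ['n', 'tc']
print(lowercase_grams)                # []
print(standard_grams)                 # ['N133', 'N133T', 'N133TC']
```

Under the same approximation, `United 55` still yields `unit`, `unite` and `united` from the `lowercase` tokenizer, which fits the observation that word-style callsigns were searchable all along.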

ELASTICSEARCH_DEFAULT_ANALYZER = "edgengram_analyzer"

This does the trick! It still works as an EdgeNgram field should, but allows for my numeric values to be returned properly too.

I've also followed the advice in the answer by solarissmoke and removed all the underscores from my index files.
