0

I'm trying to get autocomplete working on my server for search. Here is an example of one of my indexer classes:

class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    artist_name = indexes.CharField(model_attr='clean_artist_name', null=True)
    submitted_date = indexes.DateTimeField(model_attr='submitted_date')
    total_count = indexes.IntegerField(model_attr='total_count')

    # This is used for autocomplete
    content_auto = indexes.NgramField(use_template=True)

    def get_model(self):
        return Artist

    def index_queryset(self, using=None):
        """ Used when the entire index of a model is updated. """
        return self.get_model().objects.filter(date_submitted__lte=datetime.now())

    def get_updated_field(self):
        return "last_data_change"

The text and content_auto fields are populated using templates which, in the case of Artsts, is just the artist name. According to the docs, something like this should work for autocomplete:

objResultSet = SearchQuerySet().models(Artist).autocomplete(content_auto=search_term)

However, trying this with the string "bill w" returns Bill Stephney as the top result and then Bill Withers as the second result. This is because Bill Stephney has more records in the database, but Stephney shouldn't be matching this query: once the "w" is detected it should only match Bill Withers (and other Bill Ws). I've also tried wildcards:

objResultSet = SearchQuerySet().models(Artist).filter(content_auto=search_term + '*')

and

objResultSet = SearchQuerySet().models(Artist).filter(text=AutoQuery(search_term + '*'))

but the wildcard seems to cause a load of problems, with the development server hanging and eventually stopping due to a Write Failed: Broken Pipe error with a cryptic stack trace, all of which is within the Python framework. Has anyone managed to get this working properly? Is NgramField the right type to use? I've tried using EdgeNgramField but that gave me similar results.

benwad
  • 6,414
  • 10
  • 59
  • 93

1 Answers1

0

I believe the Haystack documentation recommends EdgeNgramField for "standard text," which I assume is English. They recommend NgramField for Asian languages or if you want to match across word boundaries. I.e., I think you want your content_auto to use EdgeNgramField:

 content_auto = indexes.EdgeNgramField(use_template=True)

Also, since n-grams are not exactly wildcard searches (in the way we use * [the asterisk] in shell script glob matches, for example), you should not use * in your filter.

One thing I have found that makes a difference in the search results are the parameters you can tweak in the backend engine -- there are settings for the n-gram tokenizer and n-gram filter. Depending on the search engine backend you're using, changing the min_gram values will affect the results you get in your matches.

I've only used the elasticsearch backend so I don't know if other backends are as sensitive to these n-gram settings as the solr/elasticsearch ones. Basically, I created a custom backend based on the default one that comes with haystack and tweaked the min_gram values to test the matches. The higher value you set the more "accurate" the match is since it has to match a longer token.

See this question on using a backend with custom n-gram settings for elasticsearch:

Community
  • 1
  • 1
kchan
  • 836
  • 8
  • 13