
I'm implementing a full text search based on the song database on GuitarParty.com. The data consists of lyrics in multiple languages, which is not a problem per se.

However, when search results are returned using snippeted_fields, all accented characters within words, such as ÚúÉéÍí, come back as their unaccented equivalents, UuEeIi.

This is how I form my query:

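    from google.appengine.api import search

    # qs is the user's query string; cursor is the paging cursor for this request.
    query = search.Query(
        query_string=qs,
        options=search.QueryOptions(
            sort_options=search.SortOptions(
                #match_scorer=search.MatchScorer(),
                match_scorer=search.RescoringMatchScorer(),
                expressions=[
                    # Boost the relevance score by each document's importance field.
                    search.SortExpression(expression='_score + importance * 0.03', default_value=0)
                    #search.SortExpression(expression='_score', default_value=0)
                ],
                limit=1000,
            ),
            cursor=cursor,
            returned_fields=['title', 'atomtitle', 'item', 'image'],
            # Snippets are requested for these fields; this is where the accents are lost.
            snippeted_fields=['title', 'atomtitle', 'body', 'item'],
        )
    )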

I'm pretty sure this is not an encoding issue, since everything looks just right if I pull my document fields directly (as I do with the titles). It's only the snippeted expressions that display incorrectly.

To better see what I'm referring to you can take my test engine for a spin here: http://gp-search.appspot.com/ and search for something Icelandic. Example phrase: Vísur vatnsenda Rósu

This will return a document with this snippet:

Augun min og augun þin. O þa fogru steina. Mitt er þitt og þitt er mitt, þu veist hvað eg mei- na. Langt er siðan sa eg hann sannlega friður var hann.

Correctly spelled snippet should be:

Augun mín og augun þín. Ó þá fögru steina. Mitt er þitt og þitt er mitt, þú veist hvað eg mei- na. Langt er síðan sá ég hann sannlega friður var hann.

Am I better off generating my own snippets from the document data, or is there something I can do to pull snippets with accented characters within words?

Patrick Costello
  • That's not an encoding issue (those look quite different). This looks deliberate, like they're normalizing to NFD and then stripping the accents (not too hard once you've got a Unicode normalization library). – Donal Fellows Oct 13 '13 at 20:49
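
The normalization described in the comment above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch, not the Search API's actual implementation (which is an assumption here, since it is not shown anywhere in this thread); `strip_accents` is a hypothetical helper name. It decomposes the text to NFD and drops the combining marks, which gives the same kind of output as the snippet in the question. Letters such as þ and ð have no decomposition, so they pass through unchanged, just as they do in the returned snippet.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = u"Augun mín og augun þín. Ó þá fögru steina."
print(strip_accents(original))
```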

1 Answer


The text you index gets normalized, so you don't have to worry about accents (or missing accents) when searching it.
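
If the snippets need to keep their accents, one option is the one raised in the question: build them yourself from the original field text, which comes back with accents intact. The following is a rough sketch under several assumptions: `fold` and `make_snippet` are hypothetical helpers, the `<b>` highlighting and the `radius` parameter are illustrative choices, and `text` would be the field value read from the returned document (the field has to be listed in `returned_fields` for that). The idea is to search an accent-folded copy of the text while keeping a map back to the original character positions, so the excerpt is cut from the unmodified text.

```python
# -*- coding: utf-8 -*-
import unicodedata


def fold(text):
    """Return (folded, index_map): lowercased, accent-stripped text and, for
    each folded character, the index of the original character it came from."""
    folded, index_map = [], []
    for i, ch in enumerate(text):
        for base in unicodedata.normalize("NFD", ch.lower()):
            if not unicodedata.combining(base):
                folded.append(base)
                index_map.append(i)
    return "".join(folded), index_map


def make_snippet(text, query, radius=30):
    """Excerpt around the first query term found, cut from the original text."""
    folded_text, index_map = fold(text)
    for term in fold(query)[0].split():
        pos = folded_text.find(term)
        if pos == -1:
            continue
        start = index_map[pos]
        end = index_map[pos + len(term) - 1] + 1
        # Keep any combining marks that belong to the last matched character.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        lo, hi = max(0, start - radius), min(len(text), end + radius)
        return (("..." if lo > 0 else "") + text[lo:start]
                + "<b>" + text[start:end] + "</b>"
                + text[end:hi] + ("..." if hi < len(text) else ""))
    # No term matched: fall back to the start of the text.
    return text[:2 * radius]


body = u"Augun mín og augun þín. Ó þá fögru steina."
print(make_snippet(body, u"fogru", radius=5))
```

Because the match is done on the folded copy, a query typed without accents (`fogru`) still finds `fögru`, and the highlighted excerpt shows the original spelling.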

Zig Mandel
  • Ok, so if I understand you correctly, I will need to generate my own snippets if I want them to be displayed with properly accented characters? – Kjartan Sverrisson Oct 14 '13 at 12:48
  • Ok, we are talking about two very different things. The snippets I'm referring to are the snippets generated from data I'm indexing using the Search API on Google App Engine, which is unrelated to CSE. https://developers.google.com/appengine/docs/python/search/options#Python_Snippets – Kjartan Sverrisson Oct 14 '13 at 16:27
  • Sorry, I posted a bad URL; I've deleted the comment. – Zig Mandel Oct 14 '13 at 16:37