How to query atom field with unicode value in Google App Engine production search?

Question

I wrote a text search using Google App Engine search.

In the SDK I tested this query on an atom field:

u'tag:"wartości"'

In production I run the same query on the same data, but it does not work.

How can I run a Unicode query on an atom field?

Is it possible to use Unicode in Google App Engine search?

score 1 · Answer 1 · answered Apr 28 '14 at 21:35

We are aware of this issue and plan to fix ASAP. The fix that we're currently planning will require that the atom field value include exactly the same accent characters in order to match. Matches will continue to be case-insensitive. We expect that at least initially, values that use combining diacritical marks will be treated as different values than those using precomposed characters. We may revisit that decision depending on feedback, but it's the most straightforward fix on our end.

For more on precomposed characters vs. combining diacritical marks, see this Wikipedia article:

http://en.wikipedia.org/wiki/Precomposed_character
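
To make the precomposed vs. combining distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `to_precomposed` is hypothetical, and normalizing to NFC on the client before indexing and querying is only an assumption about a sensible precaution, not something the Search API is confirmed to do for you.

```python
# coding=utf-8
import unicodedata

# The same visible word, spelled two ways at the code point level.
composed = u'warto\u015bci'     # 'ś' as one precomposed code point (U+015B)
decomposed = u'wartos\u0301ci'  # 's' followed by a combining acute accent (U+0301)


def to_precomposed(text):
    """Hypothetical helper: normalize text to NFC (precomposed form)."""
    return unicodedata.normalize('NFC', text)


# The raw strings differ even though they render identically.
print(composed == decomposed)                  # False
# After NFC normalization both spellings compare equal.
print(to_precomposed(decomposed) == composed)  # True
```

If values and queries are both passed through the same normalization, the two spellings should no longer be treated as different strings on the client side.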

Chris

Chris Bond

For Polish we need to be able to use the English alphabet plus the letters ąćęłóńśżź. These are **letters in their own right**: ó is not o + acute accent. We do not use, e.g., the letters q and x. – Chameleon Apr 28 '14 at 23:18
Is it currently possible to search an atom field for 'wartości' with any kind of conversion? – Chameleon Apr 28 '14 at 23:20

score 0 · Answer 2 · answered Apr 29 '14 at 15:15

It looks like I need to translate the AtomField values into new strings, and I need to translate the queries the same way. This workaround supports only Polish Unicode search. I do not know the tokenization rules, so I use 'q' and 'x' to extend the alphabet, since they are not used in Polish.

# coding=utf-8
import re

# Map each Polish letter to an ASCII digraph. 'q' and 'x' are used as
# suffixes because they are not used in Polish.
translate = {
  u'ą': u'aq',
  u'Ą': u'Aq',
  u'ć': u'cq',
  u'Ć': u'Cq',
  u'ę': u'eq',
  u'Ę': u'Eq',
  u'ł': u'lq',
  u'Ł': u'Lq',
  u'ń': u'nq',
  u'Ń': u'Nq',
  u'ó': u'oq',
  u'Ó': u'Oq',
  u'ś': u'sq',
  u'Ś': u'Sq',
  u'ż': u'zx',
  u'Ż': u'Zx',
  u'ź': u'zq',
  u'Ź': u'Zq',
}

# One alternation that matches any single letter from the table.
reTranslate = re.compile(u'(%s)' % u'|'.join(translate))

print(reTranslate.pattern)

test = u"""\
Właściwie prowadzona komunikacja wewnętrzna w firmie,\
 zwłaszcza dużej czy posiadającej rozproszoną sieć oddziałów,\
 może przynieść oszczędność czasu, a co za tym idzie, również pieniędzy."""

# Replace every matched letter with its digraph.
print(reTranslate.sub(lambda match: translate[match.group(0)], test))
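
The workaround only helps if the same translation is applied both when the value is stored and when the query is built. Below is a rough sketch of that idea. The helper names `encode_atom` and `build_tag_query` are hypothetical, the field name `tag` is taken from the question, and the sketch only builds strings; it does not call the Search API and does not escape double quotes inside the value.

```python
# coding=utf-8
import re

# Lowercase pairs from the table above; uppercase entries are derived from them.
_PAIRS = {
    u'ą': u'aq', u'ć': u'cq', u'ę': u'eq', u'ł': u'lq', u'ń': u'nq',
    u'ó': u'oq', u'ś': u'sq', u'ż': u'zx', u'ź': u'zq',
}
TRANSLATE = dict(_PAIRS)
TRANSLATE.update((k.upper(), v.capitalize()) for k, v in _PAIRS.items())

# One alternation that matches any single letter from the table.
_PATTERN = re.compile(u'|'.join(TRANSLATE))


def encode_atom(text):
    """Hypothetical helper: replace Polish letters with ASCII digraphs."""
    return _PATTERN.sub(lambda match: TRANSLATE[match.group(0)], text)


def build_tag_query(value):
    """Hypothetical helper: build a query string for the 'tag' field,
    encoding the value the same way it was encoded at index time."""
    return u'tag:"%s"' % encode_atom(value)


# Encode once when storing the atom value ...
stored_value = encode_atom(u'wartości')
# ... and again when building the query, so both sides match.
query = build_tag_query(u'wartości')
print(stored_value)  # wartosqci
print(query)         # tag:"wartosqci"
```

The encoded string is what would go into the atom field, so anything displayed to users should come from a separately stored original value.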
