0

On this answer: https://stackoverflow.com/a/1554837/1135424 I found that an 'starts with' can be done using something like:

MyModel.all().filter('prop >=', prefix).filter('prop <', prefix + u'\ufffd')

It mentions that for doing an 'ends with' query would require storing the reverse of the string, then applying the same tactic as above.

So for example, if my current data are domain strings, something like:

domains | reverse_domain
------- | --------------
.com.io | oi.moc.
.com.eu | ue.moc.
.com.mx | xm.moc.

If I want to query for domains ending with '.io' I should do:

suffix = '.io'
MyModel.all().filter(
    'reverse_domain >=', suffix).filter(
    'reserve_domain <', suffix + u'\ufffd')

But when testing, doing a string comparison on a python command line i get this:

>>> '.com.io'[::-1] >= '.io'
True
>>> '.com.io'[::-1] < '.io' +  u'\ufffd'
False

Changing the order, first u'\ufffd' next the suffix

>>> '.com.io'[::-1] < u'\ufffd' + '.io'
True

So wondering if when doing an 'ends with', besides reversing the order of the stored data, the u'\ufffd' should go first, something like this:

MyModel.all().filter(
    'reverse_prop >=', suffix).filter(
    'reverse_prop <', u'\ufffd' + suffix)

Does the datastore filter follows the same lexicographical ordering that python does when comparing strings?

Basically how to do an:

SELECT domain FROM domains WHERE <domain name> LIKE CONCAT('%', domain)

For example, If I search for google.com.io, I could get the domain '.com.io', so, how to get a list of existing domains/strings that end with something?

Update:

While testing seems that I only need to change the operator >= to <=, that gives me the LIKE '%string':

suffix = '.io'[::-1]
MyModel.all().filter(
    'reverse_domain <=', suffix).filter(
    'reserve_domain <', suffix + u'\ufffd')

If I want to search if a string ends with some record that I already have:

>>> assert('.com.io'[::-1] <= '.com.io'[::-1] and '.com.io'[::-1] < '.com.io'[::-1] + u'\ufffd')

>>> assert('.com.io'[::-1] <= 'google.com.io'[::-1] and '.com.io'[::-1] < 'google.com.io'[::-1] + u'\ufffd') 

>>> assert('.com.io'[::-1] <= 'gle.com.io'[::-1] and '.com.io'[::-1] < 'gle.com.io'[::-1] + u'\ufffd')
Community
  • 1
  • 1
nbari
  • 25,603
  • 10
  • 76
  • 131
  • You have to reverse the suffix before searching for it, since the text in the index is reversed too. – Wooble Feb 25 '14 at 12:05
  • you meain something like ``'.com.io'[::-1] >= '.io'[::-1]``and ``'.com.io'[::-1] < '.io'[::-1] + u'\ufffd'`` – nbari Feb 25 '14 at 12:08
  • Right. You're literally just doing a prefix search on your reversed column, so you want to use the "same technique" as startswith to just do a startswith search for "oi." – Wooble Feb 25 '14 at 12:09
  • I made some tests, and besides reversing the suffix and the data I seems I have to change also the ``>=`` operator to ``<=``. – nbari Feb 25 '14 at 14:55

1 Answers1

0

If your use case involves searching for top-level domains, I would recommend splitting the URL into two separate properties. That will make it easy to find all records with a given TLD, and allow more flexibility for other searches.

You may also consider using an integer to represent each TLD, if you have millions of records. It will reduce the size of the data.

This approach may also be a little faster as you use a single equality filter instead of two inequality filters.

Andrei Volgin
  • 40,755
  • 6
  • 49
  • 58
  • You mean splitting the domain for example ['google','com','io'] and search from left to right io.com.google until a match is found ? – nbari Feb 25 '14 at 17:17
  • I mean using two separate properties - one for "google", and another one for "com.io". – Andrei Volgin Feb 25 '14 at 17:20
  • Could be, I just should split the current data per top level domains and so on. – nbari Feb 25 '14 at 17:26
  • I don't know your requirements, but you can also use "com.io" and other TLD as separate entity kinds. This is a good option if you never search across all domains. – Andrei Volgin Feb 25 '14 at 17:29
  • Could be, but maybe also creating an entity for "io", other for "com", and end with the domain, If i need to search for a particular domain I would just search for their parents instead something like IP->COM->gogole right ? – nbari Feb 25 '14 at 17:42
  • Not an entity - separate properties for the same entity. Splitting "com" and "io" is a possibility, if you need to access all "io" entities, for example. – Andrei Volgin Feb 25 '14 at 17:48