
I'm looking up the country for tens of millions of rows by IP range, and I'm looking for a faster way to do the lookup.

I have 180K tuples in this form:

>>> data = ((0, 16777215, 'ZZ'),
...         (1000013824, 1000079359, 'CN'),
...         (1000079360, 1000210431, 'JP'),
...         (1000210432, 1000341503, 'JP'),
...         (1000341504, 1000603647, 'IN'))

(The integers are ip addresses converted into plain numbers.)

This does the job right, but just takes too long:

>>> ip_to_lookup = 999
>>> country_result = [country
...                   for (from, to, country) in data
...                   if (ip_to_lookup >= from) and
...                      (ip_to_lookup <= to)][0]
>>> print country_result
ZZ

Can anyone point me in the right direction to a faster way of doing this lookup? Using the method above, 100 lookups take 3 seconds. Meaning, I think, 10M rows will take several days.

Charles Duffy
exzackley

  • First obvious micro-optimization: `country_result = next(country for from,to,country in data if ip_to_lookup >= from and ip_to_lookup <= to)` – agf Mar 17 '12 at 17:52
  • Second: Sort the `data` so you only need to test the lower bound -- as soon as you've crossed it, you're in the right range. – agf Mar 17 '12 at 17:52
  • 1
    `from` is a keyword and cannot be used as a variable name. – Karl Knechtel Mar 17 '12 at 20:42
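The `next()` rewrite from the first comment can be sketched as follows; `lookup`, `lo`, and `hi` are names introduced here (`from` is avoided because, as noted above, it is a reserved word):

```python
data = ((0, 16777215, 'ZZ'),
        (1000013824, 1000079359, 'CN'),
        (1000079360, 1000210431, 'JP'),
        (1000210432, 1000341503, 'JP'),
        (1000341504, 1000603647, 'IN'))

def lookup(ip):
    # next() stops at the first matching range instead of scanning the
    # whole sequence and building a throwaway list; `lo`/`hi` stand in
    # for `from`/`to`, since `from` is a reserved word.
    return next((country for lo, hi, country in data
                 if lo <= ip <= hi), None)

print(lookup(999))  # => ZZ
```

This still scans linearly in the worst case, so it only shaves constants; the sorting suggestion in the second comment is what the binary-search answer below builds on.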

2 Answers


You can use the bisect module to perform a binary search after sorting the dataset:

from operator import itemgetter
import bisect

data = ((0, 16777215, 'ZZ'),
        (1000013824, 1000079359, 'CN'),
        (1000079360, 1000210431, 'JP'),
        (1000210432, 1000341503, 'JP'),
        (1000341504, 1000603647, 'IN'))
sorted_data = sorted(data, key=itemgetter(0))
lower_bounds = [lower for lower, _, _ in sorted_data]  # must follow the sorted order

def lookup_ip(ip):
    index = bisect.bisect(lower_bounds, ip) - 1
    if index < 0:
        return None
    _, upper, country = sorted_data[index]
    return country if ip <= upper else None

print(lookup_ip(-1))          # => None
print(lookup_ip(999))         # => ZZ
print(lookup_ip(16777216))    # => None
print(lookup_ip(1000013824))  # => CN
print(lookup_ip(1000400000))  # => IN

The algorithmic complexity of the lookup is O(log n) here, instead of O(n) for a complete list walk.
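For scale, the same lookup works unchanged on a synthetic dataset of 180K ranges (hypothetical data standing in for the real table; range starts and widths are invented for the sketch):

```python
import bisect
import random

# 180,000 synthetic, non-overlapping ranges: starts are distinct
# multiples of 100, each range is 50 addresses wide, so gaps exist.
random.seed(42)
starts = sorted(random.sample(range(0, 2**32, 100), 180000))
sorted_data = [(lo, lo + 50, 'C%02d' % (i % 100))
               for i, lo in enumerate(starts)]
lower_bounds = [lo for lo, _, _ in sorted_data]

def lookup_ip(ip):
    # Find the rightmost range starting at or below ip, then check
    # its upper bound; about 18 comparisons for 180K entries.
    index = bisect.bisect(lower_bounds, ip) - 1
    if index < 0:
        return None
    _, upper, country = sorted_data[index]
    return country if ip <= upper else None

lo0, hi0, c0 = sorted_data[0]
assert lookup_ip(lo0) == c0        # inside the first range
assert lookup_ip(hi0 + 1) is None  # in the gap just after it
```

At roughly log2(180000) ≈ 18 comparisons per lookup, 10M lookups become minutes rather than days.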

Niklas B.
  • I have been working on some wrappers for this sort of thing, to create the abstraction of a mapping with (possibly semi-infinite at either end) non-overlapping ranges for keys instead of individual values. – Karl Knechtel Mar 17 '12 at 20:44
  • Thank you Niklas, that really helped. The bottleneck is no longer the lookup. I have a new bottleneck: loading the 17million rows into memory to do the lookups on. I'm breaking it up into 8000-row chunks, doing the lookups and forming one big insert statement. The insert is fast. But getting the rows to process into memory is slow. It'll get the job done by morning. But I'm curious if I'm doing something wrong. It's taking 15-20 seconds to load in 8000 rows. – exzackley Mar 18 '12 at 06:54
  • @user567784: This is a totally unrelated problem. Please consider accepting one of the answers here and creating a new question for your new problem. – Niklas B. Mar 18 '12 at 14:35
  • Very late to the thread, but beware: this answer will give not-quite-right answers if there are ranges in `data` that share a starting point (e.g. `10.0.0.0/16` and `10.0.0.0/8` IP ranges). The `py-radix` library on `pip` is a good solution for this problem of IP address lookups. – bbayles Mar 17 '16 at 03:00

Assuming your situation meets some requirements, there is a way to get the runtime complexity to O(1) on average, but space complexity suffers.

  1. The data must be static; all data must be processed before any lookups.
  2. Given an arbitrary IP, there must be a way to determine its significant octets.
  3. There must be enough space to add a key for every significant value for each country.

Below is a very naive implementation. It unconditionally selects the first two octets of the IP as significant, concatenates those octets' decimal strings into an integer key, and adds a key for every value between the minimum and maximum. As you can probably tell, there is much room for improvement.

from socket import inet_ntoa
from struct import pack

data = ((0,             16777215,   'ZZ'),
        (1000013824,    1000079359, 'CN'),
        (1000079360,    1000210431, 'JP'),
        (1000210432,    1000341503, 'JP'),
        (1000341504,    1000603647, 'IN'))

def makedict(minip, maxip, country):
    d = {}
    for i in xrange(key(minip), key(maxip)+1):
        d[i] = country
    return d

def key(ip):
    octets = inet_ntoa(pack('>L', ip)).split('.')
    return int("".join(octets[0:2]))

lookup = {}
for lo, hi, cnt in data:
    lookup.update(makedict(lo, hi, cnt))

print lookup[key(999)]          # ZZ
print lookup[key(1000215555)]   # JP

Matt Eckert
  • This can be made a bit more generic by keeping a list of possible countries along with their associated ranges inside the dict, instead of a single country code. If a key is looked up, one can then just walk the (hopefully not too long) list of countries and find the correct one. Also, your `key` function can be improved to `(ip & 0xffff0000) >> 16`, without the need for struct unpacking. – Niklas B. Mar 17 '12 at 21:12
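The bit-shift `key` suggested in the comment above drops straight in; a minimal sketch (Python 3 syntax, same data as the answer):

```python
data = ((0, 16777215, 'ZZ'),
        (1000013824, 1000079359, 'CN'),
        (1000079360, 1000210431, 'JP'),
        (1000210432, 1000341503, 'JP'),
        (1000341504, 1000603647, 'IN'))

def key(ip):
    # Upper 16 bits of the address, i.e. octet1 * 256 + octet2.
    # No string round-trip, and no decimal-concatenation ambiguity
    # (the string version maps e.g. 1.12.x.x and 11.2.x.x to "112").
    return (ip & 0xffff0000) >> 16

lookup = {}
for lo, hi, country in data:
    for k in range(key(lo), key(hi) + 1):
        lookup[k] = country

print(lookup[key(999)])         # ZZ
print(lookup[key(1000215555)])  # JP
```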