This question relates to the Kaggle Two Sigma Rental Listings Challenge. The challenge provides a training data set with approximately 49,000 rows. As part of my feature engineering, I'm trying to calculate the following two features:
- The minimum distance to any other listing, which indicates how densely packed the listings are in the area. Assumption: the denser the area, the higher the interest.
- The number of listings within a 500 m radius. Assumptions: a) the more listings close to mine, the more likely a higher interest; b) if the addresses of those listings differ, the listing is more likely located at a large crossroad.
To do the above, I used scipy.spatial.KDTree from SciPy. The actual question is further down; for the details, continue reading.
Since the KDTree works on Euclidean distances, I first had to transform latitude and longitude into Cartesian coordinates.
import pandas as pd
from math import radians, cos, sin

df = pd.read_json('data/train.json')

def to_Cartesian(lat, lng):
    R = 6367  # approximate Earth radius in km
    lat_, lng_ = map(radians, [lat, lng])
    x = R * cos(lat_) * cos(lng_)
    y = R * cos(lat_) * sin(lng_)
    z = R * sin(lat_)
    return x, y, z

df['x'], df['y'], df['z'] = zip(*map(to_Cartesian, df['latitude'], df['longitude']))
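For reference, the same conversion can also be done vectorized with NumPy over whole columns instead of row by row; a sketch with two made-up NYC coordinates (R is the same approximate Earth radius in km):

```python
import numpy as np
import pandas as pd

def to_cartesian_vectorized(lat_deg, lng_deg, R=6367.0):
    """Convert latitude/longitude in degrees to approximate
    Earth-centered Cartesian coordinates in kilometers."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    x = R * np.cos(lat) * np.cos(lng)
    y = R * np.cos(lat) * np.sin(lng)
    z = R * np.sin(lat)
    return x, y, z

# toy frame standing in for the real training data
toy = pd.DataFrame({'latitude': [40.7128, 40.7306],
                    'longitude': [-74.0060, -73.9866]})
toy['x'], toy['y'], toy['z'] = to_cartesian_vectorized(toy['latitude'], toy['longitude'])
```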
Then I created a KDTree from the x, y, z Cartesian coordinates, which lets me query with metric distances (kilometers).
from scipy import spatial

coordinates = list(zip(df['x'], df['y'], df['z']))
tree = spatial.KDTree(coordinates)
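Side note: SciPy also ships spatial.cKDTree, a C implementation with the same interface that is typically much faster to build and query (in recent SciPy releases KDTree itself wraps the C implementation, but that depends on your version); a sketch with placeholder data:

```python
import numpy as np
from scipy import spatial

# placeholder data standing in for the (x, y, z) listing coordinates
rng = np.random.default_rng(42)
coords = rng.uniform(-10, 10, size=(1000, 3))

# cKDTree: same interface as KDTree, implemented in C
ctree = spatial.cKDTree(coords)
dist, idx = ctree.query(coords[0], k=2)
# the nearest hit for a point that is in the tree is the point itself
```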
Now that I have the KDTree as an index, I can query against it. To compute the first feature from above, I used the KDTree.query method.
import sys

def get_min_distance(row, tree):
    # the Cartesian coordinates are needed to query the k-d tree
    coords = row['x'], row['y'], row['z']
    # query the 3 listings closest to the current listing;
    # the first array holds the Euclidean distances, the second the indices
    distances, indices = tree.query(coords, 3)
    mdist = sys.maxsize
    # start at 1 to skip the listing itself at index 0
    for i in range(1, len(distances)):
        if distances[i] < mdist:
            mdist = distances[i]
    return mdist

df['min_distance_km'] = df.apply(lambda row: get_min_distance(row, tree), axis=1)
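As an aside, since query returns distances sorted ascending, the per-row loop and the apply can both be avoided by querying all points in one call with k=2 and taking the second column; a sketch with toy coordinates:

```python
import numpy as np
from scipy import spatial

# toy stand-ins for the (x, y, z) columns
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0]])
tree = spatial.cKDTree(coords)

# k=2 and one batched call: column 0 is each point itself (distance 0,
# assuming no duplicate coordinates), column 1 is the nearest other point
distances, indices = tree.query(coords, k=2)
min_distance = distances[:, 1]
```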
Then I tried the KDTree.query_ball_point method to answer my second question from above: finding the listings within a 500 m radius of each listing. Problem: this eats up my 8 GB of RAM and never finishes. Since the KDTree is a spatial index, this should finish in no time. So what am I doing wrong?
def get_neighbors(pos, coords):
    # query the indices of all listings within a 500 m radius (0.5 km)
    indices = tree.query_ball_point(coords, 0.5)
    # query_ball_point does not sort its results, so remove the listing
    # itself by its position instead of assuming it comes first
    return [idx for idx in indices if idx != pos]

df['neighborhood'] = [get_neighbors(pos, coords) for pos, coords in enumerate(coordinates)]
df['cnt_neighbors'] = df['neighborhood'].apply(len)
df.sort_values(by='cnt_neighbors', ascending=False).head(n=5)
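For what it's worth, the per-row apply can also be replaced with one batched query_ball_point call over all coordinates; a minimal sketch with toy points (each result list includes the query point itself, so subtract one for the neighbor count):

```python
import numpy as np
from scipy import spatial

# toy points: the first two are 0.3 km apart, the third is isolated
coords = np.array([[0.0, 0.0, 0.0],
                   [0.3, 0.0, 0.0],
                   [5.0, 0.0, 0.0]])
tree = spatial.cKDTree(coords)

# one batched call for all points instead of one call per row
neighbor_lists = tree.query_ball_point(coords, r=0.5)
# subtract 1 because every point matches itself
neighbor_counts = [len(lst) - 1 for lst in neighbor_lists]
```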
Update: I also tried the batch method like this:
results = tree.query_ball_point(coordinates, 25)
len(results)
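On that batch call: with r=25 (i.e. 25 km in these units) almost every listing matches almost every other, so the result is roughly 49k lists of up to 49k indices each, which by itself explains a memory blowup; for 500 m the radius should be 0.5. If only the counts are needed, newer SciPy versions can skip building the index lists entirely (return_length is an assumption about your SciPy version; check it is available before relying on it):

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(1000, 3))  # stand-in for the listing coordinates
tree = spatial.cKDTree(coords)

# return_length=True returns only the number of neighbors per point
# instead of the index lists (available in newer SciPy releases)
counts = tree.query_ball_point(coords, r=0.5, return_length=True)
```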