This question relates to the Kaggle Two Sigma Rental Listings Challenge. The challenge provides a training data set with approximately 49,000 rows. As part of my feature engineering, I'm trying to calculate the following two features:
- The minimum distance to any other listing, which indicates how densely packed the listings are in the area. Assumption: the denser the area, the higher the interest.
- The number of listings within a 500 m radius. Assumptions: a) the more listings close to mine, the more likely a higher interest; b) if the addresses of those listings differ, the listing is more likely located at a large crossroad.
To do the above, I used scipy.spatial.KDTree from SciPy. The actual question is further down; for the details, continue reading.
Since the KDTree works on Euclidean distances, I first had to transform latitude and longitude into Cartesian coordinates.
import pandas as pd
from math import radians, cos, sin

df = pd.read_json('data/train.json')

def to_Cartesian(lat, lng):
    R = 6367  # approximate Earth radius in km
    lat_, lng_ = map(radians, [lat, lng])
    x = R * cos(lat_) * cos(lng_)
    y = R * cos(lat_) * sin(lng_)
    z = R * sin(lat_)
    return x, y, z

df['x'], df['y'], df['z'] = zip(*map(to_Cartesian, df['latitude'], df['longitude']))
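For reference, the same conversion can also be done vectorized with NumPy over whole columns instead of row by row; a sketch with two made-up NYC coordinates (R is the same approximate Earth radius in km):

```python
import numpy as np
import pandas as pd

def to_cartesian_vectorized(lat_deg, lng_deg, R=6367.0):
    """Convert latitude/longitude in degrees to approximate
    Earth-centered Cartesian coordinates in kilometers."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    x = R * np.cos(lat) * np.cos(lng)
    y = R * np.cos(lat) * np.sin(lng)
    z = R * np.sin(lat)
    return x, y, z

# toy frame standing in for the real training data
toy = pd.DataFrame({'latitude': [40.7128, 40.7306],
                    'longitude': [-74.0060, -73.9866]})
toy['x'], toy['y'], toy['z'] = to_cartesian_vectorized(toy['latitude'], toy['longitude'])
```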
Then I created a KDTree from the x, y, z Cartesian coordinates, which lets me query with metric distances (kilometers).
from scipy import spatial

coordinates = list(zip(df['x'], df['y'], df['z']))
tree = spatial.KDTree(coordinates)
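Side note: SciPy also ships spatial.cKDTree, a C implementation with the same interface that is typically much faster to build and query (in recent SciPy releases KDTree itself wraps the C implementation, but that depends on your version); a sketch with placeholder data:

```python
import numpy as np
from scipy import spatial

# placeholder data standing in for the (x, y, z) listing coordinates
rng = np.random.default_rng(42)
coords = rng.uniform(-10, 10, size=(1000, 3))

# cKDTree: same interface as KDTree, implemented in C
ctree = spatial.cKDTree(coords)
dist, idx = ctree.query(coords[0], k=2)
# the nearest hit for a point that is in the tree is the point itself
```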
Now that I have the KDTree as an index, I can query against it. To compute the first feature from above, I used the KDTree.query method.
import sys

def get_min_distance(row, tree):
    # the Cartesian coordinates are needed to query the k-d tree
    coords = row['x'], row['y'], row['z']
    # query the 3 listings closest to the current listing;
    # the first array holds the Euclidean distances, the second the indices
    distances, indices = tree.query(coords, 3)
    mdist = sys.maxsize
    # start at 1 to skip the listing itself at index 0
    for i in range(1, len(distances)):
        if distances[i] < mdist:
            mdist = distances[i]
    return mdist

df['min_distance_km'] = df.apply(lambda row: get_min_distance(row, tree), axis=1)
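As an aside, since query returns distances sorted ascending, the per-row loop and the apply can both be avoided by querying all points in one call with k=2 and taking the second column; a sketch with toy coordinates:

```python
import numpy as np
from scipy import spatial

# toy stand-ins for the (x, y, z) columns
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0]])
tree = spatial.cKDTree(coords)

# k=2 and one batched call: column 0 is each point itself (distance 0,
# assuming no duplicate coordinates), column 1 is the nearest other point
distances, indices = tree.query(coords, k=2)
min_distance = distances[:, 1]
```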
Then I tried the KDTree.query_ball_point method to answer my second question from above: finding the listings within a 500 m radius of each listing. Problem: this eats up my 8 GB of RAM and never finishes. Since the KDTree is a spatial index, this should finish in no time. So what am I doing wrong?
def get_neighbors(pos, coords):
    # query the indices of all listings within a 500 m radius (0.5 km)
    indices = tree.query_ball_point(coords, 0.5)
    # query_ball_point does not sort its results, so remove the listing
    # itself by its position instead of assuming it comes first
    return [idx for idx in indices if idx != pos]

df['neighborhood'] = [get_neighbors(pos, coords) for pos, coords in enumerate(coordinates)]
df['cnt_neighbors'] = df['neighborhood'].apply(len)
df.sort_values(by='cnt_neighbors', ascending=False).head(n=5)
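For what it's worth, the per-row apply can also be replaced with one batched query_ball_point call over all coordinates; a minimal sketch with toy points (each result list includes the query point itself, so subtract one for the neighbor count):

```python
import numpy as np
from scipy import spatial

# toy points: the first two are 0.3 km apart, the third is isolated
coords = np.array([[0.0, 0.0, 0.0],
                   [0.3, 0.0, 0.0],
                   [5.0, 0.0, 0.0]])
tree = spatial.cKDTree(coords)

# one batched call for all points instead of one call per row
neighbor_lists = tree.query_ball_point(coords, r=0.5)
# subtract 1 because every point matches itself
neighbor_counts = [len(lst) - 1 for lst in neighbor_lists]
```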
Update: I also tried the batch method like this:
results = tree.query_ball_point(coordinates, 25)
len(results)
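On that batch call: with r=25 (i.e. 25 km in these units) almost every listing matches almost every other, so the result is roughly 49k lists of up to 49k indices each, which by itself explains a memory blowup; for 500 m the radius should be 0.5. If only the counts are needed, newer SciPy versions can skip building the index lists entirely (return_length is an assumption about your SciPy version; check it is available before relying on it):

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(1000, 3))  # stand-in for the listing coordinates
tree = spatial.cKDTree(coords)

# return_length=True returns only the number of neighbors per point
# instead of the index lists (available in newer SciPy releases)
counts = tree.query_ball_point(coords, r=0.5, return_length=True)
```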