
I'm trying to do a kNN search on big data with limited memory.

I'm using HDF5 and python.

I tried brute-force linear search (using PyTables) and kd-tree search (using sklearn).

It's surprising, but the kd-tree method takes more time (maybe the kd-tree would work better if we increased the batch size? But I don't know the optimal size, and it's also limited by memory).

Now I'm looking for ways to speed up the calculations. I think the HDF5 file can be tuned for an individual PC, and the norm calculation could also be sped up, maybe using numexpr or some Python tricks (see the sketch after the code below).

import numpy as np
import time
import tables
import cProfile

from sklearn.neighbors import NearestNeighbors

rows = 10000
cols = 1000
batches = 100
k = 10

#USING HDF5
vec = np.random.rand(1, cols)
data = np.random.rand(rows, cols)
fileName = r'C:\carray1.h5'  # raw string so the backslash is not treated as an escape
shape = (rows*batches, cols)  # predefined size
atom = tables.Float64Atom()  # the data is float64; UInt8Atom would truncate the random floats to 0
filters = tables.Filters(complevel=5, complib='zlib')  # compression settings worth experimenting with

#create the file once (commented out after the first run)
# h5f = tables.open_file(fileName, 'w')
# ca = h5f.create_carray(h5f.root, 'carray', atom, shape, filters=filters)

# for i in range(batches):
#     ca[i*rows:(i+1)*rows] = data[:] + i  # +i to modify the data per batch

# h5f.close()

#can this loop be parallelized?
def test_bruteforce_knn():
    h5f = tables.open_file(fileName)

    t0 = time.time()
    d = np.empty((rows*batches,))
    for i in range(batches):
        # squared Euclidean distance of each row in the batch to the query vector
        d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows] - vec)**2).sum(axis=1)
    print(time.time() - t0)
    ndx = d.argsort()
    print(ndx[:k])

    h5f.close()

def test_tree_knn():
    h5f = tables.open_file(fileName)

    # fitting a single tree on the whole carray will not work (does not fit in memory)
    # t0 = time.time()
    # nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray)
    # distances, indices = nbrs.kneighbors(vec)
    # print(time.time() - t0)

    # fit one tree per batch, then merge the per-batch candidates
    t0 = time.time()
    all_dist = np.empty((batches, k))
    all_idx = np.empty((batches, k), dtype=np.int64)
    for i in range(batches):
        nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray[i*rows:(i+1)*rows])
        distances, indices = nbrs.kneighbors(vec)
        all_dist[i] = distances[0]
        all_idx[i] = indices[0] + i*rows  # shift local indices to global row indices
    print(time.time() - t0)
    order = all_dist.ravel().argsort()[:k]
    print(all_idx.ravel()[order])

    h5f.close()

cProfile.run('test_bruteforce_knn()')
cProfile.run('test_tree_knn()')
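
As a concrete illustration of the numexpr idea mentioned above, here is a minimal sketch of how the per-batch squared distances could be computed in one multithreaded numexpr pass instead of NumPy broadcasting; batch_sqdist_numexpr is a hypothetical helper name and the speedup is an assumption, not a measured result.

import numexpr as ne

def batch_sqdist_numexpr(batch, vec):
    # evaluate sum((batch - vec)**2, axis=1) in a single pass, avoiding the
    # large (batch - vec) temporary array that plain NumPy broadcasting allocates
    return ne.evaluate('sum((batch - vec)**2, axis=1)')

On the HDF5 side, PyTables' create_carray also accepts a chunkshape argument; aligning the chunk shape with the rows-sized batch reads is one knob that could be worth experimenting with per machine.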
mrgloom

1 Answer


If I understand correctly, your data has 1000 dimensions? If that is the case, it's expected that a kd-tree won't fare well, since it suffers from the curse of dimensionality.

You might want to have a look at approximate nearest neighbor search methods instead, for instance flann.
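
As a rough sketch of what that could look like through the pyflann Python bindings: the parameter names below follow the FLANN manual's examples but should be treated as assumptions, and the data here is a small in-memory stand-in rather than the HDF5-backed array.

import numpy as np
from pyflann import FLANN

data = np.random.rand(10000, 1000)   # points that fit in memory
query = np.random.rand(1, 1000)

flann = FLANN()
# 'autotuned' lets FLANN pick an index type and parameters for the requested precision
params = flann.build_index(data, algorithm='autotuned', target_precision=0.9)
indices, dists = flann.nn_index(query, num_neighbors=10, checks=params['checks'])
print(indices, dists)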

ogrisel
  • Yes, 1k. Maybe I'm mistaken, but by kd-tree I mean NearestNeighbors in sklearn.neighbors. Yes, I know about flann (it seems it has native support for HDF5), but it needs to be compiled. – mrgloom Oct 25 '13 at 07:44
  • Well, the scikit-learn implementations of kd-tree and ball-tree also need to be compiled. NearestNeighbors is a wrapper class that delegates the NN query to the individual algorithms (kd-tree, ball-tree or brute force) depending on its constructor parameters. – ogrisel Oct 25 '13 at 09:01
  • I mean that scikit-learn is simple to install (without compiling); my tests show that brute force over HDF5 with the maximum compression level is faster than kd-tree, ball-tree or brute force in scikit-learn. I don't understand why the *-tree methods are slower. – mrgloom Oct 29 '13 at 09:09
  • Because for high-dimensional data, the overhead of building a tree kills the performance. Also, a brute-force approach can use vectorized CPU instructions (e.g. fast linear algebra routines from BLAS) that cannot be leveraged as efficiently in a tree data structure (see the sketch after these comments). – ogrisel Oct 29 '13 at 12:20
  • kd-tree and ball-tree typically work faster than brute force with `n_dimensions` in the `[1, 100]` range. – ogrisel Oct 29 '13 at 12:21
  • I found this for large dimensions: http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf "locality-sensitive hashing for finding nearest neighbors". It seems it's not implemented in scikit-learn yet. – mrgloom Dec 06 '13 at 10:03
  • As I said, flann already has state-of-the-art methods implemented. Have a look at the papers in the publications section of their website. They compare various methods that perform better than a naive data-independent LSH. – ogrisel Dec 09 '13 at 07:45
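
To illustrate the BLAS point from the comments above: the brute-force squared distances can be rewritten as ||x||^2 - 2*x.q + ||q||^2, so that the dominant cost is a single matrix-vector product. Below is a sketch of that expansion (batch_sqdist_blas is a hypothetical helper; scikit-learn's euclidean_distances follows the same idea).

import numpy as np

def batch_sqdist_blas(batch, vec):
    # squared Euclidean distance of each row of `batch` to the query `vec`,
    # using ||x||^2 - 2*x.q + ||q||^2 so the heavy work is one BLAS matrix-vector product
    x_norms = np.einsum('ij,ij->i', batch, batch)   # ||x||^2 per row
    q = vec.ravel()
    q_norm = np.dot(q, q)                           # ||q||^2
    cross = batch.dot(q)                            # x . q via BLAS
    return x_norms - 2.0*cross + q_norm

In the brute-force loop above, d[i*rows:(i+1)*rows] = batch_sqdist_blas(h5f.root.carray[i*rows:(i+1)*rows], vec) would be a drop-in replacement for the broadcasting version.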