
I have a dataset composed of millions of examples, where each example contains 128 continuous-valued features labeled with a class name. I'm trying to find a large, robust database/index to use as a KNN classifier for this high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and even then the entire dataset has to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?
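For context, the brute-force scan that instance-based classifiers like IBk perform looks roughly like the sketch below (the `knn_classify` function and the example data are illustrative, not from Weka or any library). It shows why everything has to sit in memory: every query touches every stored example.

```python
import math

def knn_classify(query, examples, k=3):
    """Brute-force k-NN: compare the query against every stored example,
    keep the k closest, and take a majority vote over their labels.
    With millions of 128-dimensional examples, `examples` must fit in
    RAM and each classification is a full linear scan."""
    # examples: list of (feature_tuple, label) pairs
    dists = sorted(
        (math.dist(query, feats), label) for feats, label in examples
    )
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)

# Tiny illustrative dataset (2-D instead of 128-D for readability)
examples = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((5, 6), 'b')]
print(knn_classify((0.2, 0.2), examples, k=3))
```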

I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.

I realize Lucene is designed as a text indexing tool, and not as a general purpose classifier, but is it possible to use it in this way?

Cerin
  • To process "millions of examples", you should take a look at Apache Mahout - a distributed machine learning framework - it seems to have kNN: https://issues.apache.org/jira/browse/MAHOUT-115. – Skarab Apr 06 '11 at 21:28
  • I can't find any documentation for Mahout's KNN, other than a brief reference to it in the Taste component, which explicitly states it only supports boolean features. Mahout doesn't appear usable as a general purpose KNN. – Cerin Apr 07 '11 at 00:03

2 Answers


Lucene doesn't seem like the right choice given what you've told us. Lucene would give you a way to store the data, but in terms of retrieval, it's not designed to do anything but search over textual strings.

Since k-NN is so simple, you might be better off creating your own data store in a typical RDBMS or something like Berkeley DB. You could create keys/indices based on sub-hypercubes of the various dimensions to speed things up: start at the bucket of the item to be classified and move outward...
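A minimal sketch of that bucketing idea, in low dimensions (the `GridIndex` class, the cell width, and the ring search are all illustrative assumptions, not an existing library API). Note the caveat: enumerating neighboring cells grows combinatorially with dimensionality, so this works for a handful of dimensions, not directly for 128.

```python
import itertools
import math
from collections import defaultdict

CELL = 1.0  # grid cell width; a tuning knob, not a prescribed value

def cell_key(point):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / CELL) for x in point)

class GridIndex:
    """Bucket points by grid cell; search the query's own cell first,
    then progressively larger rings of neighboring cells."""

    def __init__(self):
        self.cells = defaultdict(list)

    def add(self, point, label):
        self.cells[cell_key(point)].append((point, label))

    def nearest(self, query, k=1, max_ring=5):
        qc = cell_key(query)
        candidates = []
        for ring in range(max_ring + 1):
            # Visit every cell at Chebyshev distance `ring` from qc.
            # In d dimensions this ring has O((2*ring+1)**d) cells,
            # which is why this scheme breaks down at high d.
            for offsets in itertools.product(range(-ring, ring + 1),
                                             repeat=len(qc)):
                if max(abs(o) for o in offsets) != ring:
                    continue  # interior cells were visited in earlier rings
                key = tuple(q + o for q, o in zip(qc, offsets))
                candidates.extend(self.cells.get(key, []))
            if len(candidates) >= k:
                break
        # Re-rank candidates by true distance. (A point just across a
        # cell border can beat an in-cell point, so a production version
        # would search one extra ring before stopping.)
        candidates.sort(key=lambda pl: math.dist(query, pl[0]))
        return candidates[:k]
```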

dfb
  • I haven't seen any RDBMS support for KNN classification, outside of maybe the GIS standard, which is mostly only supported by expensive proprietary systems. I'm not sure what you mean by creating keys/indices with "hypercubes". Could you please cite some sources? – Cerin Apr 06 '11 at 19:34
  • You'll have to roll your own if you use an RDBMS. If you have a large dataset, you could store all the pairs in a BDB or RDBMS and then index them along each dimension. For two dimensions, this would be like drawing a grid over the space of the parameters. You would then look up the cell and adjacent cells for the nearest items. No sources, just an idea. – dfb Apr 06 '11 at 21:10

This is done in Lucene already with geospatial searches. Of course, the built-in geospatial searches only use two dimensions, so you'll have to modify it a bit. But the basic idea of using numeric range queries will work.
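The retrieval pattern being suggested, in plain Python rather than actual Lucene calls (this is a sketch of the idea only; the function names are made up, and a real implementation would issue one numeric range query per dimension and AND them together in Lucene): filter to an axis-aligned box around the query, then re-rank the survivors by exact distance.

```python
import math

def range_candidates(examples, query, radius):
    """Emulate the result of AND-ing per-dimension numeric range
    queries: keep every example whose value is within `radius` of the
    query on ALL dimensions (an axis-aligned box around the query)."""
    return [
        (feats, label) for feats, label in examples
        if all(abs(f - q) <= radius for f, q in zip(feats, query))
    ]

def knn_via_ranges(examples, query, k=1, radius=1.0):
    """Re-rank the box's survivors by true Euclidean distance.
    If the box is too small it may return fewer than k results, so a
    real system would widen `radius` and retry."""
    cands = range_candidates(examples, query, radius)
    cands.sort(key=lambda pl: math.dist(query, pl[0]))
    return cands[:k]
```

The open question, as the note below says, is how well the per-dimension index intersection holds up at 128 dimensions, where the box either misses the true neighbors or matches almost everything.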

(Note: I'm not aware of anyone doing high-dimensional kNN with Lucene. So I can't comment on how fast it will be.)

Xodarap