This is mostly a follow-up to @Anony-Mousse's post, which is quite on-target.
Indexes need to be added to the database by the user. There is currently no automatic indexing (as any index will require extra memory and construction time). The parameter for adding an index is -db.index. Support for automatic indexing is on the wish list, but it requires carefully tuned cost models. On small data sets or high-dimensional data, or when the user doesn't need this type of query at all, adding an index will come at a cost.
The database will forward the query request to each index in order. The first index to offer acceleration wins. If no index returns an accelerated query, the database will fall back to a linear scan, unless the hint DatabaseQuery.HINT_OPTIMIZED_ONLY was given; in that case, null will be returned. A linear scan can be forced via QueryUtil, which is mostly useful for unit testing indexes.
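The dispatch order described above can be sketched in plain Java. This is only an illustration of the logic, not ELKI code: SketchQuery, SketchIndex, QueryDispatchSketch and the boolean optimizedOnly flag are hypothetical stand-ins for the real query, index and hint types.

```java
import java.util.List;

/** Hypothetical stand-in for an executable query (e.g. kNN or range). */
interface SketchQuery {
  String describe();
}

/** Hypothetical stand-in for an index; returns null when it cannot accelerate the request. */
interface SketchIndex {
  SketchQuery tryAccelerate(String distanceName);
}

final class QueryDispatchSketch {
  /** Builds a toy index that only supports one distance, identified by name. */
  static SketchIndex indexFor(String supportedDistance, String label) {
    return distanceName -> {
      if (!supportedDistance.equals(distanceName)) {
        return null; // this index cannot accelerate the requested distance
      }
      return () -> label;
    };
  }

  /**
   * Asks each index in order; the first one to offer acceleration wins.
   * Without any accelerated query, falls back to a linear scan, unless
   * optimizedOnly is set (mimicking the optimized-only hint), in which
   * case null is returned.
   */
  static SketchQuery getQuery(List<SketchIndex> indexes, String distanceName, boolean optimizedOnly) {
    for (SketchIndex index : indexes) {
      SketchQuery query = index.tryAccelerate(distanceName);
      if (query != null) {
        return query;
      }
    }
    if (optimizedOnly) {
      return null;
    }
    return () -> "linear scan";
  }
}
```

A caller that passes the optimized-only flag has to be prepared for a null result, which is the main practical consequence of that hint.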
M-Trees can work with any numeric distance, but if the distance is not metric the results may be incorrect. An error should be reported if a distance function does not report isMetric() as true.
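To see why a numeric but non-metric distance is a problem: metric indexes prune using the triangle inequality, and squared Euclidean distance is a standard example that violates it. The small sketch below (class and method names are hypothetical, and it does not use any ELKI API) checks this for three points on a line.

```java
final class MetricCheckSketch {
  /** Squared Euclidean distance in one dimension: numeric, but not a metric. */
  static double squaredEuclidean(double a, double b) {
    double diff = a - b;
    return diff * diff;
  }

  /** True if d(a,c) > d(a,b) + d(b,c), i.e. the triangle inequality is violated. */
  static boolean violatesTriangleInequality(double a, double b, double c) {
    double ab = squaredEuclidean(a, b);
    double bc = squaredEuclidean(b, c);
    double ac = squaredEuclidean(a, c);
    return ac > ab + bc;
  }
}
```

For the points 0, 1 and 2, the direct squared distance is 4 while the detour via 1 sums to 2, so any pruning bound derived from the triangle inequality can discard true neighbors.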
R-Trees can work with any distance function that implements SpatialPrimitiveDistanceFunction, which essentially means implementing a lower-bound point-to-rectangle distance. A lower bound can be found for many distance functions, but effectiveness can vary. For example, angular distances will benefit much less from the rectangular pages the R-tree uses.
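For Euclidean distance, the lower-bound point-to-rectangle distance is easy to write down: in each dimension, take the gap between the point and the nearest side of the rectangle (zero if the coordinate lies inside). The following is a minimal standalone sketch of that idea; MinDistSketch and its array-based signature are assumptions for illustration and not the ELKI interface.

```java
final class MinDistSketch {
  /**
   * Lower bound on the Euclidean distance from point p to any point inside
   * the axis-parallel rectangle given by its corners min and max.
   */
  static double minDist(double[] p, double[] min, double[] max) {
    double sum = 0.0;
    for (int i = 0; i < p.length; i++) {
      double gap = 0.0;
      if (p[i] < min[i]) {
        gap = min[i] - p[i]; // point lies below the rectangle in this dimension
      } else if (p[i] > max[i]) {
        gap = p[i] - max[i]; // point lies above the rectangle in this dimension
      }
      sum += gap * gap;
    }
    return Math.sqrt(sum);
  }
}
```

An R-tree query can skip a page whenever this bound already exceeds the current search radius, since no point inside the page can be closer than the bound.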
As for the run method: the preferred signature for usual vector-space methods is
YourResultType run(Database database, Relation<V> relation)
As of now, the database can actually be obtained via relation.getDatabase(), but this may change in the future. There are a number of situations where this is problematic, and unfortunately some situations where it currently cannot easily be removed. Anyway, this is the explicit form, which is convenient for running the algorithms from Java code: it allows me to specify which relation to use, instead of having to use a database where this is the only appropriate relation (so that it gets chosen automatically).
I do have plans to make this even more explicit in the long run, adding explicit support for choosing a data subset to process, and maybe also the queries. The abstract parent run method would then take care of this. An automatic optimizer would rely on this: it would first query all algorithms to be run for their requirements, including query requirements. Based on the queries, the data set, the available memory etc., the optimizer could then choose appropriate indexes and pass the appropriate query methods to the algorithm.
To keep the run signature simple, it will likely be handled via some Instance classes and more use of the factory pattern instead. But don't worry about it now.
If you want to understand why we need this, have a look at e.g. geospatial outlier detection algorithms. The signature used by SLOM, for example, is:
OutlierResult run(Database database, Relation<N> spatial, Relation<O> relation)
That is, SLOM uses two relations. The first relation is the spatial relationship of the instances, e.g. geographic positions. The second relation is the actual data, e.g. measurements. The geographic positions are used to determine which instances are expected to be similar (but these could also be e.g. polygons!), while the second relation specifies the data that is then actually compared for similarity.
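The role of the two relations can be illustrated with a deliberately simplified score. This is not SLOM itself, just a toy "deviation from the neighborhood mean" sketch: the positions array plays the part of the spatial relation, the values array the part of the data relation, and TwoRelationSketch, the radius parameter and the array-based representation are all assumptions made for the example.

```java
final class TwoRelationSketch {
  /**
   * For each instance, finds its neighbors using only the positions (the
   * "spatial" relation), then compares its value (the "data" relation)
   * with the mean value of those neighbors. Instances without neighbors score 0.
   */
  static double[] scores(double[][] positions, double[] values, double radius) {
    int n = positions.length;
    double[] result = new double[n];
    for (int i = 0; i < n; i++) {
      double sum = 0.0;
      int count = 0;
      for (int j = 0; j < n; j++) {
        if (i != j && euclidean(positions[i], positions[j]) <= radius) {
          sum += values[j];
          count++;
        }
      }
      result[i] = count > 0 ? Math.abs(values[i] - sum / count) : 0.0;
    }
    return result;
  }

  /** Plain Euclidean distance between two positions of equal dimensionality. */
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

The point of passing the two inputs separately is that neither one can stand in for the other: swapping the positions for polygons would only change how neighbors are found, not how the values are compared.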