In my application there is a use case to find the closest point to some other geo point. I decided to use in-memory spatial index and found several candidates: jeospatial and Lucene spatial.
I made some benchmarks and was surprised to find out that Lucene index turned out to be very slow. Here is the code from benchmark which is done with JMH. The full source code can be found in my GitHub repository.
@State(Scope.Thread)
public class MyBenchmark {
// Lucene
private static final String COORDINATES_FIELD = "coordinates";
private static final int GEO_PRECISION_LEVEL = 5;
private static final double NEARBY_RADIUS_DEGREE = DistanceUtils.dist2Degrees(
50, DistanceUtils.EARTH_MEAN_RADIUS_KM);
private final Directory directory = new RAMDirectory();
private final IndexWriterConfig iwConfig = new IndexWriterConfig();
private IndexWriter indexWriter = null;
private IndexSearcher indexSearcher = null;
private final SpatialContext spatialCxt = SpatialContext.GEO;
private final ShapeFactory shapeFactory = spatialCxt.getShapeFactory();
private final SpatialStrategy coordinatesStrategy = new RecursivePrefixTreeStrategy(
new GeohashPrefixTree(spatialCxt, GEO_PRECISION_LEVEL),
COORDINATES_FIELD);
// Jeospatial
private VPTree<SimpleGeospatialPoint> jeospatialPoints = new VPTree<>();
public MyBenchmark() {
try {
indexWriter = new IndexWriter(directory, iwConfig);
} catch (IOException e) {
e.printStackTrace();
}
}
@Setup
public void init() throws IOException {
var r = new Random();
for (int i = 0; i < 3000; i++) {
double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
Document doc = new Document();
doc.add(new StoredField("id", r.nextInt()));
var point = shapeFactory.pointXY(longitude, latitude);
for (var field : coordinatesStrategy.createIndexableFields(point)) {
doc.add(field);
}
doc.add(new StoredField(coordinatesStrategy.getFieldName(), latitude + ":" + longitude));
indexWriter.addDocument(doc);
jeospatialPoints.add(new MyGeospatialPoint(latitude, longitude));
}
indexWriter.forceMerge(1);
indexWriter.close();
final IndexReader indexReader = DirectoryReader.open(directory);
indexSearcher = new IndexSearcher(indexReader);
}
private SimpleGeospatialPoint createRandomPoint() {
final double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
final double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
return new MyGeospatialPoint(latitude, longitude);
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(value = 1)
@Warmup(iterations = 0)
@Measurement(iterations = 3)
public void benchLucene() {
double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
final var spatialArgs = new SpatialArgs(SpatialOperation.IsWithin,
shapeFactory.circle(longitude, latitude, NEARBY_RADIUS_DEGREE));
final Query q = coordinatesStrategy.makeQuery(spatialArgs);
try {
final TopDocs topDocs = indexSearcher.search(q, 1);
if (topDocs.totalHits == 0) {
return;
}
var doc = indexSearcher.doc(topDocs.scoreDocs[0].doc);
var coordinates = doc.getField(COORDINATES_FIELD).stringValue();
} catch (IOException e) {
e.printStackTrace();
}
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(value = 1)
@Warmup(iterations = 0)
@Measurement(iterations = 3)
public void benchJeospatial() {
var neighbor = jeospatialPoints.getNearestNeighbor(createRandomPoint(), 50 * 1000);
var n = neighbor.getLatitude();
}
}
In Lucene I'm using RAMDirectory
, but also tried MMapDirectory
. Almost no difference.
The results of the benchmark:
# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchJeospatial
# Run progress: 0,00% complete, ETA 00:01:00
# Fork: 1 of 1
Iteration 1: 77528,657 ops/s
Iteration 2: 81921,096 ops/s
Iteration 3: 83470,405 ops/s
Result "org.sample.MyBenchmark.benchJeospatial":
80973,386 ±(99.9%) 56230,060 ops/s [Average]
(min, avg, max) = (77528,657, 80973,386, 83470,405), stdev = 3082,159
CI (99.9%): [24743,326, 137203,446] (assumes normal distribution)
# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchLucene
# Run progress: 50,00% complete, ETA 00:00:31
# Fork: 1 of 1
Iteration 1: 997,103 ops/s
Iteration 2: 1087,487 ops/s
Iteration 3: 1077,964 ops/s
Result "org.sample.MyBenchmark.benchLucene":
1054,184 ±(99.9%) 906,037 ops/s [Average]
(min, avg, max) = (997,103, 1054,184, 1087,487), stdev = 49,663
CI (99.9%): [148,147, 1960,221] (assumes normal distribution)
# Run complete. Total time: 00:01:03
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
MyBenchmark.benchJeospatial thrpt 3 80973,386 ± 56230,060 ops/s
MyBenchmark.benchLucene thrpt 3 1054,184 ± 906,037 ops/s
As you can see Jeospatial is ~75x faster. So I'm curious, if it's actually true or I just misconfigured Lucene somehow.