I'm using Java 7 on Ubuntu 14 with the Java driver for MongoDB 3.0.
I have been working on a big-data project using MongoDB for a couple of months now. After profiling the project I found a bottleneck: some of my queries return millions of documents. The result I get is a FindIterable. When performing my calculations I have to iterate through each document, and the MongoDB documentation tells me to use iterable.forEach. So my code looks like this:
This is one of my queries:
FindIterable<Document> iterable = db.getCollection(dbName).find(
        new Document()
                .append("timestamp", new Document()
                        .append("$gte", startTime)
                        .append("$lte", endTime))
                .append("latitude", new Document()
                        .append("$gte", minLat)
                        .append("$lte", maxLat))
                .append("longitude", new Document()
                        .append("$gte", minLong)
                        .append("$lte", maxLong))
);
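One thing to watch out for when building filters this way: Document is backed by a map, so appending the same key twice silently keeps only the last value, which is why each field's range conditions have to live in a single sub-document. A driver-free sketch of that overwrite behaviour, using a plain LinkedHashMap as a stand-in:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OverwriteDemo {
    // Builds a filter the *wrong* way: two separate appends for "latitude".
    // Like Document, a LinkedHashMap keeps only the last value for a key.
    static Map<String, Object> buildBrokenFilter() {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();

        Map<String, Object> lte = new LinkedHashMap<String, Object>();
        lte.put("$lte", 60.0);
        doc.put("latitude", lte);      // first condition...

        Map<String, Object> gte = new LinkedHashMap<String, Object>();
        gte.put("$gte", 59.0);
        doc.put("latitude", gte);      // ...is overwritten here

        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> filter = buildBrokenFilter();
        System.out.println(filter.size());            // only one key survives
        System.out.println(filter.get("latitude"));   // only the $gte condition
    }
}
```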
I'm then passing that iterable to my createLayer method:
protected double[][] createLayer(FindIterable<Document> iterable) {
    int x = (int) ((maxLat * 100000) - (minLat * 100000));
    int y = (int) ((maxLong * 100000) - (minLong * 100000));
    final double[][] matrix = new double[x][y];

    iterable.forEach(new Block<Document>() {
        @Override
        public void apply(final Document document) {
            //System.out.println(document.get("longitude") + " - " + document.get("latitude"));
            int tempLong = (int) (Double.parseDouble(document.get("longitude").toString()) * 100000);
            int x = (int) (maxLong * 100000) - tempLong;
            int tempLat = (int) (Double.parseDouble(document.get("latitude").toString()) * 100000);
            int y = (int) (maxLat * 100000) - tempLat;
            matrix[y][x] += 1;
        }
    });
    return matrix;
}
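A side note on the index math above: casting a double product to int truncates, so a coordinate whose scaled value lands a hair below the intended integer gets binned into the wrong cell. A minimal, driver-free sketch of the per-document index calculation pulled into a helper (the helper name is mine, not from the driver) using Math.round instead of a bare cast, so it can be unit-tested away from the database:

```java
public class CellIndex {
    // Maps a coordinate onto a grid cell index relative to the maximum,
    // at 5 decimal places of resolution. Math.round guards against the
    // truncation a plain (int) cast applies when the double product lands
    // just below the intended integer.
    static int cellIndex(double max, double value) {
        return (int) (Math.round(max * 100000) - Math.round(value * 100000));
    }

    public static void main(String[] args) {
        System.out.println(cellIndex(10.5, 10.2));     // 30000 cells from the top edge
        System.out.println(cellIndex(60.0, 59.99999)); // adjacent cell: index 1
    }
}
```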
When my iterable contains 3.5 million documents, the run time is about 80 seconds. If I remove my "minor calculations", the run time is about 76 seconds. So the calculations are clearly not the bottleneck here; the iteration through the documents is.
I looked at this post on SO, but since I'm not using Java 8, lambda expressions are unavailable to me.
So, my question is: is iterable.forEach the fastest way to iterate through a large result set? What exactly does the FindIterable contain? Is iterable.forEach slow because it is fetching from the database as it goes? Would the lambda form be any faster?
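For context on the last question: as far as I understand, forEach(Block) is essentially a convenience wrapper that loops over the underlying cursor and invokes the callback per document, and a Java 8 lambda would only be alternative syntax for the same callback, not a faster mechanism. A driver-free sketch of the pattern (the Block interface is re-declared here purely for illustration, and this deliberately omits the cursor cleanup the real driver performs):

```java
import java.util.Arrays;
import java.util.List;

public class ForEachDemo {
    // Stand-in for com.mongodb.Block, re-declared so this compiles without the driver.
    interface Block<T> {
        void apply(T t);
    }

    // What forEach boils down to: a plain loop invoking the callback per element.
    static <T> void forEach(Iterable<T> source, Block<T> block) {
        for (T item : source) {
            block.apply(item);
        }
    }

    // Counts elements via the callback, mimicking per-document work.
    static int countDocs(List<String> docs) {
        final int[] count = {0};
        forEach(docs, new Block<String>() {
            @Override
            public void apply(String doc) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countDocs(Arrays.asList("a", "b", "c"))); // 3
    }
}
```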
Edit: I updated the method to include my calculations. It shouldn't matter, because even with them removed the run time is still very high, as stated above.