I'm using Java 7 on Ubuntu 14 with the MongoDB Java driver 3.0.

I have been working on a big-data project using MongoDB for a couple of months now. After analyzing my project's performance I have found a bottleneck. Some of my queries return millions of documents. The result I get is of type FindIterable. To perform my calculations I have to iterate through each document, and the MongoDB documentation tells me to use iterable.forEach.

This is one of my queries:

// Both bounds of a field go into one sub-document: Document behaves like a map,
// so appending the same key twice keeps only the last condition.
FindIterable<Document> iterable = db.getCollection(dbName).find(
            new Document()
                    .append("timestamp", new Document()
                            .append("$gte", startTime)
                            .append("$lte", endTime))
                    .append("latitude", new Document()
                            .append("$gte", minLat)
                            .append("$lte", maxLat))
                    .append("longitude", new Document()
                            .append("$gte", minLong)
                            .append("$lte", maxLong))
    );

I then pass that iterable to my createLayer function:

protected double[][] createLayer(FindIterable<Document> iterable) {
    int x = (int) ((maxLat * 100000) - (minLat * 100000));
    int y = (int) ((maxLong * 100000) - (minLong * 100000));
    final double[][] matrix = new double[x][y];

    iterable.forEach(new Block<Document>() {

        @Override
        public void apply(final Document document) {
            //System.out.println(document.get("longitude")+" - "+ document.get("latitude"));
            int tempLong = (int) ((Double.parseDouble(document.get("longitude").toString())) * 100000);
            int x = (int) (maxLong * 100000) - tempLong;
            int tempLat = (int) ((Double.parseDouble(document.get("latitude").toString())) * 100000);
            int y = (int) (maxLat * 100000) - tempLat;
            matrix[y][x] += 1;
        }
    });

    return matrix;
}

When my iterable contains 3.5 million documents, my run time is about 80 seconds. If I remove my "minor calculations", the run time is about 76 seconds. So the bottleneck is clearly not my calculations but the iteration over the documents.
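
The claim that the calculation itself is cheap can be cross-checked by timing the binning step without the driver. The following is a minimal sketch under stated assumptions, not the project's code: the class `GridBinning`, the method `binPoints`, the `cellsPerDegree` parameter, and the use of `double[]{latitude, longitude}` pairs in place of `Document` objects are all hypothetical. Unlike `createLayer`, it allocates one extra row and column and skips out-of-range points, since a point lying exactly on the minimum edge may otherwise index past the end of the array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-alone version of the binning in createLayer (Java 7 compatible).
final class GridBinning {

    /**
     * Counts points per grid cell. Each point is {latitude, longitude}.
     * cellsPerDegree plays the role of the factor 100000 in createLayer.
     */
    public static double[][] binPoints(Iterable<double[]> points,
                                       double minLat, double maxLat,
                                       double minLong, double maxLong,
                                       int cellsPerDegree) {
        int topRow = (int) (maxLat * cellsPerDegree);
        int leftCol = (int) (maxLong * cellsPerDegree);
        // +1 so that a point exactly on the minimum edge still has a cell.
        int rows = topRow - (int) (minLat * cellsPerDegree) + 1;
        int cols = leftCol - (int) (minLong * cellsPerDegree) + 1;
        double[][] matrix = new double[rows][cols];

        for (double[] p : points) {
            // Same truncating index arithmetic as in createLayer.
            int row = topRow - (int) (p[0] * cellsPerDegree);
            int col = leftCol - (int) (p[1] * cellsPerDegree);
            // Skip points outside the bounding box instead of throwing.
            if (row >= 0 && row < rows && col >= 0 && col < cols) {
                matrix[row][col] += 1;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Assumed bounding box and point count, chosen only for illustration.
        double minLat = 59.0, maxLat = 59.001;
        double minLong = 10.0, maxLong = 10.001;
        int n = 1000000;

        // Synthetic points generated in memory, so no database is involved.
        Random rnd = new Random(42);
        List<double[]> points = new ArrayList<double[]>(n);
        for (int i = 0; i < n; i++) {
            points.add(new double[] {
                minLat + rnd.nextDouble() * (maxLat - minLat),
                minLong + rnd.nextDouble() * (maxLong - minLong)
            });
        }

        long start = System.nanoTime();
        double[][] matrix = binPoints(points, minLat, maxLat, minLong, maxLong, 100000);
        long elapsedMs = (System.nanoTime() - start) / 1000000;

        System.out.println("Binned " + n + " points into a " + matrix.length + " x "
                + matrix[0].length + " grid in " + elapsedMs + " ms");
    }
}
```

If the in-memory run takes only a small fraction of the measured run time, that would be consistent with the numbers above, which attribute most of the time to fetching the documents rather than to the arithmetic.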

I looked at this post on SO, but since I'm not using Java 8, lambda expressions are unavailable.

So my question is: is iterable.forEach the fastest way to iterate through a large set of documents? What exactly does the FindIterable contain? Is iterable.forEach slow because it queries the database? Is the lambda way faster?

Edit: I updated the method with my calculations. They should not matter, because when I remove them the run time is still very high, as stated above.

kongshem
  • Not really sure what you are asking here. I presume you are saying that the query selection itself returns a dataset containing the number of documents you specify, but the real question is "what are you doing with the results?", i.e. "the minor calculations". If you realize that is the bottleneck then perhaps you should explain what those "minor calculations" are actually intended to do. There likely is a better way to process that than fetching all of the results from the server to your client. And that should reduce the time taken dramatically. – Blakes Seven Nov 12 '15 at 09:57
  • See my edit, my calculations are not the bottleneck here. I have to iterate through each document. – kongshem Nov 12 '15 at 11:18
  • Perhaps you should explain what your code is meant to be doing. At a glance, you seem to be basically "grouping" on latitude and longitude combinations to count the occurrences. Is this the case, or at least something along those lines? If so, then an aggregation operation of some sort on the server would seem more logical than iterating the collection, which of course is the largest performance hog if you are pulling all this data to the client just to calculate something like that. – Blakes Seven Nov 12 '15 at 11:25
  • Your understanding of the code is correct. Aggregation is also a solution I have been thinking about, good call. This question is meant to ask about what actually happens when doing the iterable.forEach and if there is a faster way of iterating through each document. – kongshem Nov 12 '15 at 11:35
  • No, there is no faster way. The constraint is basically the data transfer. If you keep the aggregation on the server then you have no such overhead. The gains are enormous. If you have any experience with other databases, such as relational SQL databases, then the difference is as plain as that between a `GROUP BY` and asking for all the documents in the collection and summing them up yourself. Indeed you could postulate "why even send query criteria?", when you could just test each result in your client to see if it meets the criteria you want. There is a reason databases support this. – Blakes Seven Nov 12 '15 at 11:41
  • Thanks for your thoughts. Do you have any literature to back up your statements? :) – kongshem Nov 13 '15 at 08:42
  • Excuse my extreme indignation, but "Are you really asking for that?" What could possibly make you think that asking for "all results" over a network connection and reducing them in client code would be more performant than asking a server process to "reduce results" for you? This is pretty basic stuff that you really need to understand. Documentation exists everywhere. Search for it. – Blakes Seven Nov 13 '15 at 08:46
  • I'm not sure if you understand, or if I misunderstand how MongoDB works. A query in MongoDB can return a FindIterable. This implements the Iterable interface, so one can use the built-in forEach method. I'm asking whether there is another way of getting a query result and iterating through it, other than using the FindIterable type and its built-in forEach method. I know the server has to process each result; I just wonder if the forEach method is slower than another possible iteration method. Clearer? – kongshem Nov 13 '15 at 12:19
