
I noticed that the Spring Data MongoDB API for find loads every result into a List. If a search matches a billion records, wouldn't that exhaust memory? Can someone suggest a better way of achieving this without loading everything into memory? Using limit can help, but then there is a flaw: the application would not know if a new document was inserted into the collection while results were being read. A plain find with a limit has the same problem if the collection is modified after reading X of a billion records.

So two questions:

  • How can I improve performance by not loading everything into memory?
  • How would you handle unknown documents added during processing?

Code from the Spring Data API:

// From Spring Data MongoDB's MongoTemplate internals: every matching
// document is converted and accumulated in one in-memory list before
// the find() call returns.
List<T> result = new ArrayList<T>();

while (cursor.hasNext()) {
    DBObject object = cursor.next();
    result.add(objectCallback.doWith(object));
}
– java_dude

1 Answer


How can I improve performance by not loading everything into memory?

The corresponding user interface for search results would normally have a limit on the number of results that need to be displayed (eg. results per page as well as overall results). I don't think there is any sensible use case to load an unbounded result set into memory, but a good safeguard would be to include a reasonable limit with your application queries.
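
For example, here is a minimal sketch of bounding a query with Spring Data's Query API (the mongoTemplate instance, Person class, field names, and paging variables are assumptions for illustration, not part of the original question):

import java.util.List;

import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// Bound every application query so an unexpectedly large match can never
// pull an unbounded result set into memory.
Query query = new Query(Criteria.where("status").is("active"))
        .skip(page * pageSize)   // offset of the current page
        .limit(pageSize);        // hard cap on documents held in memory
List<Person> pageOfResults = mongoTemplate.find(query, Person.class);

Note that skip-based paging still makes the server walk over the skipped documents, so range-based paging (sketched further below using _id) scales better for deep result sets.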

The MongoDB server returns query results in cursor batches that cannot exceed the maximum BSON document size (16MB as at MongoDB 3.0 .. and in fact normally 1MB for the first batch and 4MB for subsequent batches). You can build a larger result by continuing to iterate the cursor in your application code, but the implementation is your choice.
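
If you do need to walk a very large result set, one way to keep the iteration in application code without building a List is Spring Data's DocumentCallbackHandler. A sketch, where the empty query and the "person" collection name are assumptions for illustration:

import com.mongodb.DBObject;

import org.springframework.data.mongodb.core.DocumentCallbackHandler;
import org.springframework.data.mongodb.core.query.Query;

// executeQuery() drains the cursor batch by batch and hands each document
// to the callback, so only the current batch is resident in memory.
mongoTemplate.executeQuery(new Query(), "person", new DocumentCallbackHandler() {
    public void processDocument(DBObject document) {
        // process one document at a time instead of accumulating a List
        System.out.println(document.get("_id"));
    }
});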

How would you handle unknown documents added during processing?

Order your search results by a property of new documents that is monotonically increasing -- for example, the default generated ObjectId. Cursors (as at MongoDB 3.0) do not provide isolation from write activity, so documents that are inserted or updated during processing will also be included if applicable to the query order.

If your code is iterating a large query sorted by _id (ascending), new documents inserted using the default ObjectId should appear in the last batches.
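
A sketch of putting that together as a resumable scan (the lastSeenId and batchSize variables and the Person class are assumptions for illustration): remember the highest _id processed so far and restart the query above it, which also picks up documents inserted since the previous pass.

import java.util.List;

import org.bson.types.ObjectId;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// Resume an _id-ordered scan after lastSeenId; documents inserted later
// with a default generated ObjectId sort after it and are picked up.
Query resume = new Query(Criteria.where("_id").gt(lastSeenId))
        .with(new Sort(Sort.Direction.ASC, "_id"))
        .limit(batchSize);
List<Person> nextBatch = mongoTemplate.find(resume, Person.class);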

– Stennie
  • This is a real use case. At Expedia (your client), our group is also a data holder. Any group can request all the data from the beginning of time from us. We encourage fetching data every month, but there can be an instance where a group would just like to start fresh. These groups do analysis on historical data. So the way it was solved is using a bucket limit of 100,000 records. – java_dude Jun 24 '15 at 02:59
  • Which I believe streamed 7-10 million records in under 25 minutes between two data centers in two different locations. So the question is now more about how you fetch the new or updated documents that arrived during those 25 minutes. I can get new data by sorting on `_id`, but what about updates? Should I also sort by `update`? – java_dude Jun 24 '15 at 02:59
  • @java_dude Your original question only mentioned discovering new documents being inserted while iterating a large query; updates would have to be handled differently. You could use a sentinel value with a timestamp (eg. last modified date in documents) or set up a process to tail the oplog(s) for modified documents based on namespace and a query filter. There might be more efficient options depending on your data model, but that's a longer discussion than would work in comments here :). – Stennie Jun 24 '15 at 13:43
  • I agree that's another discussion, and I appreciate your response. – java_dude Jun 24 '15 at 14:39