
Let's suppose we have a huge MongoDB collection (around 60,000,000 records) and we want to scroll over it without, of course, fetching a huge amount into memory. With Hibernate ORM this is possible thanks to the scrolling API and criteria queries, but what could be a solution for pagination in OGM? (I've already spent a lot of time on this but couldn't find anything.) I'm indexing the objects in batches of 50,000 using the setFirstResult/setMaxResults API. These are the times in seconds for fetching the batches with a query like

List<Entity> results = em.createNativeQuery(query, Entity.class)
        .setFirstResult(i).setMaxResults(batchSize).getResultList();
results.forEach(res -> fullTextEntityManager.index(res));

incrementing i by batchSize (i += batchSize;) on each iteration.
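For reference, the whole loop boils down to something like the sketch below, with the fetch and index calls abstracted out (`fetchPage` and `indexAll` are names I made up for this sketch, not OGM API; `fetchPage` stands in for the `em.createNativeQuery(...).setFirstResult(i).setMaxResults(batchSize).getResultList()` call above):

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

public class OffsetIndexer {
    // The batching loop described above: fetch batchSize records starting at
    // offset i, index each of them, then advance i by batchSize until a page
    // comes back empty. fetchPage stands in for the JPA query with
    // setFirstResult(i)/setMaxResults(batchSize), and index stands in for
    // fullTextEntityManager::index.
    public static <T> long indexAll(BiFunction<Integer, Integer, List<T>> fetchPage,
                                    Consumer<T> index,
                                    int batchSize) {
        long total = 0;
        for (int i = 0; ; i += batchSize) {
            List<T> page = fetchPage.apply(i, batchSize);
            if (page.isEmpty()) {
                break;
            }
            page.forEach(index);
            total += page.size();
        }
        return total;
    }
}
```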

I already tried the OgmMassIndexer, but I need to be able to start/stop and index specific ranges, so I prefer to do it manually.

As is obvious and logical, the time to find the firstResult increases with each iteration. Here are the times in seconds to find each next batch of 50,000, starting from 4 million (setFirstResult(4000000).setMaxResults(50000)):

E.g. reaching offset 4,000,000 took 17 seconds, 4,050,000 took 15 seconds, 4,100,000 took 12 seconds, but later this number increases a lot:

Found: 17 Found: 15 Found: 12 Found: 13 Found: 13 Found: 13 Found: 15 Found: 16 Found: 16 Found: 17 Found: 18 Found: 18 Found: 19 Found: 19 Found: 20 Found: 20 Found: 21 Found: 21 Found: 22 Found: 21 Found: 22 Found: 23 Found: 23 Found: 23 Found: 24 Found: 24 Found: 25 Found: 25 Found: 26 Found: 26 Found: 27 Found: 28 Found: 27 Found: 29 Found: 29 Found: 30 Found: 31 Found: 32 Found: 33 Found: 30 Found: 33 Found: 32 Found: 34 Found: 34 Found: 35 Found: 35 Found: 38 Found: 36 Found: 38 Found: 36 Found: 41 Found: 41 Found: 39 Found: 41 Found: 41 Found: 40 Found: 42 Found: 43 Found: 42 Found: 44 Found: 44 Found: 45 Found: 47 Found: 45 Found: 44 Found: 44 Found: 47 Found: 44 Found: 47 Found: 47 Found: 50 Found: 52 Found: 93

Are there any options for scrolling MongoDB via an OGM cursor, or something similar, to fetch the objects into the session and index them efficiently? This approach isn't viable even for an app that just wants to paginate that amount of data using OGM without Hibernate Search, so I suppose there is a solution I'm not seeing.

Thanks a lot.

Hibernate OGM 5.3.1, Hibernate Search 5.9.0 using Elasticsearch

Panos
  • Hi, out of curiosity, what kind of query are you running? – Davide D'Alto Apr 12 '18 at 16:12
  • Basically, I'm wondering whether the fields involved in the query are indexed (I mean DB-indexed, not Search-indexed). – Davide D'Alto Apr 12 '18 at 16:28
  • TypedQuery query = em.createQuery("SELECT entity FROM Entity entity", Entity.class); String query = "db.entity_collection.find({ $query: {}, $orderby: { _id : -1 } })"; to retrieve in reverse order. These are the two queries I tried, plus some more using native queries. The result was always the same. I guess the second one, where the _id is indexed, is what you mean. Then to retrieve results in batches I'm using the way mentioned above. – Panos Apr 13 '18 at 15:31

3 Answers


Scrolling is not supported in OGM yet, because it requires the Criteria API, which isn't supported either.

That being said, you could implement your process differently.

I am going to assume that the process getting slower and slower comes from the query part (MongoDB having a harder and harder time finding the Nth result) and not from the indexing part (Elasticsearch having a harder and harder time adding documents to the index).

If this is the case, you could try "chunk" queries instead of pagination. The idea is to first retrieve the lowest and highest ID for the entity type you want to index, then, instead of paginating by offset, run queries with a condition similar to where ID between <last ID of the previous chunk + 1> and <last ID of the previous chunk + chunk size>.

Provided the ID field has an ascending index in MongoDB, this should get rid of performance getting worse over time.
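A minimal sketch of the chunking arithmetic, assuming numeric IDs (with MongoDB ObjectIds you would instead key each query off the last ID seen in the previous chunk; `ChunkPlanner` and `plan` are made-up names for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {
    // Split the inclusive ID range [firstId, lastId] into chunks of at most
    // chunkSize IDs. Each {lower, upper} pair then becomes a range query such
    // as { _id: { $gte: lower, $lte: upper } } (or "where id between lower
    // and upper"), which an ascending index on the ID field can answer
    // directly, without skipping over all the preceding documents the way an
    // offset-based query must.
    public static List<long[]> plan(long firstId, long lastId, long chunkSize) {
        List<long[]> chunks = new ArrayList<>();
        for (long lo = firstId; lo <= lastId; lo += chunkSize) {
            chunks.add(new long[] { lo, Math.min(lo + chunkSize - 1, lastId) });
        }
        return chunks;
    }
}
```

Each chunk query touches only its own ID range, so the cost per batch stays roughly constant instead of growing with the offset.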

yrodiere
  • Didn't try it yet but seems like a feasible solution to my problem. My implementation will use Hibernate Search of course after indexing the data to paginate fast. I just wanted to be sure that I'm not missing something :). Thanks a lot! – Panos Apr 13 '18 at 08:22

You need to collect some metrics to understand why it is slowing down; only then will we be able to suggest an effective solution.

GC

The first suspect is that your JVM is running out of memory. The MongoDB Java driver might be keeping hold of some of the data, possibly more than we'd expect. Could you enable GC logging on the JVM to verify how it's behaving, or attach a profiler to check whether memory usage stays within reasonable levels during the whole process?

Index size

Any Lucene or Elasticsearch index slows down a bit during writes as it grows. This slowdown shouldn't be very significant, so I don't think it's what you're observing, but to make sure the indexing process itself is not in the way, you could try running the same process with the blackhole backend:

hibernate.search.default.worker.backend = blackhole

N.B. this property requires not using the Elasticsearch indexmanager, so you'd have to temporarily switch the Hibernate Search configuration to the default Lucene indexing mode.
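Put together, the temporary measurement configuration could look like this (a sketch against the Hibernate Search 5.x property names; revert it once you've measured):

```properties
# Temporarily fall back to the default Lucene index manager by commenting out
# the Elasticsearch one, so the blackhole worker backend can be used:
# hibernate.search.default.indexmanager = elasticsearch
hibernate.search.default.worker.backend = blackhole
```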

Loading from MongoDB

This is the most likely problem, and I'll defer to Yoann's excellent suggestion for it; just make sure that this is actually the problem by checking the previous two points first.

Sanne
  • Checked about gc in visual vm-gc and is okay. Used the blackhole and still the same. The problem is the time to find the nth element, I'm sure and probably I'm going to use Yoanns solution. Thanks a lot Sanne. – Panos Apr 13 '18 at 08:27

As an alternative to the solutions mentioned previously, you can also extend MongoDBDialect and override the method forEachTuple.

This is the method that retrieves the data to index, so if you know in advance how to filter the data you need, it might be a solution.

You can then plug in your new dialect by setting the property: hibernate.ogm.datastore.grid_dialect
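For example (the dialect class name here is hypothetical; use the fully qualified name of your own subclass):

```properties
# Point OGM at the custom dialect; com.example.FilteringMongoDBDialect is a
# made-up name standing in for your MongoDBDialect subclass.
hibernate.ogm.datastore.grid_dialect = com.example.FilteringMongoDBDialect
```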

The forEachTuple method: https://github.com/hibernate/hibernate-ogm/blob/master/mongodb/src/main/java/org/hibernate/ogm/datastore/mongodb/MongoDBDialect.java#L848

The current supplier: https://github.com/hibernate/hibernate-ogm/blob/master/mongodb/src/main/java/org/hibernate/ogm/datastore/mongodb/MongoDBDialect.java#L1924

Right now this method collects all the data in a collection, so this approach will only work if you are not using the mass indexer anywhere else for a different purpose.

Davide D'Alto
  • Seems like most complex for me I think so for now I'll stick to Yoann's solution, good to know though. Thank you Davide :) – Panos Apr 13 '18 at 15:35
  • Yes, definitely. I would keep it as a last resort and I mentioned it here in case it might help others. Cheers – Davide D'Alto Apr 16 '18 at 17:04
  • UPDATE: I'm using your way in the end, as there is a need to use the cursor. It works perfectly: I use the Tuple next() of MongoDBResultsCursor to first move the cursor to the point where I stopped indexing (the last timestamp that was indexed; this takes some time), and then I continue fast using the MassIndexer from that point. A pretty robust solution for my problem, and easier to implement than it seems. Thanks a lot again! – Panos Apr 20 '18 at 12:40