For the last few days I've been working on building a production Hibernate Search cluster using the JGroups backend and the Infinispan directory provider (soft-index file store) on top of MongoDB (around 30 million records). Using the OGM MassIndexer in a standalone local WildFly worked well with almost no configuration for indexing. However, now that I've moved it to a remote Linux cluster, it fails even with the configurations suggested in several questions (like Indexing huge table with Hibernate Search).
As far as I can tell, the OGM MassIndexer doesn't accept a custom configuration:
2017-12-20 16:58:12,855 WARN [org.hibernate.ogm.massindex.impl.OgmMassIndexer] (default task-1) OGM000031: OgmMassIndexer doesn't support the configuration option 'threadsToLoadObjects'. Its setting will be ignored.
2017-12-20 16:58:12,854 WARN [org.hibernate.ogm.massindex.impl.OgmMassIndexer] (default task-1) OGM000031: OgmMassIndexer doesn't support the configuration option 'idFetchSize'. Its setting will be ignored.
After some digging I found THIS and understood that these options only apply to the ORM MassIndexer, not the OGM one, so I can't configure these properties to optimize the batch indexing job.
On my last attempts I always get a GC overhead limit exceeded error:
[Server:server-one] 17:18:26,987 ERROR [org.hibernate.search.exception.impl.LogErrorHandler] (Hibernate OGM: BatchIndexingWorkspace-1) HSEARCH000058: HSEARCH000116: Unexpected error during MassIndexer operation:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:720)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:734)
    at org.apache.lucene.index.IndexWriter.getAnalyzer(IndexWriter.java:1163)
    at org.hibernate.search.backend.impl.lucene.IndexWriterDelegate.<init>(IndexWriterDelegate.java:39)
    at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriterDelegate(AbstractWorkspaceImpl.java:217)
    at org.hibernate.search.backend.impl.lucene.LuceneBackendTaskStreamer.doWork(LuceneBackendTaskStreamer.java:44)
    at org.hibernate.search.backend.impl.lucene.WorkspaceHolder.applyStreamWork(WorkspaceHolder.java:74)
    at org.hibernate.search.indexes.spi.DirectoryBasedIndexManager.performStreamOperation(DirectoryBasedIndexManager.java:103)
    at org.hibernate.search.backend.impl.StreamingOperationExecutorSelector$AddSelectionExecutor.performStreamOperation(StreamingOperationExecutorSelector.java:106)
    at org.hibernate.search.backend.impl.batch.DefaultBatchBackend.sendWorkToShards(DefaultBatchBackend.java:73)
    at org.hibernate.search.backend.impl.batch.DefaultBatchBackend.enqueueAsyncWork(DefaultBatchBackend.java:49)
    at org.hibernate.ogm.massindex.impl.TupleIndexer.index(TupleIndexer.java:111)
    at org.hibernate.ogm.massindex.impl.TupleIndexer.index(TupleIndexer.java:89)
    at org.hibernate.ogm.massindex.impl.TupleIndexer.runIndexing(TupleIndexer.java:202)
    at org.hibernate.ogm.massindex.impl.TupleIndexer.run(TupleIndexer.java:192)
    at org.hibernate.ogm.massindex.impl.OptionallyWrapInJTATransaction.consumeInTransaction(OptionallyWrapInJTATransaction.java:128)
    at org.hibernate.ogm.massindex.impl.OptionallyWrapInJTATransaction.consume(OptionallyWrapInJTATransaction.java:97)
    at org.hibernate.ogm.datastore.mongodb.MongoDBDialect.forEachTuple(MongoDBDialect.java:762)
    at org.hibernate.ogm.dialect.impl.ForwardingGridDialect.forEachTuple(ForwardingGridDialect.java:168)
    at org.hibernate.ogm.massindex.impl.BatchIndexingWorkspace.run(BatchIndexingWorkspace.java:77)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:174)
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:57)
    at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:166)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:1041)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:456)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:198)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:193)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:95)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4086)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3666)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
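Since the root cause is the JVM running out of heap, one thing I can control is the heap size of the managed servers. A minimal sketch of what I mean, assuming WildFly domain mode (the [Server:server-one] prefix in the log) and host.xml; the sizes below are placeholders on my part, not tested recommendations:

<!-- host.xml: raise the heap of the managed server's JVM (sizes are placeholders) -->
<jvms>
    <jvm name="default">
        <heap size="2048m" max-size="4096m"/>
    </jvm>
</jvms>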
How I call the MassIndexer:
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.hibernate.CacheMode;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;

@PersistenceContext(name = "ogm-persistence")
EntityManager em;

public void createIndex() throws InterruptedException {
    FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
    ftem.createIndexer(EventEntity.class)
            .batchSizeToLoadObjects(30)
            .threadsToLoadObjects(4)     // ignored by OgmMassIndexer, see warnings above
            .cacheMode(CacheMode.NORMAL)
            .startAndWait();
}
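Since the OGM MassIndexer ignores those options anyway, I'm also considering a manual fallback that indexes in small pages and clears the persistence context between them. This is only a rough sketch under my own assumptions (EventEntity, a simple JPQL query that OGM can execute, an arbitrary page size of 100), not verified at 30 million records:

import java.util.List;

import javax.persistence.EntityManager;

import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;

public void indexManually(EntityManager em) {
    FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
    int pageSize = 100; // arbitrary; small pages to keep the heap flat
    int first = 0;
    List<EventEntity> page;
    do {
        page = ftem.createQuery("select e from EventEntity e", EventEntity.class)
                .setFirstResult(first)
                .setMaxResults(pageSize)
                .getResultList();
        for (EventEntity e : page) {
            ftem.index(e); // add this entity to the Lucene index
        }
        ftem.flushToIndexes(); // push the current batch to the backend
        ftem.clear();          // detach entities so they can be garbage collected
        first += pageSize;
    } while (!page.isEmpty());
}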
My persistence.xml:
<property name="hibernate.transaction.jta.platform" value="JBossAS" />
<property name="hibernate.ogm.datastore.provider" value="mongodb"/>
<property name="hibernate.ogm.datastore.database" value="*****"/>
<property name="hibernate.ogm.datastore.host" value="*******"/>
<property name="hibernate.ogm.datastore.port" value="27017"/>
<property name="hibernate.search.default.directory_provider" value="infinispan"/>
<property name="hibernate.search.default.worker.backend" value="jgroups"/>
<property name="hibernate.search.default.exclusive_index_use" value="false"/>
<property name="hibernate.search.lucene_version" value="LUCENE_CURRENT"/>
<property name="hibernate.search.default.optimizer.operation_limit.max" value="10000"/>
<property name="hibernate.search.default.optimizer.transaction_limit.max" value="1000"/>
<property name="hibernate.search.worker.execution" value="sync"/>
<property name="hibernate.search.reader.strategy" value="shared"/>
<property name="hibernate.search.infinispan.chunk_size" value="300000000"/>
<property name="wildfly.jpa.hibernate.search.module" value="none"/>
<property name="hibernate.search.infinispan.configuration_resourcename" value="infinispan-config.xml"/>
My infinispan-config.xml:
<cache-container name="hibernate-search" jndi-name="java:jboss/infinispan/container/hibernate-search">
<transport lock-timeout="330000"/>
<replicated-cache name="LuceneIndexesMetadata" mode="SYNC" remote-timeout="330000" >
<locking striping="false" acquire-timeout="330000" concurrency-level="500"/>
<transaction mode="NONE"/>
<expiration max-idle="-1"/>
<state-transfer timeout="480000"/>
<persistence passivation="true">
<soft-index-file-store xmlns="urn:infinispan:config:store:soft-index:8.0" preload="true" fetch-state="true" >
<index path="/var/LuceneIndexesMetadata/index" />
<data path="/var/LuceneIndexesMetadata/data" />
<write-behind/>
</soft-index-file-store>
</persistence>
</replicated-cache>
<replicated-cache name="LuceneIndexesData" mode="SYNC" remote-timeout="25000">
<locking striping="false" acquire-timeout="330000" concurrency-level="500"/>
<state-transfer timeout="480000"/>
<transaction mode="NONE"/>
<eviction strategy="LRU" max-entries="500"/>
<expiration max-idle="-1"/>
<persistence passivation="true">
<soft-index-file-store xmlns="urn:infinispan:config:store:soft-index:8.0" preload="true" fetch-state="true">
<index path="/var/LuceneIndexesData/index" />
<data sync-writes="true" path="/var/LuceneIndexesData/data" />
<write-behind/>
</soft-index-file-store>
</persistence>
</replicated-cache>
<replicated-cache name="LuceneIndexesLocking" mode="SYNC" remote-timeout="25000">
<locking striping="false" acquire-timeout="330000" concurrency-level="500"/>
<transaction mode="NONE"/>
<expiration max-idle="-1"/>
<state-transfer timeout="480000"/>
<persistence passivation="true">
<soft-index-file-store xmlns="urn:infinispan:config:store:soft-index:8.0" preload="true" fetch-state="true">
<index path="/var/LuceneIndexesLocking/index" />
<data path="/var/LuceneIndexesLocking/data" />
<write-behind/>
</soft-index-file-store>
</persistence>
</replicated-cache>
</cache-container>
I need to ensure that indexing will keep working for even more than 30 million records, that the index synchronizes cleanly when a new stateless node starts, and that the cluster can restart without rebuilding the whole index (persisted index). Any suggestions for possible architectures and changes to my code are welcome.
Thanks a lot.
WildFly 10, Hibernate Search 5.6.1, Infinispan 8.2.5, from the OGM 5.1 BOM.
Update:
This is a screenshot of VisualVM when I get the error: Java Heap Space
This is the heap dump file produced by VisualVM: heapdump