
TL;DR

I have a Vagrant-hosted Solr index on a Windows 10 machine that fails and stops responding (connection reset by peer) without any exceptions in the Solr logs. How can I start to debug what is going wrong?

Use Case/Problem

I am attempting to index a constant stream of user account data that has numerous deletes and updates per request. An update arrives on the stream every 4 to 5 seconds.

Everything seems to run smoothly until the Solr index reaches ~5.5 million records. Then it fails without any error or exception in the Solr logs. The error the client receives is a "connection reset by peer". Looking at the Solr VM, the Solr instance has stopped running.

Here is the output of ps -aux | grep solr right after Solr stops running:

 solr      3048  0.0  0.0  16256  3612 ?        Ss   17:23   0:00 /lib/systemd/systemd --user
 solr      3049  0.0  0.0 167420  3028 ?        S    17:23   0:00 (sd-pam)

Then after a minute or two, the processes above disappear and there are no more solr processes running.

On inspecting the solr logs there are no errors or exceptions found.

VM Details

Here is the relevant information about the Vagrant instance (Vagrantfile):

config.vm.box = "ubuntu/disco64"

...

config.vm.provider "virtualbox" do |v|
    v.memory = 4096  # 4 GB
    v.cpus = 4
end

The latest openjdk-8-jdk is installed.

Solr 8.2.0 is installed.

The Solr service is installed in /vagrant/solr, so in theory there should be plenty of disk space. The Vagrant instance lives on an SSD drive that has 216 GB of space left.
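
To double-check disk space from inside the guest, something like this should do (a quick sanity check; the data directory path is taken from the solr.solr.home setting shown further down):

 # Free space on the synced folder that holds the index
 df -h /vagrant

 # Size of the index data itself
 du -sh /vagrant/solr/data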

Solr Config

I have tried to follow this advice, Understanding Transaction Logs, Soft Commit and Commit in SolrCloud, for configuring my Solr index. I am trying to follow the "Heavy (bulk) indexing" and "Index-heavy, Query-light" strategies.

The only real value I've changed in the default solrconfig.xml is setting openSearcher to true for autoCommit. I made this change so I could watch the index grow and query some data as the user account stream is harvested.

<!-- AutoCommit

     Perform a hard commit automatically under certain conditions.
     Instead of enabling autoCommit, consider using "commitWithin"
     when adding documents.

     http://wiki.apache.org/solr/UpdateXmlMessages

     maxDocs - Maximum number of documents to add since the last
               commit before automatically triggering a new commit.

     maxTime - Maximum amount of time in ms that is allowed to pass
               since a document was added before automatically
               triggering a new commit.
     openSearcher - if false, the commit causes recent index changes
       to be flushed to stable storage, but does not cause a new
       searcher to be opened to make those changes visible.

     If the updateLog is enabled, then it's highly recommended to
     have some sort of hard autoCommit to limit the log size.
  -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>
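
For comparison, my understanding of the "Index-heavy, Query-light" setup from that article is a hard commit that keeps openSearcher at false, paired with a soft commit for visibility; roughly like this (a sketch of the article's recommendation, not my current configuration, and the 60-second soft-commit interval is just an example value):

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
</autoSoftCommit>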

I have increased the Solr heap to 2 GB. Here is the output of ps -aux | grep java when Solr is running:

java -server
     -Xms2056m
     -Xmx2056m
     -XX:+UseG1GC
     -XX:+PerfDisableSharedMem
     -XX:+ParallelRefProcEnabled
     -XX:MaxGCPauseMillis=250
     -XX:+UseLargePages
     -XX:+AlwaysPreTouch
     -verbose:gc
     -XX:+PrintHeapAtGC
     -XX:+PrintGCDetails
     -XX:+PrintGCDateStamps
     -XX:+PrintGCTimeStamps
     -XX:+PrintTenuringDistribution
     -XX:+PrintGCApplicationStoppedTime
     -Xloggc:/vagrant/solr//logs/solr_gc.log
     -XX:+UseGCLogFileRotation
     -XX:NumberOfGCLogFiles=9
     -XX:GCLogFileSize=20M
     -Dcom.sun.management.jmxremote
     -Dcom.sun.management.jmxremote.local.only=false
     -Dcom.sun.management.jmxremote.ssl=false
     -Dcom.sun.management.jmxremote.authenticate=false
     -Dcom.sun.management.jmxremote.port=18983
     -Dcom.sun.management.jmxremote.rmi.port=18983
     -Dsolr.log.dir=/vagrant/solr//logs
     -Djetty.port=8983
     -DSTOP.PORT=7983
     -DSTOP.KEY=solrrocks
     -Duser.timezone=UTC
     -Djetty.home=/opt/solr/server
     -Dsolr.solr.home=/vagrant/solr//data
     -Dsolr.data.home=
     -Dsolr.install.dir=/opt/solr
     -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf -Dlog4j.configurationFile=file:/vagrant/solr//log4j2.xml
     -Xss256k
     -Dsolr.jetty.https.port=8983
     -Dsolr.log.muteconsole
     -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /vagrant/solr//logs
     -jar start.jar
     --module=http
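
In case it matters, a heap bump like this is normally made in the Solr include script rather than by editing the start script; roughly the following (the /etc/default/solr.in.sh path is an assumption based on the standard install_solr_service.sh layout):

 # /etc/default/solr.in.sh -- path assumed from the standard service install
 # Matches the -Xms2056m/-Xmx2056m flags shown above:
 SOLR_JAVA_MEM="-Xms2056m -Xmx2056m"
 # Alternatively, SOLR_HEAP="2g" sets both -Xms and -Xmx to a round 2 GB.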

Other Background Information

I have worked with Solr before, but never this in-depth or with such aggressive data churn. My only real professional experience is adding a couple hundred thousand records into Solr, performing some simple queries, deleting the index, and then re-harvesting records back into the index...

Plea

Any friendly advice or comments on how to debug this problem would be greatly appreciated. I have searched and searched, but I cannot find anything that remotely looks like an answer to this problem.

Comments

  • Check the syslog - usually under `/var/log/syslog`. Processes that disappear without any trace are usually killed by the Out of Memory killer, which removes processes based on a heuristic. The Java process that hosts Solr is probably being killed off (and since it's the kernel doing it, there will be no messages logged by Java or Solr itself). – MatsLindh Oct 14 '19 at 07:15
  • Thank you @MatsLindh. I will check that file out. I have been doing a lot more reading, came across `oom_solr.sh`, and was trying to think of how to prove that an out-of-memory condition was causing the problem. I will look at `/var/log/syslog` and, if there is information there, I will write up a detailed answer. Thanks again! – hooknc Oct 14 '19 at 15:25
  • See https://stackoverflow.com/questions/36255110/solr-service-shut-down-for-no-apprent-reason/36298558#36298558 for some of the same symptoms. – MatsLindh Oct 14 '19 at 20:41
  • I was able to reproduce the problem and you were right, @MatsLindh: it is an out-of-memory error, and it did show up in `/var/log/syslog`. I will write up a more complete answer a bit later. Thanks for all your help and knowledge. They are both much appreciated. – hooknc Oct 18 '19 at 15:20
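
For anyone hitting the same symptoms, a minimal way to check for the OOM killer along the lines of MatsLindh's suggestion (a rough sketch, assuming a standard Ubuntu guest that logs kernel messages to /var/log/syslog):

 # Look for kernel OOM-killer activity around the time Solr disappeared
 grep -i "out of memory" /var/log/syslog
 grep -i "killed process" /var/log/syslog

 # The same information is usually visible in the kernel ring buffer
 dmesg -T | grep -iE "oom|killed process"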
