
I'm trying to rebuild the Lucene search index on a Jackrabbit 2.0 instance (actually a Day CRX 2.1 instance) so that I can apply new property boost weights for relevancy scoring. However, it repeatably aborts the indexing at the same point, count 3173000:

*INFO * MultiIndex: indexing... /content/xxxxxx/jcr:content (3173000) (MultiIndex.java, line 1209)
*INFO * RepositoryImpl: Shutting down repository... (RepositoryImpl.java, line 1139)

(company names redacted) leaving the CRX web instance showing

java.lang.IllegalStateException: The repository is not available.

There's no indication in the logs of why it's shutting down: there are no further lines between those two at any trace level. The path mentioned exists and is unremarkable, and since Jackrabbit only logs the path every 100 nodes, the failure could be at any of the next 100 nodes.

Any idea what could possibly have gone wrong, or how I can debug this?

(This, unfortunately, is one of those I'm-out-of-my-depth questions - I can't tell you much more because I don't know where to look.)

Rup
  • Do you really have a node called /content/xxxxxx/jcr:content? That looks a little fishy. Maybe try deleting or renaming that node? – David Gorsline May 25 '12 at 13:33
  • @David No, sorry, that was me redacting the log. The real path there contains company names, etc. and is a real path that exists in the repository. There's nothing unusual about it either. MultiIndex only logs every 100 entries so it could be anywhere in the next 100 nodes that causes it to fail. I'm rebuilding jackrabbit-core with more logging and I'll drop that in to see if I can see any more - I suspect there's an exception it's just failing to log. – Rup May 25 '12 at 14:14
  • One possible reason for a force shutdown is that the system run out of disk space. However, there should be a message in the log about that. Other than that, I don't know why else the repository would shut down. Except, if asked to shut down (Ctrl+C, stop using the stop script...). Could it be an out of memory problem, or (even less likely) too many open files? – Thomas Mueller May 27 '12 at 18:45
  • @Thomas Thanks for the ideas! Disk space: I doubt it - the index we were rebuilding was a few hundred MBs at most (definitely no more than 1.5GB) and we have >750GB free on that partition. I didn't watch it as it failed, though. The Java VM has 64 GB RAM. Too many open files - that's interesting, I hadn't thought of that. However I've left it running over the weekend with extra logging so I might have an answer in the logs tomorrow. – Rup May 27 '12 at 19:19

1 Answer

0

Thanks for everyone's suggestions in the comments. The problem was that we had some content with bad HTML: specifically an <li>, closed or not, inside a <select><option>:

<html><body><form>
  <select>
    <option value="1"><li></option>
  </select>
</form></body></html>

This kills javax.swing.text.html.parser.Parser with a StackOverflowError, which is a Throwable and so not caught by the error handling in Jackrabbit MultiIndex.
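To illustrate the distinction: StackOverflowError extends Error, not Exception, so a typical catch (Exception e) block never sees it. Here's a minimal sketch of feeding markup like the above into the Swing parser with a Throwable-level guard; ParserCheck and tryParse are hypothetical names, and whether a given JDK still crashes on this input depends on the JDK version, so the guard reports either outcome.

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;

public class ParserCheck {

    // Parse HTML with the Swing parser, guarding with Throwable so that
    // Errors (e.g. StackOverflowError) are caught as well as Exceptions.
    static String tryParse(String html) {
        try {
            new ParserDelegator().parse(new StringReader(html),
                    new HTMLEditorKit.ParserCallback(), true);
            return "parsed without error";
        } catch (Throwable t) {
            // Throwable covers both Exception and Error subclasses.
            return "parser failed: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String badHtml = "<html><body><form><select>"
                + "<option value=\"1\"><li></option>"
                + "</select></form></body></html>";
        System.out.println(tryParse(badHtml));
    }
}
```

Whether this prints "parsed without error" or a failure message depends on the JDK in use; the point is that the catch (Throwable t) branch is reached even when the parser dies with an Error rather than an Exception.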

I've reported the Parser crash to Oracle and I'll propose a patch to Jackrabbit core that adds extra try/catches around the indexing code to at least log the exact node with a problem and, where possible, recover from the error and carry on indexing. In the case of a StackOverflowError I think this is recoverable: by the time we're back in the exception handling code the stack has been unwound to a sensible depth.
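The recovery idea above can be sketched as follows. This is not Jackrabbit's actual code: indexNode and the runaway depth helper are hypothetical stand-ins for the per-node indexing call and the parser crash, showing that once the stack has unwound into the handler you can log the offending node path and continue.

```java
public class ThrowableDemo {

    // Stand-in for the parser crash: unbounded recursion that is
    // guaranteed to throw a StackOverflowError.
    static int depth(int n) {
        return depth(n + 1);
    }

    // Hypothetical per-node indexing step with the extra catch added.
    static String indexNode(String path) {
        try {
            depth(0);
            return "indexed " + path;
        } catch (Exception e) {
            // The existing Exception-only handling never reaches here for
            // a StackOverflowError, which is an Error, not an Exception.
            return "handled exception for " + path;
        } catch (StackOverflowError e) {
            // By the time control arrives here the stack has been unwound,
            // so logging the node and carrying on is safe.
            return "recovered from stack overflow at " + path;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexNode("/content/example/jcr:content"));
    }
}
```

Running this prints the "recovered from stack overflow" line rather than killing the process, which is the behaviour the proposed patch aims for.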

In practice I'm not going to be allowed to run a modified Jackrabbit in production here but at least I've identified and fixed the bad content so the same problem won't bite us there.

Rup