
An oldish site I'm maintaining uses Zend Lucene (ZF 1.7.2) as its search engine. I recently added two new tables to be indexed, together containing about 2000 rows of text data ranging between 31 bytes and 63 kB.

The indexing worked fine a few times, but after the third run or so it started terminating with a fatal error due to exhausting its allocated memory. The PHP memory limit was originally set to 16M, which was enough to index all the other content (200 rows of text at a few kilobytes each). I have gradually increased the memory limit to 160M, but it still isn't enough and I can't increase it any higher.

When indexing, I first need to clear the previously indexed results, because the path scheme contains numbers, which Lucene seems to treat as stopwords; it returns every entry when I run this search:

$this->index->find('url:/tablename/12345');
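
So the clearing step just wipes everything rather than deleting entries selectively; roughly like this (a simplified sketch of what I do):

// simplified sketch of the clearing step: delete every live document id,
// then commit so the deletions are flushed to disk
$max = $this->index->maxDoc();   // one greater than the largest document number
for ($id = 0; $id < $max; $id++) {
    if (!$this->index->isDeleted($id)) {
        $this->index->delete($id);
    }
}
$this->index->commit();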

After clearing all of the results I reinsert them one by one:

foreach ($urls as $v) {
    // build a document for each row and add it to the index
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $v['description']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $this->index->addDocument($doc);
}

After about a thousand iterations the indexer runs out of memory and crashes. Strangely, doubling the memory limit only buys me a few dozen more rows.

I've already tried adjusting the MergeFactor and MaxMergeDocs parameters (to 5 and 100, respectively) and calling $this->index->optimize() every 100 rows, but neither helps consistently.
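
For reference, this is roughly how those settings are applied; setMaxBufferedDocs() is an extra knob from the same family that I'm including here as an assumption, with a guessed value:

// roughly how the tuning is applied; setMaxBufferedDocs() is an assumed extra
// knob controlling how many documents are buffered in RAM before a flush
$this->index->setMergeFactor(5);
$this->index->setMaxMergeDocs(100);
// $this->index->setMaxBufferedDocs(5);   // assumption: smaller buffer, more flushes

foreach ($urls as $i => $v) {
    // ... addDocument() as above ...
    if ($i > 0 && $i % 100 == 0) {
        $this->index->optimize();          // every 100 rows
    }
}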

Clearing the whole search index and rebuilding it seems to result in a successful indexing most of the time, but I'd prefer a more elegant and less CPU-intensive solution. Is there something I'm doing wrong? Is it normal for the indexing to hog this much memory?

Kaivosukeltaja
  • I found that breaking the source data into smaller chunks and unsetting each recordset after indexing did the trick, i.e. several DB calls using `LIMIT` and `OFFSET` (roughly as sketched below) – gawpertron Jul 04 '12 at 16:54
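
A rough sketch of that chunked approach (the table name, batch size and the indexRow() helper are placeholders, not the original code):

// fetch and index the source data in LIMIT/OFFSET chunks, freeing each
// recordset before fetching the next one
$offset = 0;
$limit  = 200;
do {
    $select = $db->select()->from('pages')->limit($limit, $offset);
    $rows   = $db->fetchAll($select);
    foreach ($rows as $row) {
        indexRow($index, $row);   // hypothetical helper that adds one document
    }
    $fetched = count($rows);
    unset($rows);                 // release the recordset before the next chunk
    $offset += $limit;
} while ($fetched == $limit);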

1 Answer


I had a similar problem with a site I had to maintain that supported at least three different languages and had to re-index the same 10,000+ (and growing) localized documents separately for each locale (each locale using its own localized search engine). Suffice it to say that it usually failed somewhere within the second pass.

We ended up implementing an Ajax-based re-indexing process: the script is called a first time to initialize and start re-indexing, stops after a predefined number of processed documents, and returns a JSON value indicating whether it is finished, along with other progress information. We then call the same script again with those progress variables until it reports a completed state.

This also allowed us to show a progress bar of the process in the admin area.
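
A minimal sketch of that kind of batch endpoint (the batch size, index path and the loadBatch()/addToIndex() helpers are hypothetical, not the original code):

// process one batch per request and report progress as JSON
$batchSize = 200;
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

$index = Zend_Search_Lucene::open('/path/to/index');
$rows  = loadBatch($offset, $batchSize);      // e.g. a LIMIT/OFFSET query

foreach ($rows as $row) {
    addToIndex($index, $row);                 // builds and adds one document
}
$index->commit();

header('Content-Type: application/json');
echo json_encode(array(
    'done'   => count($rows) < $batchSize,    // true when nothing is left
    'offset' => $offset + count($rows),
));

The caller keeps re-requesting the endpoint with the returned offset until done is true, so each request stays well under the memory limit.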

For the cron job, we simply made a bash script that does the same task but uses exit codes instead.

This was about 3 years ago and nothing has failed since then.

Yanick Rochon