
I'm running a setup with Django 1.4, Haystack 2 beta, and ElasticSearch 0.20. My database is PostgreSQL 9.1, which contains several million records. When I try to index all of my data with Haystack/ElasticSearch, the process times out and I get a message that just says "Killed". So far I've noticed the following:

  1. I do get a count of the documents to be indexed, so I'm not getting an error like "0 documents to index".
  2. Indexing a small set, for example 1,000 records, works just fine.
  3. I've tried hardcoding the timeout in haystack/backends/__init__.py and that seems to have no effect.
  4. I've tried changing options in the elasticsearch.yml also to no avail.

If hardcoding the timeout doesn't work, then how else can I extend the time for indexing? Is there another way to change this directly in ElasticSearch? Or perhaps some batch processing method?
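
For reference, this is roughly the shape of the connection settings I've been experimenting with instead of editing the backend source. It's only an illustrative sketch, not my exact configuration, and I'm assuming the `TIMEOUT` key is what the ElasticSearch backend reads:

    # settings.py -- illustrative values only
    HAYSTACK_CONNECTIONS = {
        'default': {
            'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
            'URL': 'http://127.0.0.1:9200/',
            'INDEX_NAME': 'haystack',
            'TIMEOUT': 60 * 5,  # seconds; trying to raise this instead of hardcoding it
        },
    }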

Thanks in advance!

maximus

3 Answers


I'd venture that the issue is with generating the documents to send to ElasticSearch, and that using the `--batch-size` option will help you out.

The update method in the ElasticSearch backend prepares the documents to index from each provided queryset and then does a single bulk insert for that queryset.

    self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)
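
For context, the relevant part of that `update` method looks roughly like this (a simplified sketch of the 2.0 beta backend, with error handling and setup checks omitted):

    def update(self, index, iterable, commit=True):
        prepped_docs = []
        # Build an indexable dict for every object in the provided queryset.
        for obj in iterable:
            prepped_data = index.full_prepare(obj)
            final_data = {}
            # Convert Python values into something ElasticSearch can store.
            for key, value in prepped_data.items():
                final_data[key] = self._from_python(value)
            prepped_docs.append(final_data)
        # Then the whole batch goes out in a single bulk request.
        self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)

Every prepared document sits in `prepped_docs` until that bulk call, so memory use grows with the size of the queryset you hand it.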

So it looks like if you've got a table with millions of records, running update_index on that indexed model will mean generating those millions of documents and then indexing them. I would venture this is where the problem is. Setting a batch limit with the `--batch-size` option should restrict document generation to queryset slices of that size.
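
For example (the batch size here is arbitrary; tune it to your data):

    python manage.py update_index --batch-size=100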

bennylope
  • I tried; it isn't the batch size, although it is a good place to start. The default batch size is 1000, which shouldn't be a problem for a server with 4 GB of RAM on AWS. – maximus Feb 04 '13 at 06:22

This version of Haystack is buggy. The problem is caused by the following line in haystack/management/commands/update_index.py:

    pks_seen = set([smart_str(pk) for pk in qs.values_list('pk', flat=True)])

It causes the server to run out of memory. However, it does not seem to be needed for indexing, so I just changed it to:

    pks_seen = set([])

Now it's running through the batches. Thank you everyone that answered!
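
If you'd rather keep the duplicate-PK tracking, a less memory-hungry variant (just a sketch, I haven't benchmarked it on my data) is to stream the primary keys instead of materializing the whole list first:

    # .iterator() skips Django's queryset result cache, and the generator
    # expression avoids building an intermediate list; the set still holds
    # every PK, but the peak memory use is much lower.
    pks_seen = set(smart_str(pk) for pk in qs.values_list('pk', flat=True).iterator())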

maximus
  • My rebuild_index is running out of memory with Haystack 2.3.1. Do you know if this has been fixed? – Adrián May 13 '15 at 14:25

Have you watched the memory your process is consuming when you try to index all of those records? Typically when you see "Killed" it means that your system has run out of memory, and the OOM killer has decided to kill your process in order to free up system resources.
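
One quick way to confirm (a generic sketch; the exact log location varies by distro) is to check the kernel log for OOM-killer entries right after the process dies:

    dmesg | grep -i "killed process"

If the indexing process shows up there, the fix is to reduce the memory the indexing run needs (or add RAM/swap), not to raise any ElasticSearch timeout.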

girasquid
  • I haven't, but let's assume that's the case. How would I configure ElasticSearch to handle a huge data load? That's really the crux of the question. I know that if I were using Java to interface with a SQL database and needed to process a large dataset, I would use something like Spring Batch and process the data incrementally. It seems weird that ElasticSearch doesn't appear to have an equivalent method. – maximus Jan 28 '13 at 20:34
  • What command are you running to index all of your data? If that's the one dying, I would guess that's where the memory problem is. – girasquid Jan 28 '13 at 20:35
  • I'm running the command using Haystack as an abstraction layer. On the command line it's `python manage.py rebuild_index` – maximus Jan 28 '13 at 20:38
  • 1
    It looks like this is the code you'll want to look through: https://github.com/toastdriven/django-haystack/blob/master/haystack/management/commands/update_index.py. I would look into the batchsize option, which might help. – girasquid Jan 28 '13 at 20:44
  • 3
    Just weighing in to say that ES probably isn't the one falling over. ES can easily handle millions of documents and thousands of inserts per second. In personal benchmarking, I've hit 15-40k/sec using plain inserts, not even the bulk API. I don't know anything about Haystack, but I'd guess it is trying to stuff your entire recordset in memory. – Zach Jan 31 '13 at 17:09
  • 2
    The Haystack `update_index` command (used by `rebuild_index`) is pushing and building your documents, likely from search index templates. May not be the problem, but I'd start looking in that direction. – bennylope Feb 01 '13 at 14:48
  • 1
    @Zach you're right it isn't ES, it's the haystack layer giving the problem. – maximus Feb 04 '13 at 06:24