
I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-dupe events, I am planning to build an _id of about 100 characters by concatenating the following fields: entity_id, a UUID (37 chars) + event_creation_time (30 chars) + event_type (30 chars).
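
A minimal sketch of the concatenation described above, assuming a `|` separator and a fixed UTC timestamp format (both are illustrative choices, and `make_event_id` is a hypothetical helper name). The point is only that the same event always produces the same string, so re-sending it targets the same document:

```python
from datetime import datetime, timezone
from uuid import UUID


def make_event_id(entity_id: UUID, created_at: datetime, event_type: str) -> str:
    """Build a deterministic _id from the three fields (hypothetical helper)."""
    # Normalise to UTC so the same instant always yields the same string.
    # created_at is assumed to be timezone-aware.
    ts = created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    # "|" is an assumed separator; it must not occur inside any field.
    return f"{entity_id}|{ts}|{event_type}"


event_id = make_event_id(
    UUID("12345678-1234-5678-1234-567812345678"),
    datetime(2016, 1, 3, 17, 18, tzinfo=timezone.utc),
    "order_created",
)
print(event_id)
# 12345678-1234-5678-1234-567812345678|2016-01-03T17:18:00.000000Z|order_created
```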

This store will serve normal reads and writes along with aggregate queries (no updates or deletes). Can you please let me know whether such long string _ids would have any performance impact or other side effects compared with the default IDs?

Thanks, Harish

Harish

1 Answer


By default, the _id field is neither indexed nor stored, so there is no storage-related performance issue.

Since you will be indexing millions of documents, the main performance issue you will face is during bulk indexing. You have to make sure your _ids follow a sequential pattern. From the docs:

  • If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.
  • If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.
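
To get a rough feel for the compression point in the quote, here is a small sketch that compares zero-padded sequential IDs with UUID-4 IDs. It uses `zlib` purely as a stand-in; it does not reproduce how Lucene actually encodes its terms, so treat the numbers as an illustration of "sequential patterns compress well", not as a benchmark:

```python
import uuid
import zlib

N = 10_000

# Zero-padded sequential IDs: neighbours differ only in the last few digits.
sequential = [f"{i:020d}" for i in range(N)]

# UUID-4 IDs: essentially random. Sorted here so both lists are in order.
random_ids = sorted(str(uuid.uuid4()) for _ in range(N))


def compressed_ratio(ids):
    """Return compressed size / raw size for a newline-joined list of IDs."""
    raw = "\n".join(ids).encode("ascii")
    return len(zlib.compress(raw)) / len(raw)


seq_ratio = compressed_ratio(sequential)
rand_ratio = compressed_ratio(random_ids)
print(f"sequential: {seq_ratio:.2f}  uuid4: {rand_ratio:.2f}")
```

The sequential list should shrink far more than the UUID-4 list, because consecutive entries share long prefixes.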

In that blog post, long-time Lucene committer Michael McCandless compares different ways of generating _ids, and IMO it is one of the finest articles I have read.

Hope this helps!

ChintanShah25
  • I don't need to index the _id column, as I will never query it directly. However, if I understand correctly, the suggestion above of ensuring a sequential pattern is a general best practice, and my proposal of mixing those fields breaks it completely because they are randomly generated. So even if I don't use bulk indexing, I should not do what I described in my post, right? Please clarify. – Harish Jan 03 '16 at 17:18
  • Yes. If it is not application-critical, it is best to let ES handle _id generation. – ChintanShah25 Jan 03 '16 at 18:20
  • Ok. But identifying duplicate events is a primary requirement of my system, and so far this is the only way I have found to do it in Elasticsearch. Can you please suggest a better way to achieve the same? – Harish Jan 04 '16 at 04:56
  • Oh, in that case your approach is the right one. I do some hashing in my own application. Also, according to [this old post](https://discuss.elastic.co/t/maximum-length-of-a-specified-document-id/4262), there is no limit on the length of a document ID. – ChintanShah25 Jan 04 '16 at 14:21
  • Doesn't the blog post say that if we don't pick something friendly to Lucene then reads will be slow, but not writes? I am trying to understand how bulk writes would be slow if I were to pick, say, a random hash for every doc. – user1870400 Nov 09 '18 at 11:19
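
The hashing mentioned in the comments is not spelled out, so the following is only one possible sketch: it assumes SHA-256 over the three fields joined with `|`, and `hashed_event_id` is a hypothetical helper name. It gives a fixed-length, deterministic _id, at the cost that the result looks random, much like UUID-4, so the compression caveat from the docs quote still applies:

```python
import hashlib


def hashed_event_id(entity_id: str, created_at: str, event_type: str) -> str:
    """Derive a fixed-length _id from the composite key (hypothetical helper)."""
    # "|" is an assumed separator; it must not occur inside any field,
    # otherwise two different events could produce the same key.
    key = "|".join((entity_id, created_at, event_type))
    # SHA-256 hex digest: always 64 characters, same input gives same output.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


first = hashed_event_id(
    "12345678-1234-5678-1234-567812345678",
    "2016-01-03T17:18:00.000000Z",
    "order_created",
)
duplicate = hashed_event_id(
    "12345678-1234-5678-1234-567812345678",
    "2016-01-03T17:18:00.000000Z",
    "order_created",
)
other = hashed_event_id(
    "12345678-1234-5678-1234-567812345678",
    "2016-01-03T17:18:00.000000Z",
    "order_cancelled",
)
print(len(first), first == duplicate, first == other)
```

A duplicate event maps to the same _id, while any change in a field yields a different one.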