
I'm looking for an efficient, fast way to filter one Elasticsearch index, index A (about 40M documents), by the IDs from another, index B (about 3M documents), and then to delete from index B all documents that have no match in index A, using the filtered IDs.

An ID in index A looks like ABC1D2:XXX (where XXX is numeric). An ID in index B looks like ABC1D2.

What I've tried so far is to:

  1. Cache all IDs from index B
  2. Cache all IDs from index A
  3. Filter the index B IDs by the IDs from index A, then bulk-delete the documents from index B using the filtered IDs.

However, this takes more than 24 hours.

What is the best approach to achieve the same result faster? As far as I know, Elasticsearch has nothing like a SQL left join.
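
For reference, the three steps above can be sketched in plain Python roughly as follows. This is only a minimal sketch under assumptions: the function names (`base_id`, `orphaned_b_ids`, `delete_actions`) are hypothetical, the IDs are assumed to already be available as iterables of strings, and the action dictionaries assume the format accepted by the bulk helpers of the official Python client. Only the smaller set of index B IDs is held in memory while the index A IDs are streamed past it.

```python
from typing import Dict, Iterable, Iterator, List


def base_id(a_id: str) -> str:
    """Strip the ':XXX' suffix from an index A ID ('ABC1D2:123' -> 'ABC1D2')."""
    return a_id.split(":", 1)[0]


def orphaned_b_ids(a_ids: Iterable[str], b_ids: Iterable[str]) -> List[str]:
    """Return the index B IDs that have no matching prefix in index A."""
    # Hold only the ~3M index B IDs in memory.
    remaining = set(b_ids)
    # Stream the ~40M index A IDs and cross off every B ID that is matched.
    for a_id in a_ids:
        remaining.discard(base_id(a_id))
    # Sorted only to make the result deterministic.
    return sorted(remaining)


def delete_actions(index: str, ids: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one bulk delete action per ID (assumed bulk-helper action format)."""
    for doc_id in ids:
        yield {"_op_type": "delete", "_index": index, "_id": doc_id}
```

With the official client, the generator from `delete_actions` could presumably be handed to `elasticsearch.helpers.bulk` or `streaming_bulk`, so that the deletes are sent in batches rather than one request per document.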

  • I found in this thread, https://stackoverflow.com/questions/17497075/efficient-way-to-retrieve-all-ids-in-elasticsearch, that stored_fields can be used to retrieve only the metadata. This significantly improved the speed. However, if someone can advise on further improvements, I would appreciate it. Thank you! – Georgi Georgiev Feb 17 '21 at 15:04
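
A request body along the lines of that comment might look like the sketch below. The function name `id_only_search_body` and the `page_size` default are hypothetical; the empty `stored_fields` list is the setting from the linked thread, and sorting by `_doc` is an extra assumption on my part, since it is generally the cheapest order for paging through a whole index.

```python
from typing import Any, Dict


def id_only_search_body(page_size: int = 5000) -> Dict[str, Any]:
    """Build a search body that asks for document metadata (such as _id) only."""
    return {
        "query": {"match_all": {}},  # visit every document in the index
        "stored_fields": [],         # no stored fields, so no _source is fetched
        "sort": ["_doc"],            # index order, no scoring or sorting cost
        "size": page_size,           # hits per page (hypothetical default)
    }
```

The resulting dictionary could then be used as the body of a scroll-style search, reading `_id` from each hit.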
