I'm looking for a fast, efficient way to filter one Elasticsearch index, index A (about 40M documents), by the IDs from another, index B (about 3M documents), and then to delete from index B every document that has no match in index A.
An ID in index A looks like ABC1D2:XXX (where XXX is a number). An ID in index B looks like ABC1D2.
What I've tried so far is to:
- Cache all IDs from index B
- Cache all IDs from index A
- Filter the index B IDs against the index A IDs, then bulk delete the index B documents whose IDs have no match in index A.
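
Roughly, the steps above amount to the following. This is a simplified sketch rather than my exact code: the function names and the `index_b` name are illustrative, the ID lists are assumed to have been collected already by scrolling both indices, and the delete actions are assumed to be fed to the Python client's bulk helper.

```python
from typing import Iterable, Iterator, List


def base_id(a_id: str) -> str:
    """Strip the ':XXX' suffix from an index A ID, e.g. 'ABC1D2:123' -> 'ABC1D2'."""
    return a_id.split(":", 1)[0]


def orphaned_b_ids(a_ids: Iterable[str], b_ids: Iterable[str]) -> List[str]:
    """Return the index B IDs that have no matching document in index A."""
    # Collapse the ~40M index A IDs to the set of distinct base IDs they refer to.
    referenced = {base_id(a_id) for a_id in a_ids}
    # Keep only the index B IDs that never appear as a base ID in index A.
    return [b_id for b_id in b_ids if b_id not in referenced]


def bulk_delete_actions(index_b: str, ids: Iterable[str]) -> Iterator[dict]:
    """Yield one delete action per ID, in the dict shape the bulk helper accepts."""
    for doc_id in ids:
        yield {"_op_type": "delete", "_index": index_b, "_id": doc_id}


# Tiny made-up example: only QQQ0Z0 has no counterpart in index A.
a_ids = ["ABC1D2:1", "ABC1D2:2", "XYZ9Q8:7"]
b_ids = ["ABC1D2", "XYZ9Q8", "QQQ0Z0"]
to_delete = orphaned_b_ids(a_ids, b_ids)
actions = list(bulk_delete_actions("index_b", to_delete))
```

The set difference itself is cheap; most of the time goes into pulling all 40M IDs out of index A.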
However, this approach takes 24+ hours.
What is the best way to achieve the same result faster? As far as I know, Elasticsearch has no equivalent of a SQL left join.