Please read this before marking it as a duplicate. I know the usual ways to do this, and I have read other questions on the same topic, but all the solutions I found are O(n^2).
Suppose we have two lists of dictionaries like this:
source = [
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
]
target = [
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
{'a': 'value', 'b': 'value', 'c': 'value'},
]
What I want to do is filter source, keeping only the dictionaries whose value for key b matches the value of key b in any dictionary from target.
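To make the requirement concrete, here is a minimal sketch of one linear-time approach, assuming the values stored under b are hashable (strings, numbers, tuples). The helper name `filter_by_key` and the sample values are made up for illustration. It collects the b values of target into a set once, then makes a single pass over source, so the cost is roughly O(len(source) + len(target)) on average.

```python
def filter_by_key(source, target, key="b"):
    """Keep the rows of source whose `key` value also appears in target."""
    # Build the lookup set once: O(len(target)).
    target_values = {row[key] for row in target}
    # Single pass over source; set membership is O(1) on average.
    return [row for row in source if row[key] in target_values]


# Hypothetical sample data, only to show the behaviour.
source = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 2, "b": "y", "c": 20},
    {"a": 3, "b": "z", "c": 30},
]
target = [
    {"a": 7, "b": "y", "c": 70},
    {"a": 8, "b": "z", "c": 80},
]

matches = filter_by_key(source, target)
```

Here `matches` holds the second and third rows of source, because only their b values ("y" and "z") occur in target.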
None of the questions, answers and discussions I have seen so far offer an approach that is efficient on a large data set. I expect the number of dictionaries to be in the millions. Source and target come from two different MySQL databases hosted on separate AWS RDS instances. I am trying to find matching records so that I can update a record if it already exists or insert it if it does not, which is why I need this filter.
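The update-or-insert split described above could be sketched like this, again assuming b is hashable and that b identifies at most one row in target (if it repeats, the last row wins in this sketch). The names `split_for_upsert`, `src_rows` and `tgt_rows` are hypothetical, and the actual database writes are left out.

```python
def split_for_upsert(source, target, key="b"):
    """Split source into rows to update and rows to insert, in one pass."""
    # Index target by key so each lookup is O(1) on average.
    target_by_key = {row[key]: row for row in target}

    to_update = []  # (source_row, existing_target_row) pairs
    to_insert = []  # source rows with no match in target
    for row in source:
        existing = target_by_key.get(row[key])
        if existing is not None:
            to_update.append((row, existing))
        else:
            to_insert.append(row)
    return to_update, to_insert


# Hypothetical sample data.
src_rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 2, "b": "y", "c": 20},
]
tgt_rows = [
    {"a": 7, "b": "y", "c": 70},
]

to_update, to_insert = split_for_upsert(src_rows, tgt_rows)
```

With this data, the row with b equal to "y" ends up in `to_update`, paired with its counterpart from target, and the row with b equal to "x" ends up in `to_insert`. A dict is used instead of a set here so that the existing target row is available when building the update.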
I am not even sure this can be achieved in O(n). If it cannot, what is the most optimised way to do it? Please also let me know if performance can be improved by using a different data structure.