Please read before marking as a duplicate. I know the usual ways to do this and I have read other Stack Overflow questions on the same topic, but all the solutions are O(n²).

Suppose we have two lists of dictionaries like this.

source = [
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
]
target = [
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
    {'a': 'value', 'b': 'value', 'c': 'value'},
]

What I want to do here is filter the dictionaries in source whose key b has the same value as key b of any dictionary in target.

So far, all the questions, answers and discussions I have seen on this are not very efficient on large data sets, and I am expecting the dictionaries to number in the millions. source and target come from two different MySQL databases hosted on separate AWS RDS instances. I am trying to find matching data and update it if it is there, or insert it if it is not, which is why I need this filter.

I am not even sure whether this can be achieved in O(n). Either way, what is the most optimised way to do this? Please also let me know if performance can be improved by using a different data structure.

Eternal

2 Answers

This definitely isn't the optimal data format.

If all you care about is whether there is any record with the same b, there's no reason to fetch the full target records. Instead (pseudocode), you could get a set of the distinct b values you care about.

target_bs = set(get_first_column("SELECT DISTINCT b FROM target"))

Then you can do (again, pseudocode as far as querying is concerned)

for record in get_records("SELECT * FROM source"):
    if record["b"] not in target_bs:  # efficient set lookup
        # ... insert...
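For concreteness, here is a minimal runnable sketch of the same idea, assuming PyMySQL and made-up connection details; the hypothetical get_first_column/get_records helpers above become plain cursor calls.

import pymysql
from pymysql.cursors import DictCursor

# Hypothetical connection details; replace with the real RDS hosts/credentials.
target_conn = pymysql.connect(host="target-host", user="user", password="secret", database="targetdb")
source_conn = pymysql.connect(host="source-host", user="user", password="secret", database="sourcedb")

# One pass over target: only the distinct b values, kept in a set for O(1) lookups.
with target_conn.cursor() as cur:
    cur.execute("SELECT DISTINCT b FROM target")
    target_bs = {row[0] for row in cur.fetchall()}

# One pass over source: rows come back as dicts so record["b"] works as above.
to_insert = []
with source_conn.cursor(DictCursor) as cur:
    cur.execute("SELECT * FROM source")
    for record in cur:
        if record["b"] not in target_bs:
            to_insert.append(record)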

Naturally this would be much more efficient if you were able to connect the two databases and simply do this there (with e.g. a temporary table of bs) instead of roundtripping through Python.
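For example, if both tables were reachable from a single connection (same MySQL server, or one exposed to the other via a FEDERATED table), and assuming target.b has a UNIQUE index, the whole update-or-insert collapses into one statement; a sketch reusing the hypothetical execute helper, with db as that shared connection and the schema names assumed here:

# Assumes both schemas are visible from one connection and that target.b
# has a UNIQUE index so ON DUPLICATE KEY UPDATE can match on it.
execute(db, """
    INSERT INTO targetdb.target (a, b, c)
    SELECT s.a, s.b, s.c FROM sourcedb.source AS s
    ON DUPLICATE KEY UPDATE a = VALUES(a), c = VALUES(c)
""")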

EDIT: Speaking of temporary tables, another option would be

# Get list of unique Bs from first database
existing_bs = set(get_first_column(db1, "SELECT DISTINCT b FROM target"))
# Create and populate a temporary table for them in the second database
execute(db2, "CREATE TEMPORARY TABLE bs (b SOMEDATATYPE PRIMARY KEY)")
insert_many(db2, "INSERT INTO bs (b) VALUES (?)", existing_bs)

# Query all records where there is no matching `b` in that temporary table
for record in get_records(db2, "SELECT * FROM source LEFT JOIN bs ON (source.b = bs.b) WHERE bs.b IS NULL"):
    # do something with the record
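A concrete counterpart of that sketch, again assuming PyMySQL (which uses %s placeholders rather than ?), that db1/db2 are PyMySQL connections to the two RDS instances, and guessing VARCHAR(255) for b's real column type:

from pymysql.cursors import DictCursor

# Get the distinct b values from the target database (db1).
with db1.cursor() as cur:
    cur.execute("SELECT DISTINCT b FROM target")
    existing_bs = {row[0] for row in cur.fetchall()}

# Load them into a temporary table on the source database (db2) and anti-join.
with db2.cursor(DictCursor) as cur:
    cur.execute("CREATE TEMPORARY TABLE bs (b VARCHAR(255) PRIMARY KEY)")
    cur.executemany("INSERT INTO bs (b) VALUES (%s)", [(b,) for b in existing_bs])
    cur.execute(
        "SELECT source.* FROM source "
        "LEFT JOIN bs ON source.b = bs.b "
        "WHERE bs.b IS NULL"
    )
    for record in cur.fetchall():
        pass  # records whose b is missing from target: candidates for insert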
AKX
  • Here also, with the `in` operator the complexity will reach n squared. Also, I want to keep both the matching and the non-matching records, just in different variables – Eternal Feb 16 '21 at 08:59
  • @Eternal No, the average complexity of set lookups is O(1); see e.g. https://stackoverflow.com/a/44080017/51685 – AKX Feb 16 '21 at 09:00
  • While inserting I will need the whole data from the source db, not just the key I am comparing on, so I would have to read it inside the if, which means more queries. The purpose of doing this in Python is to complete the whole update/insert process using just four database queries: read data from source and target, filter into update/insert, execute batch update/insert – Eternal Feb 16 '21 at 09:08
  • Yes, the pseudocode above would read the full source record, but only the `b` field for the target records. – AKX Feb 16 '21 at 09:19

My solution is pretty simple: use a set to store all the values of the 'b' key from one list, then filter the other list by checking membership in that set.

The resulting complexity is O(2n), which is O(n): one pass to build the set and one pass to filter.

source = [
    {'a': '1', 'b': 'ad', 'c': 'value'},
    {'a': '2', 'b': 'xs', 'c': 'value'},
    {'a': '3', 'b': '3sda', 'c': 'value'},
    {'a': '4', 'b': 'va', 'c': 'value'},
    {'a': '5', 'b': 'gnghng', 'c': 'value'},
    {'a': '6', 'b': 'nhgnhg', 'c': 'value'},
    {'a': '7', 'b': 'nhnhg', 'c': 'value'},
]
target = [
    {'a': '10', 'b': 'ad', 'c': 'value'},
    {'a': '13', 'b': 'adadas', 'c': 'value'},
    {'a': '16', 'b': 'asda', 'c': 'value'},
    {'a': '18', 'b': 'xs', 'c': 'value'},
    {'a': '21', 'b': 'value', 'c': 'value'},
    {'a': '24', 'b': 'va', 'c': 'value'},
    {'a': '27', 'b': 'sads', 'c': 'value'},
]

source_bs = {item['b'] for item in source}  # every 'b' value in source
matching = [item for item in target if item['b'] in source_bs]  # O(1) lookup per item
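To match the direction asked in the question, and to keep both the matching and non-matching records, one could instead build the set from target and split source:

target_bs = {item['b'] for item in target}  # every 'b' value in target
to_update = [item for item in source if item['b'] in target_bs]      # b already in target
to_insert = [item for item in source if item['b'] not in target_bs]  # b missing from target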
Daniel Bibik
  • Thanks man! This approach works; I have slightly modified it, but yes, this is the general solution – Eternal Feb 16 '21 at 09:52