I have a script that retrieves user data from a CSV (~2.5m) and record data from Salesforce via API (~2m) and matches them based on a unique user_id
.
For each user, I need the relevant record_id
(if it exists). There is a one-to-one relationship with users and records, so the user_id
should only appear on 1 record.
To try and increase performance both lists are sorted ascending by user_id
, and I break the loop if record['user_id'] > user['user_id']
as that means there is no relevant record.
It's working, however it's slow when trying to match the 2 datasets taking ~1.5hrs. Is there a faster method of performing the matching to retrieve the relevant record_id
?
Here is an example of the data, current function, and expected result:
users = [
{"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
{"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
{"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]
records = [
{"user_id": 11111, "record_id": "ABC123"},
{"user_id": 33333, "record_id": "GHI789"}
]
upload = []
for user in users:
for record in records:
if user['user_id'] == record['user_id']:
user['record_id'] = record['record_id']
records.remove(record)
break
elif record['user_id'] > user['user_id']:
break
if user['record_id']:
upload.append(user)
print(upload)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]