
I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via its API (~2m rows), and matches them on a unique user_id.

For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so each user_id should appear on at most one record.

To try to increase performance, both lists are sorted ascending by user_id, and I break out of the inner loop if record['user_id'] > user['user_id'], since that means there is no matching record for that user.

It works, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster way to perform the matching and retrieve the relevant record_id?

Here is an example of the data, current function, and expected result:

users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
    {"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
    {"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]

records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]

upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            # Match found: copy the record_id across, drop the record, and stop scanning
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            # Both lists are sorted ascending, so no matching record exists for this user
            break
    if user['record_id']:
        upload.append(user)

print(upload)

This outputs:

[
 {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'}, 
 {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
Matty

4 Answers


Create a dictionary that maps each user_id to its corresponding user dictionary. Then you can add the relevant record_id fields with a for loop over the records. Finally, you can drop the entries that never received a record_id using a list comprehension.

This doesn't require any preprocessing (e.g. sorting) to obtain a speedup; the efficiency gain comes from the fact that a dictionary lookup is O(1) on average, whereas scanning a large list is O(n):

# Map each user_id to its user dict so lookups are O(1) on average
user_id_mapping = {entry["user_id"]: entry for entry in users}

# Fill in record_id for every user that has a matching record
for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

# Keep only the users that were assigned a record_id
result = [item for item in user_id_mapping.values() if item["record_id"] is not None]

print(result)

This outputs:

[
 {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'}, 
 {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]

With this being said, if you have to execute this kind of operation repeatedly, I would recommend using some sort of database rather than performing the join in Python.
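
For example, here is a minimal sketch of that idea using Python's built-in sqlite3 module and the sample data from the question (an in-memory database and made-up table names, purely for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would give you a persistent database
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)")

conn.executemany(
    "INSERT INTO users VALUES (:user_id, :name, :age)",
    [{"user_id": 11111, "name": "Customer A", "age": 34},
     {"user_id": 22222, "name": "Customer B", "age": 18},
     {"user_id": 33333, "name": "Customer C", "age": 66}],
)
conn.executemany(
    "INSERT INTO records VALUES (:user_id, :record_id)",
    [{"user_id": 11111, "record_id": "ABC123"},
     {"user_id": 33333, "record_id": "GHI789"}],
)

# An inner join does the matching; the primary-key indexes keep the lookups cheap
upload = conn.execute(
    "SELECT u.user_id, u.name, u.age, r.record_id "
    "FROM users u JOIN records r ON r.user_id = u.user_id"
).fetchall()
print(upload)  # [(11111, 'Customer A', 34, 'ABC123'), (33333, 'Customer C', 66, 'GHI789')]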

BrokenBenchmark
  • Alternatively for the `result = ` line, you could use a list comprehension: `result = [x for x in user_id_mapping.values() if x["record_id"] is not None]`. – alani May 24 '22 at 23:38
  • Edited. Why is `filter()` slower than using a list comprehension, if I may ask? – BrokenBenchmark May 24 '22 at 23:40
  • I'd say mainly the function call overhead. – Kelly Bundy May 24 '22 at 23:41
  • To be honest, my list comprehension suggestion was for sake of readability, although I accept that that may be subjective. – alani May 24 '22 at 23:42
  • @alani Yes, I much prefer it for readability as well, I just focused on speed since that's the question's objective. – Kelly Bundy May 24 '22 at 23:43
  • @KellyBundy Is the cost of the function call that significant if the list we're processing has millions of entries? I'm not trying to debate anything here, I'm genuinely curious. – BrokenBenchmark May 24 '22 at 23:45
  • @BrokenBenchmark I don't want to make any strong claims about speed without actually benchmarking it, but I guess that it is the calling the lambda function repeatedly which is the overhead. Clearly `filter` itself is only called once. – alani May 24 '22 at 23:50
  • @alani Yes, the lambda function calls are what I meant. Looks like [more than twice as fast](https://tio.run/##dZJBasMwEEX3OsXgje3iFqcuIRi6aLPvBUIQSiwlAksyIxlSSs7ueqyQlMT9K@vr64004@47HJ2tVh0Og0JnIGgjdQBtOocBGqlE3wZOLoLw0zYyNvrQah@40m2QyFthdo3Iei/R5zWDUShDj3ZKZTGVxRScajhtEpR7hw3XTbIF7cG6AF/OygIiJP9TZO9Mh/IordfOzhXZ6CANKIcwfWgbIaDVZPxbbMtYDL7DZuL9JLSmXA0LUgGJFUaOy2Td@@CoDR/J6IoDmdVbAekVntaQKufSc/EAeyXNwT5vsMXqDkZXnGFVpDnW@sZaLu8vthOYntkWnmBVlrwsS8aoYZy6hcIeZFZdmkq26u2edh6HXMzMJJ4jhXLs5fSTZPnVJNhlbFevQ21DdknC83iwmHIvnNO7OI/RGMuH4Rc). – Kelly Bundy May 24 '22 at 23:55
  • (Note I iterated over a repeated `users` list instead, as that was simpler to set up. It might also be slightly *faster* than iterating over dict values.) – Kelly Bundy May 24 '22 at 23:55
  • Your answer is similar to my answer.. maybe a bit sexier in some ways ;) But I think that your approach of first filling in the values and then later building the final list with the list comprehension is going to be slower. You're iterating all n-million rows to find the users that have non-null records, but you already knew that when you iterated the first time, so the second iteration is not necessary. In addition, you're doing more dictionary lookups than you really need to. I'm sure that this is highly optimized in python but it still has a cost. – little_birdie May 25 '22 at 20:37
  • Sure, but at the end of the day, speed doesn't matter too much if this is a one-time job. If this operation needs to be done repeatedly or regularly, Python is likely not the best tool for this task (pulling directly from a database would be more desirable). – BrokenBenchmark May 25 '22 at 21:28
  • The question we're trying to answer.. the one that was asked.. is how to make it faster. – little_birdie May 25 '22 at 22:01
  • Right, I think there's a big difference between time on the order of hours versus the order of minutes. But for this case, I wouldn't be too worried about the difference between, say, 6.4 and 6.8 seconds, and even if I was, I would look into using something other than Python for this operation, which is what I discussed at the bottom of my original answer. – BrokenBenchmark May 25 '22 at 22:05
  • @BrokenBenchmark thanks for this answer, I have implemented this into our code and the time taken to match the data has dropped from ~2hrs to 1.5 seconds. Could not be happier! – Matty May 25 '22 at 23:26

You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:

import pandas as pd

users = pd.read_csv('csv file')             # path to your users CSV
records = pd.DataFrame(salesforce_results)  # salesforce_results: the list of dicts returned by your Salesforce query

result = users.drop('record_id', axis=1).merge(records, on='user_id')

If you want to keep the users which have no matching value in records, change the merge to

merge(records, on='user_id', how='left')

To output the result as a list of dictionaries, use to_dict():

result.to_dict('records')
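
As a concrete, runnable sketch of the same idea using the sample lists from the question (no CSV or Salesforce call involved):

import pandas as pd

users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, "record_id": None},
    {"user_id": 22222, "name": "Customer B", "age": 18, "record_id": None},
    {"user_id": 33333, "name": "Customer C", "age": 66, "record_id": None},
]
records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"},
]

# Drop the empty record_id column so the merge can bring in the real one from records
result = pd.DataFrame(users).drop('record_id', axis=1).merge(pd.DataFrame(records), on='user_id')
print(result.to_dict('records'))
# [{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
#  {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}]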

Note - it may be possible to execute your Salesforce query directly into a dataframe. See for example this Q&A
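
One way to do that (a sketch only; it assumes the third-party simple_salesforce package, and the credentials, object name and field names below are placeholders):

import pandas as pd
from simple_salesforce import Salesforce

sf = Salesforce(username='user@example.com', password='...', security_token='...')
response = sf.query_all("SELECT User_Id__c, Id FROM Record__c")

# query_all() returns a dict whose 'records' key is a list of dicts;
# each record carries an extra 'attributes' entry that we don't need in the dataframe
records = pd.DataFrame(response['records']).drop(columns='attributes')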

Nick

Your approach isn't unreasonable, but removing each record from the list after it's used has a cost (list.remove is a linear scan), and sorting the two lists ahead of time has a cost too. These costs add up more than you might think.

One possible approach would be to NOT sort the lists at all, but instead build a dict mapping user_id to record_id, e.g.:

# Build the lookup once: user_id -> record_id
rdict = { r['user_id']:r['record_id'] for r in records }

upload = []
for user in users:
    user_id = user['user_id']
    record_id = rdict.get(user_id)
    if record_id:
        user['record_id'] = record_id
        upload.append(user)

This way you're paying the price once to build the dict, and every lookup after that is effectively constant time.

little_birdie

For scalability, you can use pandas dataframes, like so:

result = pd.merge(pd.DataFrame(users).drop(columns='record_id'), pd.DataFrame(records), on='user_id').to_dict('records')

If you want to keep the entries which do not have a record_id, you can add how='left' to the arguments of the merge function.
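
For example, a minimal variant of the line above (the placeholder record_id column is dropped from users so the merge doesn't produce record_id_x / record_id_y suffixes; users without a matching record come back with NaN in record_id):

import pandas as pd

result = pd.merge(
    pd.DataFrame(users).drop(columns='record_id'),
    pd.DataFrame(records),
    on='user_id',
    how='left',
).to_dict('records')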

elbashmubarmeg