
What would be the fastest way to check for existing entries and prevent adding duplicates to a list such as this one:

{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 13, 586785), 'value': Decimal('42362.34'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 17, 827149), 'value': Decimal('42362.35'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 24, 291007), 'value': Decimal('42362.35'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 33, 767991), 'value': Decimal('42362.45'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 35, 880753), 'value': Decimal('42362.60'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 40, 135887), 'value': Decimal('42362.60'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 44, 481802), 'value': Decimal('42362.75'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 46, 618369), 'value': Decimal('42362.95'), 'metric': 'LastPrice'}
{'contract_id': 514750, 'value_at': datetime.datetime(2021, 12, 4, 21, 59, 50, 894044), 'value': Decimal('42362.98'), 'metric': 'LastPrice'}

I was thinking about only checking the last 25 items of the list for an existing dictionary, since the list is updated every second (or faster). Due to DB write lag, a query may not pick up everything that was registered within an iteration, so I need to 'roll back' a few seconds (which may result in duplicates). Scanning the whole list on each iteration (it may hold a record for roughly every third of a second since midnight) could be a performance killer. Or are there more sophisticated methods?
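For illustration, a minimal sketch of that last-25 check (the window size and the names `rows` and `append_if_new` are placeholders, not existing code):

```python
# Check only the tail of the list before appending; dict equality handles the
# comparison, and slicing copies just `window` references, not the row data.
WINDOW = 25

def append_if_new(rows, new_row, window=WINDOW):
    """Append new_row unless an equal dict is already among the last `window` rows."""
    if new_row in rows[-window:]:
        return False
    rows.append(new_row)
    return True
```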

I have read that repeatedly appending to a DataFrame is slow (and calling the drop_duplicates method every time does not look like a solution either), so handing a complete list to the DataFrame constructor would be best. Therefore I think a list is the better choice. Some benchmarks/examples are discussed in 'Remove duplicate dict in list in Python' and 'python remove duplicate dictionaries from a list', but these usually target the list as a whole, not just a part of it.

Mat90
  • `if ... in set(data)` may be a good place to start – `set(data)` removes all duplicates. Or you could just check `list(set(data)) == data` (duplicates if False, none if True). – Larry the Llama Dec 05 '21 at 21:49
  • Can you use a dict of dicts instead of a list of dicts? Since it's a list of dicts, set() won't be able to hash the dicts, but I second that a set operation is a good approach. A dict of dicts would allow O(1) lookup times, using whatever values from your example qualify rows as duplicates as the keys of the dict of dicts. Before adding a new row, you would check that the key doesn't already exist, and proceed from there. – Eric Dec 05 '21 at 22:02
  • Thank you @Eric, I can change it to a dict of dicts (using the datetime as the key). Using itertools.islice I can get the last n items of the global dict (I assume this is fast, much like the good old days in C where jumping to an array element in sequential memory was just n * element size, although Python may organize it differently, in particular when element sizes are not constant) and use that smaller subset for the duplicate-key lookup, for speed reasons. – Mat90 Dec 24 '21 at 14:48
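A minimal sketch of the dict-of-dicts idea from the comments above; the key choice (a (contract_id, metric, value_at) tuple rather than the bare datetime) and all names are assumptions:

```python
from datetime import datetime
from decimal import Decimal

rows = {}  # rows keyed by the fields that make a row unique

def add_row(row):
    """Insert row unless its key is already present (O(1) hash lookup, no scan)."""
    key = (row['contract_id'], row['metric'], row['value_at'])
    if key in rows:
        return False
    rows[key] = row
    return True

sample = {'contract_id': 514750,
          'value_at': datetime(2021, 12, 4, 21, 59, 13, 586785),
          'value': Decimal('42362.34'),
          'metric': 'LastPrice'}
add_row(sample)                 # True
add_row(dict(sample))           # False: duplicate ignored

row_list = list(rows.values())  # plain list, e.g. for the DataFrame constructor
```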

1 Answer


The solution will depend on what you consider to be the 'unique key' in your list. If it is merely the 'value_at' timestamp and the database adds rows in timestamp order, then you could keep track of the high water mark (the highest 'value_at') obtained from the previous query and use it as the criterion for the next query.
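A rough sketch of that high-water-mark idea, under the assumption that duplicates are identified by 'value_at' alone; the names and the module-level state are illustrative:

```python
from datetime import datetime

high_water_mark = datetime.min   # highest 'value_at' stored so far
rows = []

def add_new_rows(queried_rows):
    """Append only rows strictly newer than the current high water mark."""
    global high_water_mark
    fresh = [r for r in queried_rows if r['value_at'] > high_water_mark]
    if fresh:
        rows.extend(fresh)
        high_water_mark = max(r['value_at'] for r in fresh)
    return fresh
```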

Another strategy could be to remove from the list the rows that will match your next query's criteria before running that query (assuming the query will return them again).
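For example, with a fixed rollback window this could look roughly like the following; `fetch_since` stands in for the actual DB query and the 5-second window is an assumption:

```python
from datetime import timedelta

ROLLBACK = timedelta(seconds=5)   # assumed rollback window

def refresh(rows, fetch_since, now):
    """Drop the overlap the next query will return again, then re-add its result."""
    query_from = now - ROLLBACK
    rows[:] = [r for r in rows if r['value_at'] < query_from]   # remove overlap in place
    rows.extend(fetch_since(query_from))                        # rows returned by the query
    return rows
```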

A simple brute-force solution would be to build a set from the overlapping data you already hold for the range the next query covers, and to filter the newly queried rows against that set.
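A sketch of that set-based filter; since dicts are not hashable, a tuple of the identifying fields (an assumption about what makes a row unique) serves as the key:

```python
def key_of(row):
    # tuple of the fields assumed to identify a row uniquely
    return (row['contract_id'], row['metric'], row['value_at'])

def append_without_duplicates(rows, queried_rows, overlap_start):
    """Filter freshly queried rows against the overlap already held in `rows`."""
    seen = {key_of(r) for r in rows if r['value_at'] >= overlap_start}
    rows.extend(r for r in queried_rows if key_of(r) not in seen)
    return rows
```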

Alain T.