0

I have a relatively large csv file containing a list of companies, products, and prices. The ordering of the data is not guaranteed (i.e. not sorted):

csv#1 (big file)        
... 
CompanyA     productB     0
CompanyA     productA     0
CompanyA     productC     0
CompanyB     productA     0
CompanyB     productB     0
CompanyB     productC     0
...     

Some of the entries in "csv#1" have bad data (zeroes). I have a second csv containing only the names from csv#1 that had bad data (and their corrected data). The ordering of this csv is by descending price:

csv#2 (small file - subset of csv#1)        
CompanyA     productC     15
CompanyA     productB     10
CompanyA     productA     5
CompanyB     productA     3
CompanyB     productB     2
CompanyB     productC     1

I want to iterate through csv#1 and if the combination of Company + product is in csv#2, overwrite with csv#2 price.

I know I can do this by brute force, iterating over csv#2 for every row in csv#1. I could even optimize by loading csv#2 into an array and removing entries once they are found (each combination will show up exactly once in csv#1). But I am certain there must be a better way.

I found some references indicating that sets are a more efficient way to do these kinds of lookup searches:

Most efficient way for a lookup/search in a huge list (python)

Fastest way to search a list in python

But I am not sure how to apply sets to my example. How to I structure a set here, given the multiple search columns, and the need to return a value if there is a match? Or is there a better approach than sets?

Community
  • 1
  • 1
Roberto
  • 2,054
  • 4
  • 31
  • 46
  • Your example data appears to be sorted. Is that the case in your actual data? – femtoRgon Nov 06 '13 at 17:02
  • @femtoRgon - Sorry if my simplified data is misleading. The data is sorted, but by descending price, not by any of the name fields. I will revise the question to show this more clearly. – Roberto Nov 06 '13 at 17:05

2 Answers2

2

Since you could technically associate a key with a value, why not use a dictionary? It has constant lookup time O(1) instead of O(N) for a list. It is similar to a set except for the concept of key value pair.

csv1_dict = {  ...,
            "CompanyA productA" : 0,
            "CompanyA productB" : 0,
            ...
            }

csv2_dict = { ...,
            "CompanyA productA" : 10,
              ...
            }
for key,value in csv2_dict.iteritems():
    try:
        csv1_dict[key] = csv2_dict[key]
    except:
        #Key not in csv1

If you can guarantee that Company products in csv2 are in csv1, feel free to remove the try block.

C.B.
  • 8,096
  • 5
  • 20
  • 34
  • Would I iterate though the csv files to build the dicionary? Do I physically join "CompanyA productA" as a string to make the key? Do I really need a dict for csv1? If I need to iterate through it to build the dict, maybe I could just use that iteration to do the comparison (once I have a dict in place for csv2)? – Roberto Nov 06 '13 at 17:31
  • You would have to iterate through the csv regardless, so yes you could just load csv2 as a dict and then when iterating through csv1, just insert a try block and attempt to write the new value to your output, and in the except write the original value. As for a key, it can be whatever you want it to be, as long as it is unique. – C.B. Nov 06 '13 at 17:40
  • I think dict lookup takes O(lg(N)) not O(1), because it uses a binary search on hash codes – saeedgnu Nov 06 '13 at 19:16
  • Looks like the [python wiki](https://wiki.python.org/moin/TimeComplexity) lists it as O(N) worst case assuming you have some implementation of python with a bad hash algorithm for dict that results in chaining, but shows O(1) on average (which is what a good hash function should have). – C.B. Nov 06 '13 at 19:33
  • iteritems is now `items()` in python3 – Timo Feb 02 '23 at 20:40
1

I would suggest loading csv#2 into a dictionary which is actually a hash table and queries are fast

Set is also a hash table without values, but you have values here

The keys of dict are tuples of (companyName, productName) , and values are prices

Then iterate over csv#1 and check if the correction dict has the key for that company name (use has_key, or simple get the key in try ... except block) and if there was, do the fix using associated price value

saeedgnu
  • 4,110
  • 2
  • 31
  • 48
  • I don't think company names will work as keys, because they are not unique. Can I use a tuple as a key (company, product)? Or would I need to join company+product into a key string? – Roberto Nov 06 '13 at 17:28
  • Yes, Tuples are okay for using as keys. (FYI, Lists are not usable as keys because list is not hashable) – saeedgnu Nov 06 '13 at 19:13
  • I updated the answer (sorry now I realize that price for a company doesn't make sense, I didn't notice the meaning of names, just saw the logic! :D ) – saeedgnu Nov 06 '13 at 19:15