0

I'm currently using Python 2.6. I need to write a script that reads a 'master' csv file and then matches the entries in a second csv file against the master to determine their validity. The master and secondary csv files have the same number of columns with similar values in each. I'm trying to loop through each entry in the secondary csv file and then match them against every entry in the master csv. If the given entry in the secondary csv file matches any of the entries in the master csv, then the entry will be considered valid.

The master csv file looks something like this:

ID_A,ColumnB,ID_C,ColumnD
1,text,0,words
1,text,1,words
2,text,A,words
3,text,CC,words

Where the 'ID' values are driving the validation process and the 'Column' values are auxiliary. First, I need to get this master csv into memory so I can compare entries from a secondary csv against it. To do this, I attempted to read the csv into a dictionary. I then looped through each row, but could only really figure out how to print the values.

with open ('master.csv') as csvfile:
     masterReader = csv.DictReader(csvfile)
     for row in masterReader:
         print(row['ID_A'], row['ID_C'])

Instead of just reading and printing these files I need to figure out a way to store them in memory so I can compare them against entries in the secondary csv, which looks like this:

ColumnA,ColumnB,ID_C,ID_D
text,words,160,7
text,words,250,BB
text,words,1,0
text,words,15,A

Where ID_C is compared against master-ID_A and ID_D is compared against master-ID_C. I think it would be best to test against master-ID_A first, because if there is no match there, it is useless to test against master-ID_C.

I tried using methods from another post I found here and (comparing varied CSV files in python), but couldn't seem to get the results I wanted.

I'd like to make one class with two separate functions that will read a master csv and then validate entries in a secondary csv based on input ID values. I also want to be able to change the input master (with same format) and secondary csv so the script can be used on multiple datasets. When the secondary entries are validated, I'd like to see (ID_C,ID_D,Valid).

I hope this makes sense, I've been wrestling with it all night. Let me know if I can clarify anything here

Community
  • 1
  • 1

2 Answers2

0

So looking at your 2 problems parts.
Read master into memory. Use a dictionary comprehension to reader the whole master file in with a key of row['ID_A'], e.g.:

with open('master.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    master = {row['ID_A']: row for row in reader}

And the second problem how to compare the secondary csv against this master list, again using a list comprehension with a guard is pretty simple:

with open('secondary.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    result = [(row['ID_C'], row['ID_D'], 'Valid') for row in reader 
              if row['ID_C'] in master and master['ID_C']['ID_C'] == row['ID_D']]

The guard checks that the master dictionary key (ID_A) matches the ID_C and if a key exists check master:ID_C == ID_D and only if both of these are True and it to the result list.

Printing out the result will list all the ID_C and ID_D that match ID_A and ID_C from the master list.

I'll leave it as an exercise for the reader to put these into classes.

AChampion
  • 29,683
  • 4
  • 59
  • 75
0

You could read the master file to a set of (ID_A, ID_C) tuples and when you're validating just check if tuple (ID_C, ID_C) exists in there:

import csv

class Validator(object):
    def read_master(self, master):
        with open(master) as f:
            reader = csv.DictReader(f)
            self.master = set((row['ID_A'], row['ID_C']) for row in reader)

    def validate(self, csv_file):
        with open(csv_file) as f:
            reader = csv.DictReader(f)
            keys = ((row['ID_C'], row['ID_D']) for row in reader)
            return [list(key) + [key in self.master] for key in keys]

    def validate_key(self, id_c, id_d):
        return (id_c, id_d) in self.master

Usage:

from pprint import pprint
v = Validator()
v.read_master('master.csv')
pprint(v.validate('other.csv'))
print v.validate_key('1', '1')
print v.validate_key('1', '2')

Output:

[['160', '7', False],
 ['250', 'BB', False],
 ['1', '0', True],
 ['15', 'A', False]]
True
False
niemmi
  • 17,113
  • 7
  • 35
  • 42
  • Thanks for the awesome answer! I have a question on how to tweak the second function slightly. What if instead of inputting a secondary csv to validate all entries, I simply want to pass in individual values 'ID_C' and 'ID_D'? Would it be like this: `def validate(self,ID_C,ID_D): keys = (ID_C, ID_D) return [list(key) + [key in self.master] for key in keys]` ? – lastLeafFallen Jul 19 '16 at 15:35
  • @lastLeafFallen I added example on how to validate single key. – niemmi Jul 19 '16 at 18:11
  • Fantastic, I'm trying to implement now and I keep getting a syntax error in this line: `self.master = {(row['ID_A'], row['ID_C']) for row in reader}` My debugger keeps indicating the syntax error is between the 'o' and 'r' of the word "for". Any idea why? – lastLeafFallen Jul 19 '16 at 22:09
  • It's because I'm running Python 2.6 instead of 2.7 XD – lastLeafFallen Jul 19 '16 at 22:19
  • @lastLeafFallen Luckily that's easy to fix, I changed set comprehension to `set` builtin & generator expression that should work on 2.6. Unfortunately I don't have 2.6 installed so I can't test it myself. – niemmi Jul 20 '16 at 05:45