I'm currently using Python 2.6. I need to write a script that reads a 'master' csv file and then matches the entries in a second csv file against the master to determine their validity. The master and secondary csv files have the same number of columns with similar values in each. I'm trying to loop through each entry in the secondary csv file and then match them against every entry in the master csv. If the given entry in the secondary csv file matches any of the entries in the master csv, then the entry will be considered valid.
The master csv file looks something like this:
ID_A,ColumnB,ID_C,ColumnD
1,text,0,words
1,text,1,words
2,text,A,words
3,text,CC,words
Where the 'ID' values are driving the validation process and the 'Column' values are auxiliary. First, I need to get this master csv into memory so I can compare entries from a secondary csv against it. To do this, I attempted to read the csv into a dictionary. I then looped through each row, but could only really figure out how to print the values.
import csv

# Python 2's csv module expects the file to be opened in binary mode.
with open('master.csv', 'rb') as csvfile:
    masterReader = csv.DictReader(csvfile)
    for row in masterReader:
        print(row['ID_A'], row['ID_C'])
Instead of just reading and printing these files I need to figure out a way to store them in memory so I can compare them against entries in the secondary csv, which looks like this:
ColumnA,ColumnB,ID_C,ID_D
text,words,160,7
text,words,250,BB
text,words,1,0
text,words,15,A
Where ID_C is compared against master-ID_A and ID_D is compared against master-ID_C. I think it would be best to test against master-ID_A first, because if there is no match there, it is useless to test against master-ID_C.
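To make the matching rule concrete, here is a rough sketch of the two-step check described above. It assumes that "valid" means both IDs match within the same master row, which may not be the intended rule. The inline lists stand in for the two files, and the names master_ids, master_pairs, and results are illustrative only.

```python
import csv

# Inline sample rows stand in for the two files; csv.DictReader accepts any
# iterable of lines, so an open file object would work the same way.
master_lines = [
    "ID_A,ColumnB,ID_C,ColumnD",
    "1,text,0,words",
    "1,text,1,words",
    "2,text,A,words",
    "3,text,CC,words",
]
secondary_lines = [
    "ColumnA,ColumnB,ID_C,ID_D",
    "text,words,160,7",
    "text,words,250,BB",
    "text,words,1,0",
    "text,words,15,A",
]

master_ids = set()    # every ID_A value seen in the master
master_pairs = set()  # every (ID_A, ID_C) pair seen in the master
for row in csv.DictReader(master_lines):
    master_ids.add(row['ID_A'])
    master_pairs.add((row['ID_A'], row['ID_C']))

results = []
for row in csv.DictReader(secondary_lines):
    first, second = row['ID_C'], row['ID_D']
    # Test the first ID before the pair; 'and' short-circuits, so the pair
    # lookup is skipped when the first ID is not in the master at all.
    # ASSUMPTION: both IDs must come from the same master row.
    valid = first in master_ids and (first, second) in master_pairs
    results.append((first, second, valid))

for entry in results:
    print(entry)
```

With the sample data above, only the entry with ID_C = 1 and ID_D = 0 comes out as valid. Because set lookups are constant time on average, the pair check alone would give the same answer; the separate first-ID test is kept only to mirror the order described above.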
I tried using methods from another post I found here (comparing varied CSV files in python), but couldn't get the results I wanted.
I'd like to make one class with two separate functions that will read a master csv and then validate entries in a secondary csv based on input ID values. I also want to be able to change the input master (with same format) and secondary csv so the script can be used on multiple datasets. When the secondary entries are validated, I'd like to see (ID_C,ID_D,Valid).
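The structure I have in mind might look roughly like the sketch below. The class name CsvValidator, the method names load_master and validate, and the constructor parameters are all hypothetical, and it again assumes that a valid entry must match both IDs in the same master row. Both methods take an open file (or any iterable of lines) rather than a path, so the master and secondary inputs can be swapped freely.

```python
import csv


class CsvValidator(object):
    """Hypothetical helper: load a master CSV, then validate a secondary CSV."""

    def __init__(self, master_cols=('ID_A', 'ID_C'),
                 secondary_cols=('ID_C', 'ID_D')):
        # Column names are parameters so other datasets can reuse the class.
        self.master_cols = master_cols
        self.secondary_cols = secondary_cols
        self.master_ids = set()
        self.master_pairs = set()

    def load_master(self, lines):
        """Read master rows from an open file or any iterable of lines."""
        first_col, second_col = self.master_cols
        self.master_ids = set()
        self.master_pairs = set()
        for row in csv.DictReader(lines):
            self.master_ids.add(row[first_col])
            self.master_pairs.add((row[first_col], row[second_col]))

    def validate(self, lines):
        """Return a list of (first_id, second_id, valid), one per row."""
        first_col, second_col = self.secondary_cols
        results = []
        for row in csv.DictReader(lines):
            first, second = row[first_col], row[second_col]
            # ASSUMPTION: both IDs must match within one master row.
            valid = (first in self.master_ids
                     and (first, second) in self.master_pairs)
            results.append((first, second, valid))
        return results


# Small in-memory example; real use would pass open file objects instead.
validator = CsvValidator()
validator.load_master(["ID_A,ColumnB,ID_C,ColumnD", "1,text,0,words"])
checked = validator.validate(["ColumnA,ColumnB,ID_C,ID_D",
                              "text,words,1,0",
                              "text,words,15,A"])
print(checked)
```

With real files under Python 2.6, each file would be opened with open(path, 'rb') and the file object passed to load_master or validate. Calling load_master again replaces the stored master, so one validator instance can be reused across datasets.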
I hope this makes sense; I've been wrestling with it all night. Let me know if I can clarify anything here.