
I'm trying to update a CSV file with some student figures provided by other sources; however, they've formatted their CSV data slightly differently to ours.

It needs to match students based on three criteria: their name, their class, and finally the first few letters of the location. So, for example, the first few students from Class B are from "Dumpt", which is actually "Dumpton Park".

When matches are found:

  • If a student's Scorecard in CSV 2 is 0 or blank, it shouldn't update the Score column in CSV 1
  • If a student's Number in CSV 2 is 0 or blank, it shouldn't update the No column in CSV 1
  • Otherwise, the numbers from CSV 2 should be imported into CSV 1
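In other words, for each matched student the rule is roughly this (a sketch only, treating the CSV values as raw strings; column names follow the example data below):

```python
# Sketch of the update rule for one matched pair of rows.

def usable(value):
    """A CSV 2 value should be copied only if it is neither blank nor 0."""
    return value.strip() not in ('', '0')

def merge_row(row1, row2):
    """Apply CSV 2's Scorecard/Number to CSV 1's Score/No in place."""
    if usable(row2['Scorecard']):
        row1['Score'] = row2['Scorecard']
    if usable(row2['Number']):
        row1['No'] = row2['Number']
    return row1
```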

Below is some example data:

CSV 1

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,

CSV 2

"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"

CSV 1 UPDATED (This is the desired output)

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,

I would really appreciate any help with this problem. Thanks, Oliver

OliverSteph

6 Answers


Here are two solutions: a pandas solution and a plain Python solution. First, the pandas solution, which unsurprisingly looks a whole lot like the other pandas answers...

First load in the data

import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset to the original column order

At this point the data frames will look like

In [6]: cdf1
Out[6]: 
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
Out[7]: 
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0

Next manipulate both the data frames into matching formats.

dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])

Now cdf1 and cdf2 look like

In [16]: cdf1
Out[16]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
Out[17]: 
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN

Updating the data in cdf1 with the data in cdf2

cdf1.update(cdf2, overwrite=False)

results in

In [19]: cdf1
Out[19]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN

Finally, return cdf1 to its original form and write it to a csv file.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)

Two notes. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) etc., I'd strongly recommend adding some checksumming to make sure that, when shifting from Location to the first few letters of Location, you aren't accidentally collapsing two locations into one. Second, I sincerely hope there is a typo on line 4 of your desired output (the Sam row shows 653, but CSV 2 gives Sam 931).
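As a concrete version of that check (a minimal sketch using the sample locations), the distinct count should be unchanged after truncating to 4 letters; if it shrinks, two locations collapsed onto the same prefix:

```python
import pandas as pd

locations = pd.Series(['York', 'Dumpton Park', 'Dover', 'London'])
if locations.nunique() != locations.str[:4].nunique():
    print("Warning: two locations share a 4-letter prefix")
```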

On to a plain Python solution. In the following, adjust the filenames as needed.

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.replace('\"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print("Line not in csv1: %s" % line)

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
csvout.write('\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Chris H.

You could use fuzzywuzzy to do the matching of town names, and append as a column to df2:

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
Out[12]: 
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London

Make the column names consistent (at the moment the same column is called Student in one frame and Name in the other):

In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)

Now you can merge (on the overlapping columns):

In [14]: res = df1.merge(df2, how='outer')

In [15]: res
Out[15]: 
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0

One bit to clean up is the Score, I think I would take the max of the two:

In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']

Then you're left with the columns you want:

In [18]: res
Out[18]: 
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [18]: res.to_csv('foo.csv')

Note: to force the dtype to object (and have mixed dtypes, ints and floats, rather than all floats) you can use an apply. I would recommend against this if you're doing any analysis!

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
Andy Hayden
  • Looks pretty good. I've tested it out a bit and have found a little error: the scores should be written/read as strings or objects or something (which I don't know how to do); basically, the scores that are integers should stay as integers, whereas the scores that are floats should stay as floats... Any ideas? – OliverSteph Nov 15 '13 at 21:55
  • So pandas like to keep columns of all the same dtype (so if you have some floats and some ints, it'll upcast to floats). You can pass a float_format argument to to_csv, but I'm not sure what you would choose to get int-like floats to print as ints... may be worth asking that aspect as a separate question. http://docs.python.org/2/library/string.html#format-specification-mini-language – Andy Hayden Nov 15 '13 at 22:35
  • @AndyHayden I personally wouldn't recommend float_format as it applies to all columns which is a nuisance and for some people just will not suffice. – Ryflex Nov 15 '13 at 22:47
  • @Hyflex I thought you could pass it a dict of columns? Maybe that's just a feature request... – Andy Hayden Nov 15 '13 at 23:03
  • @AndyHayden I wasn't aware of a way to do that... know exactly how? – Ryflex Nov 15 '13 at 23:29
  • @Hyflex feature request https://github.com/pydata/pandas/issues/4668#issuecomment-27423380 – Andy Hayden Nov 16 '13 at 00:39
  • @AndyHayden As Hyflex mentioned about the dtypes in my data the scores should be kept as integers (if they are) not upcast into floats and keeping the floats as floats, how could I pass it as a dict of columns? – OliverSteph Nov 16 '13 at 21:12
  • @OliverSteph as mentioned that's not really how pandas works (it wants to upcast, so calculations are faster). The passing dict to float_format hasn't been done yet, but I've added a workaround. – Andy Hayden Nov 16 '13 at 21:32

Hopefully this code is a bit more readable. ;) The backport for Python's new Enum type is the enum34 package on PyPI.

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
Ethan Furman
  • Works pretty good, however for some reason it keeps reordering them and I can't work out why, any ideas? – OliverSteph Nov 15 '13 at 21:56
  • 1
    @OliverSteph: Two reasons: 1) `dict`s do not preserve order; 2) I'm sorting the data before I print. If you want to maintain the order of your original file you can use an `OrderedDict` instead -- it's in the `collections` module. – Ethan Furman Nov 16 '13 at 04:21
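Concretely, the `OrderedDict` change suggested in the comment above might look like this (a sketch of just the affected pieces, with hypothetical helper names; plain dicts also preserve insertion order on Python 3.7+):

```python
from collections import OrderedDict

def load_in_order(lines):
    """Like load_our_data, but remembers the order rows were inserted."""
    data = OrderedDict()
    for line in lines:
        fields = line.strip('\n').split(',')
        key = (fields[0].lower(), fields[1][:4].lower(), fields[2].lower())
        data[key] = fields
    return data

def rows_in_order(data):
    # iterate in insertion order instead of sorted(data.values())
    return [','.join(rec) for rec in data.values()]
```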

Python dictionaries are the way to go here:

studentDict = {}

with open(<csv1>, 'r') as f:
  header = f.readline()  # keep the header for the output file
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    # key on class, the first four letters of the location, and name
    studentDict[LL[0], LL[1][:4], LL[2]] = LL

with open(<csv2>, 'r') as f:
  f.readline()  # skip the header
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    key = (LL[0], LL[1][:4], LL[2])
    if key in studentDict:
      if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
      if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]

with open(<outFile>, 'w') as f:
  f.write(header)
  for row in studentDict.values():
    f.write(','.join(row) + '\n')
metasequoia
  • Thanks for your reply. That is possibly the most confusing code I've ever come across; it works, but everything is named in such a confusing way that I'm unable to get it working for my real data. – OliverSteph Nov 12 '13 at 14:20
  • This code would be more readable with a record class that could keep track of these fields. `collections.namedtuple` seems fitting. – IceArdor Nov 20 '13 at 03:58
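For instance, a namedtuple record along the lines the comment suggests might look like this (a sketch; field names taken from the CSV 1 header, and records are immutable, so updates go through `_replace()`):

```python
from collections import namedtuple

Student = namedtuple('Student',
                     ['Class', 'Local', 'Name', 'DPE', 'JJK', 'Score', 'No'])

row = Student(*'Class A,York,Tom,x,x,32,'.split(','))
updated = row._replace(No='653')
```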

pandas makes this sort of task a bit more convenient.

EDIT: Okay, since you can't rely on renaming columns manually, Roman's suggestion to just match on the first few letters is a good one. We have to change a couple of things before that, though.

In [62]: df1 = pd.read_clipboard(sep=',')

In [63]: df2 = pd.read_clipboard(sep=',')

In [68]: df1
Out[68]: 
     Class Location Student  Scorecard  Number
0  Class A     York     Jim        0.0     742
1  Class A     York     Sam        0.0     931
2  Class A     York     Tom        0.0     653
3  Class B    Dumpt     Bob       23.1     299
4  Class B    Dumpt    Bill       23.4     198
5  Class B    Dumpt   Sarah       23.5      12
6  Class A    Dover    Andy       23.0     983
7  Class A    Dover  Hannah        1.0     293
8  Class B     Lond   Jemma       32.2       0
9  Class B     Lond   James       32.0       0

In [69]: df2
Out[69]: 
     Class         Local    Name DPE JJK  Score   No
0  Class A          York     Tom   x   x   32.0  653
1  Class A          York     Jim   x   x   10.0  742
2  Class A          York     Sam   x   x   32.0  653
3  Class B  Dumpton Park   Sarah   x   x   23.5   12
4  Class B  Dumpton Park     Bob   x   x   23.1  299
5  Class B  Dumpton Park    Bill   x   x   23.4  198
6  Class A         Dover    Andy   x   x   23.0  983
7  Class A         Dover  Hannah   x   x    1.0  293
8  Class B        London   Jemma   x   x   32.2  NaN
9  Class B        London   James   x   x   32.0  NaN

Get the columns named the same.

In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})

Now for the locations. Save the originals in df2 to a separate Series.

In [71]: locations = df2['Local']

In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)

In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)

Use the string methods to truncate to the first 4 (assuming this won't cause any false matches).

Now set the indices:

In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])

In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])

In [80]: df1
Out[80]: 
                      Score   No
Class   Local Name              
Class A York  Jim       0.0  742
              Sam       0.0  931
              Tom       0.0  653
Class B Dump  Bob      23.1  299
              Bill     23.4  198
              Sarah    23.5   12
Class A Dove  Andy     23.0  983
              Hannah    1.0  293
Class B Lond  Jemma    32.2    0
              James    32.0    0

In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)

Finally, update the scores as before:

In [85]: df1.update(df2, overwrite=False)

You can get the original locations back (note that this relies on both files listing the towns in the same row order) by doing:

In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations

And you can write the output to CSV (and a bunch of other formats) with df1.to_csv('path/to/csv')

TomAugspurger
  • Thanks for replying, do you know a way I can do this without having to rename all of my locations (There are far too many locations to do), simply matching by the first few characters in the location column would suffice. – OliverSteph Nov 12 '13 at 14:21
  • I'd look into the `str` methods on DataFrames. Loop through the each of the unique locations, check for matches (the first 3 letters from the other frame) and replace when there's a match. May be worth asking a separate questions about that specifically. – TomAugspurger Nov 12 '13 at 14:53
  • @OliverSteph may be you can just cut column names to 4 letters in both DataFrames – Roman Pekar Nov 14 '13 at 19:29
  • 1
    @OliverSteph I updated with a workaround for matching the locations based on Roman's suggestion. – TomAugspurger Nov 14 '13 at 20:59
  • @TomAugspurger Looks good, thanks for editing. However, as mentioned to AndyHayden, the scores should be kept as they were (integers stay as integers, floats stay as floats)... could I use: dtype={'Score' : 'object'} as the data doesn't need to be looked at again... – OliverSteph Nov 16 '13 at 21:13

You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I used the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to save the order).

If you use the csv module, you don't have to worry too much about the quotes, and it also allows you to read the rows directly into dictionaries with the column names as keys.

import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                            'DPE', 'JJK', 'Score', 'No'])
    next(table1)  # skip header row
    first_table = {}
    original_order = [] #list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', newline='') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                            'Student', 'Scorecard', 'Number'])
    next(table2)  # skip header row
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'w', newline='') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                             'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])

Hope this helps.

ConfusedByCode