CSV joining based on keys

Question

This may be a simple/repeat question, but I could find/figure out yet how to do it.

I have two csv files:

info.csv:

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076

abc, xyz, 1234, 982-128-0000, pqt,

bcd, uvw, 3124, 813-222-1111, tre, 

poi, ccc, 9087, 123-45607890, weq,

and then

age.csv:

student_id,age_1

3124,20

9087,21

1234,45

I want to compare the two csv files, based on the columns "id" from info.csv and "student_id" from age.csv and take the corresponding "age_1" data and put it into the "age" column in info.csv.

So the final output should be:

info.csv:

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
 abc, xyz, 1234, 982-128-0000, pqt,45
 bcd, uvw, 3124, 813-222-1111, tre,20
 poi, ccc, 9087, 123-45607890, weq,21

I am able to simply join the tables based on the keys into a new.csv, but can't put the data in the columns titles "age". I used "csvkit" to do that.

Here is what I used:

csvjoin -c 3,1 info.csv age.csv > new.csv

Can you post an example of your code? – alexbclay Sep 16 '16 at 16:31 — alexbclay, Sep 16 '16 at 16:31

Daniel · Answer 1 · 2016-09-16T18:42:40.820

3

You can use Pandas and update the info dataframe using the age data. You do it by setting the index of both data frames to ID and student_id respectively, then update the age column in the info dataframe. After that you reset the index so ID becomes a column again.

from StringIO import StringIO
import pandas as pd

info = StringIO("""Last Name,First Name,ID,phone,adress,age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre, 
poi, ccc, 9087, 123-45607890, weq,""")


age = StringIO("""student_id,age_1
3124,20
9087,21
1234,45""")

info_df = pd.read_csv(info, sep=",", engine='python')
age_df = pd.read_csv(age, sep=",", engine='python')

info_df = info_df.set_index('ID')
age_df = age_df.set_index('student_id')
info_df['age X [Total age: 100] |009076'].update(age_df.age_1)
info_df.reset_index(level=0, inplace=True)
info_df

outputs:

    ID      Last Name   First Name      phone           adress  age X [Total age: 100] |009076
0   1234    abc         xyz              982-128-0000   pqt     45
1   3124    bcd         uvw              813-222-1111   tre     20
2   9087    poi         ccc              123-45607890   weq     21

edited Sep 16 '16 at 18:42

answered Sep 16 '16 at 17:29

Daniel

5,095
5
35
48

I can't get this to work, cause in my actual CSV, the last column header have space in it. It is like "age students|all". So when I replace this line of your code: info_df.age.update(age_df.age_1) with the actual header, it gives me syntax error. – user3285014 Sep 16 '16 at 18:18
Would you mind sharing the actual header as well as the syntax error you are getting. Two things, you can rename the header in the csv file or just get the column names using `df.columns` to see the actual names of the columns. – Daniel Sep 16 '16 at 18:20
This is the column name that needs to be updated: u'age X [Total age: 100] |009076' – user3285014 Sep 16 '16 at 18:27
1

That's a weird name for a column. Why dont rename the column of the csv to something more common? – Daniel Sep 16 '16 at 18:33
I can't rename the names of the columns, as it won't be uploaded back to ther server again. The headers are some sort of identifier for the uploading back to the server. I did try this: df1_key.columns[9].update(df2_key.columns[1]) AttributeError: 'str' object has no attribute 'update' – user3285014 Sep 16 '16 at 18:36
Just update your question with the actual column names and it will be easier to update the answer. Im a bit confused by it – Daniel Sep 16 '16 at 18:37
Updated the Column names – user3285014 Sep 16 '16 at 18:55
@user3285014 Its is working with the new age column name – Daniel Sep 16 '16 at 20:14
Thanks a lot...:-) – user3285014 Sep 16 '16 at 21:11

Aidenhjj · Accepted Answer · 2016-09-16T17:41:53.397

Try this...

import csv

info = list(csv.reader(open("info.csv", 'rb')))
age = list(csv.reader(open("age.csv", 'rb')))

def copyCSV(age, info, outFileName = 'out.csv'):
    # put age into dict, indexed by ID
    # assumes no duplicate entries

    # 1 - build a dict ageDict to represent data
    ageDict = dict([(entry[0].replace(' ',''), entry[1]) for entry in age[1:] if entry != []])

    # 2 - setup output
    with open(outFileName, 'wb') as outFile:
        outwriter = csv.writer(outFile)
        # 3 - run through info and slot in ages and write to output
        # nb: had to use .replace(' ','') to strip out whitespaces - these may not be in original .csv
        outwriter.writerow(info[0])
        for entry in info[1:]:
            if entry != []:
                key = entry[2].replace(' ','')
                if key in ageDict: # checks that you have data from age.csv
                    entry[5] = ageDict[key]
            outwriter.writerow(entry)

copyCSV(age, info)

Let me know if it works or if anything is unclear. I've used a dict because it should be faster if your files are massive, as you only have to loop through the data in age.csv once.

There may be a simpler way / something already implemented...but this should do the trick.

CSV joining based on keys

2 Answers2