I have csv files spread across multiple directories, and each csv file has only one column of data. What I want to do is read all these files and bring each file's column into one csv file. The final csv file will have one column per input file, with the filename as the header and the data from the original file as that column's data.
This is my directory structure inside ~/csv_files/ (output of ls):
ab arc bat-smg bn cdo crh diq es fo gd haw ia iu ki ksh lez lv mo na no os pih rmy sah simple ss tet tr ur war zea
ace arz bcl bo ce cs dsb et fr gl he id ja kk ku lg map-bms mr nah nov pa pl rn sc sk st tg ts uz wo zh
af as
Each directory has two csv files. I thought of using the os.walk() function, but I think my understanding of os.walk is incorrect, and that's why my current code doesn't produce anything.
    import sys, os
    import csv

    root_path = os.path.expanduser('~/data/missing_files')

    def combine_csv_files(path):
        for root, dirs, files in os.walk(path):
            for dir in dirs:
                for name in files:
                    if name.endswith(".csv"):
                        csv_path = os.path.expanduser(root_path + name)
                        if os.path.exists(csv_path):
                            try:
                                with open(csv_path, 'rb') as f:
                                    t = f.read().splitlines()
                                    print t
                            except IOError, e:
                                print e

    def main():
        combine_csv_files(root_path)

    if __name__=="__main__":
        main()
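For context on where my understanding may be off: as far as I can tell, os.walk already pairs each visited directory (root) with the files directly inside it, so the extra loop over dirs may not be needed, and the full path would come from os.path.join(root, name) rather than concatenating onto root_path. A minimal sketch of what I think the traversal should look like (just my assumption, please correct me if this is wrong):

```python
import os

def list_csv_files(path):
    """Collect the full path of every .csv file under path.

    os.walk yields (root, dirs, files) for each directory it visits;
    files are the names directly inside root, so the full path is
    os.path.join(root, name) -- no separate loop over dirs needed.
    """
    found = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".csv"):
                found.append(os.path.join(root, name))
    return found
```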
My questions are:
- What am I doing wrong here?
- Can I read one csv column from a file and add that data as a new column to another file? csv files are normally row-oriented, but here there is no dependency between the rows.
At the end I am trying to get a csv file like this (here are the potential headers):
ab_csv_data_file1, ab_csv_data_file2, arc_csv_data_file1, arc_csv_data_file2
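To make the target concrete, here is a minimal sketch of the combining step I have in mind. The header scheme (parent directory name plus file stem, e.g. ab_file1 for ab/file1.csv) is just my assumption, and since the rows of different files are independent, shorter columns are padded with empty strings via itertools.zip_longest:

```python
import csv
import os
from itertools import zip_longest

def combine_columns(csv_paths, out_path):
    """Write one csv whose columns are the single columns of the
    input files, with one header per file.

    Header format (an assumption): parent directory name + '_' +
    file name without its extension.
    """
    headers = []
    columns = []
    for path in csv_paths:
        with open(path, newline='') as f:
            columns.append([row[0] for row in csv.reader(f) if row])
        parent = os.path.basename(os.path.dirname(path))
        stem = os.path.splitext(os.path.basename(path))[0]
        headers.append(parent + '_' + stem)
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(headers)
        # zip_longest pads shorter columns with '', since the rows
        # of different files have no dependency on each other
        for row in zip_longest(*columns, fillvalue=''):
            writer.writerow(row)
```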